[ZKLLVM] Trustless ZK-proof TVL oracle

Overview

The Lido Accounting Oracle contract is trustful - it relies on an oracle committee (9 members at the time of writing) of trusted third parties and a quorum mechanism (5 out of 9) to maintain its state. This constitutes a potential attack vector: if a majority of the oracle committee members are hacked or malicious, an attacker will be able to submit a false value. Such an attack would directly endanger funds in DeFi pools (such as the ~$1B Curve pool), undermine trust in Lido protocol security, and cause other reputational and financial losses.

In the past, our team has shown it is possible to compute one of the key parts of the accounting oracle report - total value locked - in a trustless manner, secured by a ZK-proof. Compared to the legacy Lido Oracle, the Accounting oracle carries multiple additional responsibilities, including reporting active and exited validator counts, withdrawal and rewards balances, and more.

Proposal

We propose using zkLLVM to build an additional sanity/correctness check for the Accounting oracle, thus mitigating the risk outlined above.

Short-term (3-4 weeks):

  • Report Lido’s total value locked, active and exited validator counts to a dedicated Execution Layer contract in a verifiable, trustless manner.
  • Use the new contract as an additional correctness check for the Accounting oracle report.

Long-term (4+ months):

  • Expand the solution to cover other parts of the Accounting oracle report.

We are asking for the following commitments from the Lido DAO:

  • Grant. We estimate the overall costs of developing the first production iteration of the solution (“mid-term” above) to be around 50K USD (to be transferred to this address).
  • Support necessary audits for the smart-contract(s) and oracle(s) to be built.
  • Cover the additional expenses for oracle operators - tentatively <$1K a month.

Solution

zkLLVM technology overview

At a high level, the zkLLVM technology stack used in this proposal consists of the following components:

zkLLVM compiler and development toolchain - the compiler allows building zk-SNARK or zk-STARK circuits using mainstream programming languages - currently C++ and (partially) Rust, with more languages coming in the future. Other toolchain utilities enable a range of related activities, such as generating proofs locally or transpiling the circuit into verification gates deployable to the Ethereum Execution Layer.

Proof market - while it is possible to generate ZK-proofs locally, it is a resource- and time-demanding process. The proof market allows delegating proof generation to third parties without compromising the correctness or security of the proof.

Proof verifier contract - proof verification happens completely on-chain, via a verifier contract plus verification gates generated from a circuit. Applications using the zkLLVM stack can use a shared instance deployed and operated by the =nil; Foundation team, though this is not mandatory. The verifier is completely standalone - it does not depend on any external data sources, oracles, libraries, etc. - which allows deploying a dedicated instance for additional security.

General approach

This is a high-level description aimed at providing a bird’s-eye view of the end-to-end process; it intentionally omits lengthy and complex explanations. For concrete details, please refer to the separate Detailed spec document.

The work is split between three components: oracle, proof producer and contract.

  • Oracle obtains necessary information from Consensus and Execution layers - such as Beacon Block Header, Beacon State, Lido contract addresses, etc.
  • Oracle computes the report - total locked value and validator counts for all Lido validators (a simplified sketch follows this list).
    • While a “bulletproof” check of whether a validator belongs to Lido requires obtaining Lido validator keys from the Lido Node Operator Keys Service, we found a simple heuristic that can substitute for this check with 100% accuracy.
  • Oracle produces additional data necessary to produce and verify the proof, and passes it to a proof producer
    • Additional data includes Beacon Block hash, Beacon State Merkle root, validators’ inclusion witness, and more.
  • Proof producer accepts the input and “runs” it through a zk circuit, producing a zk-proof.
    • At a high level, the circuit repeats the computations performed in the oracle (including additional verification witness), producing a verifiable “trail” of operations performed.
  • Oracle fetches the proof from the proof producer and submits the report, additional witness and zk-proof to a contract.
  • Contract performs necessary checks - verifies zk-proof with zkLLVM verifier contract, checks Beacon Block hash against an expected value, etc.
  • If all checks pass, the contract stores the report for future retrieval; otherwise it rejects the report.
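
For illustration, here is a minimal Python sketch of the report computation from the steps above (the Validator fields mirror the Beacon State spec; the function and the exact counting rules are simplified assumptions, not the production oracle code):

```python
from dataclasses import dataclass

# Simplified view of the Beacon State fields the oracle reads; the real
# oracle parses these from an SSZ-encoded BeaconState.
@dataclass
class Validator:
    withdrawal_credentials: bytes
    activation_eligibility_epoch: int
    exit_epoch: int

def compute_report(validators, balances, lido_wc: bytes, current_epoch: int):
    """Sum balances and count validators whose withdrawal credentials
    match Lido's - the heuristic described above."""
    total_balance, active, exited = 0, 0, 0
    for v, balance in zip(validators, balances):
        if v.withdrawal_credentials != lido_wc:
            continue  # heuristically, not a Lido validator
        total_balance += balance  # Beacon State balances are in Gwei
        if v.activation_eligibility_epoch <= current_epoch:
            active += 1  # deposited and queued, or already activated
        if v.exit_epoch <= current_epoch:
            exited += 1
    return total_balance, active, exited
```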

Note: until EIP-4788 is available, we will rely on an auxiliary oracle+contract to deliver BeaconBlock hashes to the Execution Layer. This oracle+contract will use an oracle committee and consensus mechanism similar to the Accounting module’s.

FAQ

Q: How is a trustless solution more secure than a consensus-based one?
A: With a trustless solution, the correctness of the oracle report is guaranteed by verifying a ZK-proof and by “anchoring” both the report and the proof to the actual blockchain state (via the block hash and Merkle inclusion proofs). This allows making no assumptions about the oracle operator, the proof producer and other involved parties - as long as the report passes all checks, it is known to be legitimate, even if the sender is compromised. In short, anyone can submit the report, and its correctness can be verified against the blockchain state.

Q: What exactly does the ZK check verify?
A: The oracle report is produced from two “ingredients”: data (validators and balances from the Beacon State) and an algorithm (the computations to perform on them). A proof is essentially an execution trace of a computation performed on some input data. The circuit encodes the expected algorithm, so verifying the provided proof against the circuit ensures that the correct computations were performed on some data. An additional witness (in particular, the BeaconBlock hash and the Merkle inclusion witness) is passed to the contract to check that the oracle used the correct data. Lastly, to prevent tampering with the data after the proof was produced but before the report is submitted, the circuit includes computing the witness from the raw data. Put together, these ensure that the correct computation was performed on the correct data.
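
To make the ordering of those checks concrete, here is a hypothetical Python-flavored sketch of the contract-side logic (the class, its fields and the verifier interface are illustrative assumptions, not the actual contract API):

```python
class ReportContract:
    """Illustrative sketch of the on-chain checks described above."""

    def __init__(self, verifier, expected_block_hash: bytes):
        self.verifier = verifier              # zkLLVM verifier (stubbed here)
        self.expected_block_hash = expected_block_hash
        self.reports = {}                     # slot -> accepted report

    def submit_report(self, slot: int, report, witness: dict, proof: bytes):
        # 1. Anchor to the chain: the BeaconBlock hash in the witness must
        #    match the hash known to the Execution Layer (via the auxiliary
        #    oracle, or EIP-4788 once available).
        if witness["beacon_block_hash"] != self.expected_block_hash:
            raise ValueError("block hash mismatch")
        # 2. Verify the zk-proof; the public inputs bind it to this exact
        #    report and witness, so neither can be swapped afterwards.
        if not self.verifier.verify(proof, public_inputs=(report, witness)):
            raise ValueError("proof verification failed")
        # 3. All checks passed: store the report for future retrieval.
        self.reports[slot] = report
```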

Q: Why build a supplementary check and not a complete replacement for the accounting oracle?
A: While building a complete replacement is possible, we believe gradual development and rollout is more beneficial than a “big leap” approach. Having the trustless solution act as an additional check for a battle-tested existing oracle will gradually build confidence and expertise in operating trustless solutions (especially in terms of development speed, correctness, operational cost, etc.), while we simultaneously expand the trustless solution to cover the full range of accounting oracle responsibilities.

Q: Why zkLLVM and not (something else)?
A: The answer will likely be different for each alternative technology, but here are a few unique advantages the zkLLVM stack provides:

  • Proof verification happens completely in the Execution Layer - i.e. Layer 1 solution with no external dependencies, as secure and reliable as Ethereum blockchain itself.
  • Verifier contract is open-sourced, so dedicated copies can be deployed for additional security, if necessary.
  • “Immediate” verification - the proofs can be securely verified on-chain immediately after they are produced.
  • Proofs can be generated via a centralized on-premises or cloud setup, or delegated to a decentralized proof generation system (proof market). Both options achieve the same security and correctness guarantees.
  • Speed of development, ecosystem and talent pool - ZK-circuits are developed using mainstream programming languages, as opposed to vendor-specific DSLs.
  • zkLLVM is powered by the Placeholder proof system, which requires no trusted setup. This eliminates the remaining proof-system-related attack vector that many other systems (e.g. Halo2- or Groth16-based ones) are subject to.
  • zkLLVM was chosen by the Ethereum, Mina and Solana Foundations for zk-bridging use cases, with an audit by ABDK in progress.
  • The Proof Market applies market dynamics to proof generation, making performance optimization a market-driven metric incentivized by applications; further optimizations therefore do not require additional financing.

Demo

Video: ZKLLVM_Oracle_Demo.webm - Google Drive

Description: ZKLLVM Oracle Demo transcript - Google Docs

Implementation timeline

Phase 1: Initial implementation (main logic)

What: building an initial implementation

  • zkLLVM circuit covering computing main report and Merkle inclusion proofs.
  • Reading BeaconState and BeaconBlockHeader from Consensus Layer
  • Computing SSZ hash tree roots and inclusion proofs (a simplified sketch follows the details below).

When: 2-3 weeks from now.
Outcome: “happy path” works end-to-end in a devnet/forked mainnet
Details:

  • ZK circuit proving validity of computations.
  • Oracle implementation - fetching data from Consensus Layer, computing the report, generating the proof and sending it to Ethereum contract.
  • Contract to perform validity and correctness checks (incl. proof verification).
  • The solution runs in a production-like environment (devnet/forked mainnet), controlled by “scenario script(s)” - end-to-end tests, emulating different real-world scenarios.
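
Assuming standard SSZ merkleization (SHA-256 over 32-byte chunks), here is a minimal Python sketch of the hash-tree-root and inclusion-proof computations referenced above (simplified: real SSZ merkleization also mixes in the length for list types):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks: list) -> bytes:
    """SSZ-style merkleization of 32-byte chunks, zero-padded
    to a power of two."""
    assert chunks, "expected at least one chunk"
    layer = list(chunks)
    while len(layer) & (len(layer) - 1):       # pad to a power of two
        layer.append(b"\x00" * 32)
    while len(layer) > 1:
        layer = [sha256(layer[i] + layer[i + 1])
                 for i in range(0, len(layer), 2)]
    return layer[0]

def verify_inclusion(leaf: bytes, branch: list,
                     generalized_index: int, root: bytes) -> bool:
    """Check a Merkle inclusion witness: hash the leaf up through its
    sibling branch and compare against the known root (e.g. the Beacon
    State root committed to by the Beacon Block hash)."""
    node = leaf
    for sibling in branch:
        if generalized_index & 1:              # node is a right child
            node = sha256(sibling + node)
        else:                                  # node is a left child
            node = sha256(node + sibling)
        generalized_index >>= 1
    return node == root
```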

Phase 2: Productionization

What: Productionization
When: 3-6 weeks from Phase 1
Outcome: “code complete” - all parts of the solution are finalized and ready for security audit.
Details:

  • All components of the final solution are lifted to Lido production quality (monitoring, logging, ACL, dockerization, etc.), finalized (no code changes expected) and ready for the audit.
    • Oracle is fully productionized (logging+monitoring+dockerization), documented and open-sourced.
    • Contract is expanded to include administrative functionality (ACLs, redeployment, DAO capabilities, etc.), where necessary.
    • Contracts pass necessary audit(s)

Phase 3: Audit and final polishes

What: Audit
When: End of Phase 2 + time for audit
Outcome: ready for testnet and mainnet deployment ceremony
Details:

  • Testnet deployment and user-acceptance testing
  • Security audit and fixes

Update: I deleted a post that described how the DAO should think about prioritization and work on these priorities with external partners. After looking into this, I realized I wasn’t aware of all the history and back-and-forth that had already gone into making this proposal. It would be unfair to change the process now that you have already put work into it. I will retract my message and repost it as an individual thread in a week or two to discuss it separately.

On the proposal itself, I pretty much agree with @vsh and think it’s of good value to Lido – that part didn’t change.


IMO: this proposal is for something of great value to the Lido protocol. Incorporating zk oracle checks will reduce the risk of oracle misbehavior or an oracle software bug. The challenges on the Lido protocol side here are:

  • there are quite a few different things the Lido oracle does, and it’s not practically possible to replace them all with zk oracles at once
  • zk tech is pretty green; I think the safer way to incorporate it is not to replace the existing oracle outright, but to add it gradually as a supplementary check
  • the additional computation and gas costs here are pretty bulky; having a reasonable upper limit on the time and gas costs of delivering a value is important
  • changes to on-chain oracle code should be aligned with the cadence of upgrades for the protocol itself, which is usually 6-12 months between upgrades.

This proposal is mostly aligned with this, with the exception of the timelines (there’s no time pressure to implement things fast, actually - they won’t be incorporated fast, so there’s time to spare for polish), and thus maybe we can afford code that uses a more precise way to determine which validators are part of the Lido set than checking by withdrawal credentials.

On the ask side:

  • $90k for a zk proof oracle implementation for three of the simpler outputs is a reasonable ask, IMO
  • the hack used to compute the balance (relying on withdrawal credentials) is not fully reliable; e.g. it would fail to deliver a correct value on the testnet right now, because making a deposit with Lido withdrawal credentials is permissionless
  • I think it’s reasonable that audits of smart contract code used by Lido are fully or partially compensated by Lido DAO, but I don’t think it’s a good idea to precommit to them before an actually usable system is in place, or before there’s an understanding of who can actually audit zk code; I suggest leaving this question for the future, when there’s an understanding of:
    • whether the code in question goes into an upgrade proposal
    • who the best firms to audit this code would be, and what their ask is.

It was a great pleasure to dive into such a well-crafted, high-quality proposal. Developing a trustless zk oracle is undoubtedly a high-priority project for Lido and will bring significant value to the DAO.

I’d like to emphasize @vsh’s note about the hack. It probably would not work on the testnet and, more importantly, it can lead to a DoS attack if someone were to set up a validator with Lido credentials without involving the Lido contract. The Lido contract checks the invariant deposited_validators >= reported_validators upon report, and if it fails, no reports will go through.

After reviewing the detailed spec, I have some feedback to offer:

  • active_validators and exited_validators are not the correct values to be reported by the oracle. The current report provides the number of all visible Lido validators on the Beacon chain side, including active, exited, and even not-yet-activated validators.
  • Using the term TVL for the sum of Lido validators’ balances might be confusing, as it’s only a portion of Lido’s total TVL. I suggest adhering to the oracle code and consistently referring to it as clBalance instead.

Overall, I’m excited to see this proposal and eager to assist in making it live from the Protocol team side.

4 Likes

I think there’s no strict need for an MVP value to be exactly the same as the offchain oracle provides - it’s just useful for a sanity check. E.g. the total balance of all validators with Lido withdrawal credentials is >= the clBalance reported by offchain oracles, and making the difference more than 1% is insanely expensive, so we could make the sanity check clBalance <= zkLidoWCTotalBalance <= clBalance * 1.01, and stop the oracle update if it fails.

The second value (number of validators), to be useful, has to be in the same ballpark (very close to the one reported by offchain oracles); and the third - related to exited validators - has to be useful for sanity-checking exited validators by module for the period.
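
As a sketch, the suggested bound in integer math (the names are illustrative, and balances are assumed to be in Gwei):

```python
def sanity_check(cl_balance: int, zk_lido_wc_total_balance: int) -> bool:
    # clBalance <= zkLidoWCTotalBalance <= clBalance * 1.01
    return cl_balance <= zk_lido_wc_total_balance <= cl_balance * 101 // 100
```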

Yep. We can make some ballpark checks, of course.
But I considered the approach from the longer-term perspective of completely replacing our conventional oracle, as the proposal outlines. And if we want that, we need some other method to prove inclusion.
As for the third one, the oracle now reports only changed values of exited validators per module, and we’ll need some additional internal logic to include values that did not change in this check. It’s also possible, but it makes the integration more complex and will require calling StakingRouter, iterating over modules’ stats, and so on. And the possible impact of forging these values is pretty limited, so I’d rather keep it simple for an MVP and concentrate mainly on clBalance.


Hi all! Thanks for the responses and feedback - really appreciate it!

@Hasu +1 - totally agree on the importance of solid processes, and appreciate the flexibility!

@vsh, @folkyatina

there’s no time pressure to implement things fast, actually - they won’t be incorporated fast, there’s time to spare for polish

That’s good to know, thanks!

we can afford a code that would use a more precise way to understand what validators are a part of Lido set than checking by withdrawal credentials
the hack used to compute the balance (relying on withdrawal credentials) is not fully reliable; e.g. it would fail to deliver correct value on the testnet now, because making a deposit with Lido withdrawal credentials is permissionless

Thanks for pointing that out! In general, the withdrawalCredentials check was used as “the most reliable of the simple solutions” - and unfortunately, as you pointed out, it’s not reliable enough. There is a range of more complex options that can achieve higher reliability - from additional oracles, to stateful contracts (plus something to update their state), to changes in the Lido protocol - each with different tradeoffs and risks. Happy to provide more insight into the challenges, potential options and their tradeoffs/risks - let me know if that would be helpful and what’s the best format for it (e.g. comment here, separate thread, shared doc, video call, etc.)

It probably would not work on the testnet and, more importantly, it can lead to a DoS attack if someone were to set up the validator with Lido credentials but without involving the Lido contract. The Lido contract checks for the invariant deposited_validators >= reported_validators upon report, and if it fails, no reports would go through.

Acknowledged. Yes, it doesn’t tally on testnet - there are already some old validators on testnet that use Lido withdrawalCredentials but are not part of the Lido protocol (at least from the perspective of keys-api).

  • active_validators and exited_validators are not the correct values to be reported by the oracle. The current report provides the number of all visible Lido validators on the Beacon chain side, including active, exited, and even not-yet-activated validators.

Good point - it is a relatively simple change to make. Re: “not activated yet” validators - the actual check is “activation_eligibility_epoch <= current_epoch” - if I’m not wrong, this means “the validator has deposited ETH and is in the activation queue, or is activated”. Anyway, for this logic I treated the lido-oracle accounting module as a “reference implementation”, and what I ended up with tallied with the accounting module report (on multiple runs on multiple days). This might mean I just haven’t hit an edge case though, so I appreciate your insight on how it should behave.

Using the term TVL for the sum of Lido validators’ balances might be confusing

Sure, would a CL prefix be a good replacement (i.e. clBalance, ZKCLOracleContract, etc.)?

Overall, I’m excited to see this proposal and eager to assist in making it live from the Protocol team side.

Thanks!

But I considered the approach from the further perspective of complete replacement of our conventional oracle as the proposal outlines. And if we want it, we need some other method to prove inclusion.

Definitely - there are quite a few things to do before this could completely replace the conventional oracle. I think the approach of using the zk-oracle as an additional sanity check for the conventional oracle (with some “relaxed” checks, as @vsh noted) allows capturing some value and learnings earlier, and gradually enhancing precision, adding other reported fields, etc.

I’d rather make it simple for an MVP and concentrate on clBalance mainly

I think if making a check between the zk and conventional oracles incurs some unwanted complexity/overhead/etc., this could be left for a future enhancement. On the other hand, it is valuable (at least from an “obtain learnings” perspective) to have it reported and proved by the zk-oracle, so I think the best course of action is to keep it reported by the zk-oracle.


Updates on some additional steps we’ve taken over the week.

Tested with mainnet

We have tested the solution with real data from mainnet and compared it to the lido-oracle accounting report. In short, the ZK-oracle and accounting module reports match perfectly on CL balance, total validator count, and exited validator count.

Report: slot 6984000, all validators 247929, exited validators 1739, CL balance 7880438321299961 Gwei

Oracle logs: gist
Screencast: video

The report was built on lido-oracle/develop with minimal adjustments:

  • Enable running reports without being a member of the oracle committee.
  • Always report the count of exited validators.
  • Compute the report on the “first slot of the next epoch after currentFrame.ref_slot” - practically just +1 slot. This is a temporary means to align lido-oracle with the BeaconState checkpoints available through public services. In production, the oracle relies on a private Beacon Chain node and should be able to obtain the BeaconState for any slot.

Note: the screencast was recorded for a smaller problem size - see next section for details.

UPD: another run, without “+1 slot” (i.e. canonical lido-oracle logic): gist

Proof generation time and verification cost

We successfully ran the entire solution end-to-end on smaller subsets and confirmed the following.

2^10 validators (CPU-only Single-Core Intel Xeon):

  • Correctness: generated proofs consistently pass verification.
  • Proof generation time: 45 minutes.
  • Proof verification cost: largest expense we observed was ~3.6m gas.

2^10 validators (CPU-only 128-Core AMD EPYC):

  • Correctness: generated proofs consistently pass verification.
  • Proof generation time: <= 60 seconds.
  • Proof verification cost: largest expense we observed was ~3.6m gas.

2^20 validators (CPU-only 128-Core AMD EPYC):

  • Correctness: generated proofs consistently pass verification.
  • Proof generation time: ~292 minutes.
  • Proof verification cost: largest expense we observed was slightly above ~5m gas.

Since there is still room for optimization, the following numbers are achievable:

  • Proof generation time: ~25-30 minutes
  • Proof verification cost: ~5m gas

No hardware will be required on Lido’s side for proof generation, as all computations will be outsourced to the Proof Market.

Code

Note: there are a bunch of shortcuts I’ve made to speed up development (largely around putting things in certain locations, hardcoding absolute paths, setting up the environment, etc.). These will be cleaned up later, but if someone would like to run it - let me know, so I can walk you through the “environment setup”.


Alrighty. Here goes a little update to this thread.

First of all, =nil; Foundation is happy to join and help facilitate this effort.

Second of all, just to confirm, the following estimations and key points seem to be aligned with our vision:

  • Report generation time to be about 30 min from the end of epoch.
  • Cost of verification to be about 1.5 mln gas.
  • Solution audit to be arranged in 8-10 weeks.
  • Audits are left to be decided at the discretion of the DAO.
  • DAO financing requested can be split 50% up-front and 50% upon reaching the deadline.
  • The prover and the verifier of the solution are open-source and used within many different projects.
  • The solution’s Placeholder proof system requires no trusted setup.

@nemothenoone hey hey, the LEGO council reviewed and approved this grant by a majority of votes. I will provide an update with the tx when it’s ready, ty!


@nemothenoone the first leg of payment is sent: Ethereum Transaction Hash (Txhash) Details | Etherscan


Hey everyone, sharing a brief update on the state of =nil; Foundation’s zkOracle for Lido.

Recently, we came to a crossroads in development. After running some preliminary tests on our first Prover version, we realized we would not be able to meet the desired requirements for Lido for this zkOracle. Specifically, those requirements are:

  • Proof generation time: 30min
  • Proof generation cost: $200

Our initial tests came out to approximately 2 hours for ~$1k. You can see our first update here.

After discussing with the Lido core contributors, it’s obvious this is not a viable solution for a production-grade zkOracle. The contributors suggested we evaluate the DendrETH approach to the zkOracle.

We have reviewed the design proposed by DendrETH, which uses a data caching technique to mitigate memory bloat. We ran into a similar problem with another project of ours (CasperFFG) and believe this is the optimal design. We’ve reassessed our solution to identify how it might be repurposed to fit this design, but unfortunately there doesn’t appear to be much reusable work.

The DendrETH approach can reduce the validator merklization cost by 100-200x, depending on the size of the subproblem. In practice, they saw a 9-10x speedup (no further details on setup and branching factor). At a glance, if we are to replicate this, we’d need a “recursive proof with explicit circuit”. This would require making a dedicated circuit to compute validator Merkle subtrees, changing our main circuit to accept (merkle_subtree_root, merkle_subtree_proof) pairs, verifying those proofs, and computing the overall Merkle tree root from the subtree roots. Additionally, we’d need to provide a caching mechanism to store subtree proofs (it could be centralized, or baked into the oracle - the former is faster/more performant, while the latter adds devops overhead but has a smaller attack surface). Finally, we’d need to change the oracle to use the “subproblem circuit” first, and use the cache in cases where validators didn’t change.
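
To make that concrete, here is a rough Python sketch of the recombination step (all names are hypothetical, and verify_subproof stands in for the recursive verifier that would run inside the main circuit):

```python
import hashlib

def hash_pair(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(left + right).digest()

def combine_subtree_roots(subtree_pairs, verify_subproof):
    """Verify each (merkle_subtree_root, merkle_subtree_proof) pair, then
    fold the verified roots into the overall validators' Merkle root.
    Assumes a power-of-two number of subtrees."""
    roots = []
    for root, subproof in subtree_pairs:
        # Each subtree was merkleized (and its proof cached) by the
        # dedicated "subproblem" circuit; here we only re-verify it.
        if not verify_subproof(root, subproof):
            raise ValueError("invalid subtree proof")
        roots.append(root)
    layer = roots
    while len(layer) > 1:
        layer = [hash_pair(layer[i], layer[i + 1])
                 for i in range(0, len(layer), 2)]
    return layer[0]
```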

Accordingly, we must reset internally and assess how we might allocate resources within our organization to deliver the additional work. We’re also speaking with several third-party teams about taking over the completion of this work. We expect this assessment to take approximately 2 months as we restructure our team.

If you have any questions please let us know!
