Ethereum Node Operator EL Diversity Improvement Commitments

Node Operators using the Lido Protocol have been making efforts to improve Execution Layer (EL) client diversity since minority clients stabilized and matured post-Merge (I know, NOM workstream contributors bug them a LOT about it and it’s @yorickdowne’s part time job in the validators chat).

While aggregate data for Q4/23 is still being crunched, it’s as good a time as any for some public commitments by Node Operators. This would be a useful signal to the DAO, stakers, and the wider network that NOs are cognizant of room for improvement and are taking steps towards further safeguarding the network against a possible block validity bug in a (super)majority client.

As such, I’d like to propose through this thread that Node Operators make public commitments about their plans to further decrease majority client (currently geth) utilization with regards to validators that they run as a part of the Lido protocol (and in general). It’s definitely also a place for those who are leading the charge (and offering help either through the maintenance of software that helps prevent such an event or through infra migration advice) to receive plaudits for doing so!

I suggest the following (which, best I can tell, is already happening but twitter gets angry if you don’t cc them on everything):

“Majority client” is defined as the EL client most widely used by validator nodes on Ethereum, with a utilization rate >= 50%*

  • Node Operators using the Lido protocol target < 2/3rd majority client utilization across Lido validators (in aggregate) by end of Q1 and maintain indefinitely,
  • Node Operators using the Lido protocol target <= 55% majority client utilization across Lido validators (in aggregate) by end of the year with a view to gradually decrease even more,
  • Node Operators target < 2/3rd majority client utilization on a per operator basis by end of Q3, with a view to progress below 55% ASAP

* It’s possible for Node Operators to use a majority client in multi-client setups where the client is used to propose blocks but not validate them, or in setups where consensus amongst multiple clients is needed, etc, so that should be taken into account. It’s not a simple case of “using geth (as an example) or not”, but how the client is utilized in the NO’s validation setup.
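To make the three targets concrete, here is a minimal sketch of how they could be checked against survey data. The operator names and validator counts are made up for illustration, and "majority client validators" is assumed to mean validators whose attestations depend on the majority client, per the footnote above.

```python
from fractions import Fraction

# Hypothetical data: operator -> (validators attesting via the majority
# client, total Lido validators run by that operator). Not real figures.
operators = {
    "operator_a": (900, 1000),
    "operator_b": (500, 1000),
    "operator_c": (400, 1000),
}

TWO_THIRDS = Fraction(2, 3)
FIFTY_FIVE_PCT = Fraction(55, 100)


def aggregate_utilization(ops):
    """Majority-client share across all operators combined."""
    majority = sum(m for m, _ in ops.values())
    total = sum(t for _, t in ops.values())
    return Fraction(majority, total)


def check_targets(ops):
    """Evaluate the aggregate and per-operator targets from the proposal."""
    agg = aggregate_utilization(ops)
    return {
        # Target 1: aggregate utilization strictly below 2/3.
        "aggregate_below_two_thirds": agg < TWO_THIRDS,
        # Target 2: aggregate utilization at or below 55%.
        "aggregate_at_most_55pct": agg <= FIFTY_FIVE_PCT,
        # Target 3: operators that individually are still at or above 2/3.
        "operators_at_or_above_two_thirds": sorted(
            name for name, (m, t) in ops.items() if Fraction(m, t) >= TWO_THIRDS
        ),
    }


result = check_targets(operators)
print(result)
```

With these made-up numbers the aggregate sits at 60%, so the first target is met, the second is not, and one operator still misses the per-operator target.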

24 Likes

We have been talking about this internally for some time.
Thank you, Izzy, for putting it here.

Kukis Global is committed to reducing Geth usage to less than two-thirds of Lido validators by the end of Q1.

10 Likes

Attestant does not use geth for attestations.

13 Likes

Vouch is an amazing piece of software and I can’t wait to hear more about the new beta features in today’s Node Operator Community Call.

6 Likes

I’m glad to see this is getting some attention here!

That said, I think these goals are extremely unambitious. The recent context (Besu and Nethermind issues) has really highlighted how real a risk we face of a client bug in Geth. Even if Geth’s devs are 100x less likely to put in a bug, taking that risk is currently not a sound choice. See the modeling in https://docs.google.com/spreadsheets/d/1N9Rjia84SQSedFzmBtnipnWj8_ND0tFS0p1C6q8lybc/edit#gid=0 to get a feel for the risks here – having a supermajority bug is disastrous to stakers, Lido, and Ethereum.

Apologies I don’t have enough reputation on this forum to post links.
Cheers, and hope y’all help make Ethereum safer for everyone.

Current state

https://twitter.com/LidoFinance/status/1749860092885819577

Izzy’s proposed targets

First target: This requires a <1% change in the next ~2 months for Lido. Why? Now is a great moment for change. There is huge momentum. Any changes that happen now are fantastic marketing.

Second target: This requires a 12% share drop in ~11 months. Lido’s 24 biggest Node Operators each contribute 3% or more. This is aiming for as few as 4 Node Operators to switch (out of 35). I believe Lido can and should aspire much higher than this for a year’s progress. I don’t have a breakdown per-operator, but I suspect it would be hard to fail this target if meeting the third target.

Third target: This requires action from all operators, which I love. However, the target is quite modest.

In general, while I understand that the aggregate is what matters to the chain, I think individual goals make more sense here as they affect all Node Operators equally, and Node Operators can directly speak to their progress or lack thereof. Beware diffusion of responsibility. If wiggle room is desired, it might make more sense to allow it for node operators with smaller shares within Lido.

Val’s target thoughts

Disclaimer: I’m not part of Lido DAO. I’m very active in Rocket Pool. I’m pushing for Lido to aspire higher. I’m not married to these and don’t care at all if you use the specific things I’m proposing below – I just wanted to provide something concrete, as I find it easier to discuss with a solid starting point.

  • End of Q1: at least 5 Lido NOs have switched away from majority clients [edit: for attestation]. Celebrate the Geth-less NOs (pre-existing ones and new ones) and your new stats on Twitter.
  • End of Q3: Node operators with >2.5% of Lido validators aim to be below 2/3 majority client utilization [edit: for attestation]. Post about NOs that are fully majority-less, hit the goal, or missed the mark (and hopefully they can talk about this on forum).
  • End of Q4: at least 20 Lido NOs use no majority clients [edit: for attestation]. Celebrate the majority-less NOs and your new stats on Twitter.
5 Likes

Hey Val, always nice to see cross-community constructive feedback!

In my mind, the work that had been done in Q3+Q4 (Q3 improvement looked small but one has to take into account that improving diversity via onboarding rounds has a lagging effect due to organic stake redistribution taking a bit of time) would have ended up with Lido validators at ~70% geth utilization which wouldn’t be great but was OK EOY. Obviously NO efforts came in better than hoped (which is great), but I wrote the thread in the morning (and we’re almost a month into the quarter) when there hadn’t even been time to take a look at the preliminary Q4 data. In general I think the target for < 2/3rds is still a safe one (if a few more big staking solutions can match this then the network as a whole will be a lot better off), but it would obviously be great to see something like ~60%.

I think if you take into account that many operators at this scale operate very deliberately when considering infra changes of this magnitude (just look how slowly large CEX stakers are responding to this) it’s not that modest. Most operators have to re-tool their entire infrastructure for something like this (e.g. adding support for monitoring, updating logs monitoring for different messages, perhaps modifying RPC calls), which isn’t always so easy (depends on their setup to begin with). It likely also affects other validators/customers which they may have paper contracts with, perhaps with certain performance SLAs they have to work around (which would require robust testing first to make sure you can hit them), etc. At the same time, the community is asking operators to embrace clients which have recently shown that they’re a bit unstable (harder to square against stuff like SLAs) and which, from a performance perspective, may not be ideal matches for infra choices the NO has made thus far (e.g. certain clients really need very strong SSDs). There are also other NOs (e.g. client teams, smaller teams) who, although they may be super keen to work on this, simply don’t have the resources to do so at the same speed as others (especially since there’s a HF en route).

The only part I disagree with is here:

I don’t think aiming for no majority client use is necessarily desirable. Some NOs may elect such a strategy or may be “monoclient shops” (even though that has its own cons), but all clients potentially have a place in a robust setup, and I don’t think it’s the DAO’s job to set these kinds of targets. For example, Attestant’s setup (as indicated by Jim in his reply above and in detail on today’s community call) uses geth but does not use it as an EL for attestations (though it may, e.g., use it for block proposals). This kind of usage is perfectly fine and is very useful to avoid possibly landing on a non-canonical chain.

7 Likes

Hello everyone, we wanted to share some of our plans.

First off, thanks again for raising this topic. We have been discussing and testing different setups internally, and we should be ready to make some commitments for the first half of 2024.

We have it as a priority for Q1 and Q2 to set up our EL clients to be distributed 30/30/30 in Besu/Nethermind/Geth. We are also making other infrastructure commitments, like achieving a CL client distribution of 50/50 Teku/Lighthouse in the same period, as well as other various improvements.

Finally, we are in the research phase for Vouch and in contact with the Attestant team to hopefully start some testing soon!

7 Likes

I want to point out some technicalities here, for the planning NOs are doing.

While Geth has a supermajority, you’ll want to avoid attesting with Geth. Moving from a Geth monoculture to a multi-client setup alas does nothing (other than provide a false sense of security) if Geth is still used for attestations.

Possible solutions:

  • Vouch, remove Geth from attestations
  • Vouch, use new majority strategy with Geth, Nethermind and Besu, so that a malfunctioning Geth will be overruled by the other two
  • Nimbus with two different EL
  • Any other VC, have some keys that are backed by nodes with Geth (and maybe some minority), and some keys that are backed only by minority ELs. That latter portion is then contributing to resolving this issue.
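As a toy illustration of the "majority strategy" idea in the second bullet: ask several ELs which head they consider valid and only attest to a head that a strict majority agrees on. This is a sketch of the concept only, not Vouch's actual implementation or configuration; the function name and the head values are hypothetical.

```python
from collections import Counter
from typing import Optional


def majority_head(heads: dict[str, str]) -> Optional[str]:
    """Return the head reported by a strict majority of ELs, else None.

    `heads` maps an EL name to the block root it considers the valid head.
    Returning None stands for "no agreement, so do not attest".
    """
    if not heads:
        return None
    head, votes = Counter(heads.values()).most_common(1)[0]
    return head if votes * 2 > len(heads) else None


# A malfunctioning Geth is overruled by Nethermind and Besu.
heads = {"geth": "0xbad", "nethermind": "0xgood", "besu": "0xgood"}
print(majority_head(heads))

# With only two ELs that disagree there is no majority to follow.
print(majority_head({"geth": "0xbad", "besu": "0xgood"}))
```

The second call shows why three ELs are needed for this approach: with two disagreeing clients there is nothing to break the tie.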

Our survey is not accurate for multi-client setups. This also can lull us into a false sense of security once the numbers show 66% and below for Geth. Right now, if someone were to run Geth and Besu, validators are counted 50:50. But that’s not accurate. The percentage of validators on a client can rise above 100% total with a multi-client setup. As long as Geth is still used for attestations, the validators need to be counted towards it.

We could solve this with a field that captures the VC, and three possible values for Vouch.

Any VC other than Vouch: Count validators multiple times, for each EL.
Exception: Nimbus with two ELs, only one of which is Geth, should be counted like “Vouch and majority strategy”, below.

Vouch and default setup: Same as above
Vouch and Geth excluded from attestations: Count for the ELs that aren’t Geth
Vouch and majority strategy: Oh, tough. I am not sure. It’s not accurate, but for purposes of “are we at risk” we could treat it like “Vouch and Geth excluded from attestations”

A more accurate reporting might have two numbers:

  • Number of validators that will attest on a wrong fork, and their EL/CL setup. That’s everything above except Vouch with the majority strategy, Nimbus with two ELs, and any other VC with a majority strategy; those validators aren’t counted here at all.
  • Number of validators that won’t attest on a wrong fork, and their multi-client EL/CL setup. That’s everything that uses Vouch and majority strategy, and any other VC that implements such a function, and Nimbus with two different ELs.

Yes that gets us to >100% because of multi-client - but that’s a better view of things for assessing this risk.
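The counting rules proposed earlier in this post could be sketched roughly as below. The `Setup` record, its field names, and the example numbers are assumptions made for illustration; they are not the format of the actual survey.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    """One group of validators sharing the same VC and EL configuration."""
    validators: int
    vc: str                      # e.g. "vouch", "nimbus", "lighthouse"
    els: tuple                   # ELs backing these validators
    vouch_mode: str = "default"  # "default", "geth_excluded" or "majority"


def counted_els(setup, majority_client="geth"):
    """ELs this setup's validators are counted towards."""
    shielded = (
        setup.vc == "vouch" and setup.vouch_mode in ("geth_excluded", "majority")
    ) or (setup.vc == "nimbus" and len(set(setup.els)) == 2)
    if shielded:
        # Majority client does not decide attestations: count minority ELs only.
        return tuple(el for el in setup.els if el != majority_client)
    # Any other VC, or Vouch in its default setup: count once per EL.
    return tuple(setup.els)


def client_shares(setups, majority_client="geth"):
    """Share of validators counted towards each EL; may sum to more than 1."""
    total = sum(s.validators for s in setups)
    counts = Counter()
    for s in setups:
        for el in counted_els(s, majority_client):
            counts[el] += s.validators
    return {el: n / total for el, n in counts.items()}


setups = [
    Setup(100, "lighthouse", ("geth", "besu")),
    Setup(100, "vouch", ("geth", "nethermind", "besu"), "majority"),
    Setup(100, "teku", ("geth",)),
    Setup(100, "nimbus", ("geth", "nethermind")),
]
shares = client_shares(setups)
print(shares, sum(shares.values()))
```

In this example every client ends up at 50%, for a total of 150%, which is the expected effect of counting multi-client validators once per EL.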

7 Likes

Yep this is all very correct (there’s also the spectre of two buggy implementations which in concert could lead to a supermajority finalizing the “incorrect” chain), but we can probably set that aside for now.

Keen to work with you to improve the survey by end of Q1 so that we can tease out these details appropriately!

3 Likes

As a dedicated node operator within the Lido protocol, Ebunker is committed to the continual improvement of the network’s security and stability. In line with the ongoing efforts to enhance Execution Layer (EL) client diversity, we would like to outline our strategy and commitment in this regard.

We have historically utilized Geth as our primary EL client, valuing its stability and robust performance. However, recognizing the importance of client diversity post-Merge and the maturity of minority clients, we are proactively adapting our strategy.

Our roadmap for 2024 includes a phased migration from Geth to other stable clients like Nethermind and Erigon. This transition will begin in the first quarter of 2024, starting with an initial 20% migration of our EL clients. This approach aligns with the broader Lido protocol’s objective of reducing reliance on a majority client, hence diminishing the risks associated with a supermajority client block validity bug.

We aim to closely monitor and evaluate the performance of these alternative clients. Based on these assessments, we plan to steadily increase our migration efforts. By the end of the year, we anticipate achieving a migration ratio of 50% away from Geth.

This strategy not only reflects our commitment to the Lido protocol but also our dedication to the larger Ethereum ecosystem. We believe that by diversifying the EL clients, we can collectively enhance the network’s resilience and reliability.

5 Likes

This would be statistically correct assuming the VC (including Vouch) is selecting the consensus clients correctly. I can’t speak for Vouch, because it’s been some time since I last used it on mainnet, but Lighthouse, Nimbus, and Teku don’t always pick the “best” performing/stable consensus client they could use. This is not a static problem but a moving target, as the underlying behaviour changes all the time.

Also, there are many combinations (EC/CC/VC) that work well, but others simply don’t work well together, in certain areas or scenarios.

Some questions popping in my head:

  • What happens if the majority client forks, and some of the VCs make a wrong decision, following the wrong majority client instead of the correct minority client(s)?
  • What happens if one or more minority clients fork, but most VCs start following the wrong minority client(s)?
2 Likes

Hello,

By the end of February 2024, Kiln commits to having 100% of its Lido validators running on a non-majority client (Nethermind). We’ll most likely get there by the end of next week if it goes smoothly.

We have been running Nethermind at scale on 100 000 validators on Holesky for a few months now without issues and have enough confidence to move forward.

We’ll gradually pursue this effort in parallel for other non-lido validators we operate.

7 Likes

They would also end up on the “wrong” fork. As Thorsten and Jim pointed out earlier, and again on this week’s NOCC call, ultimately what matters is the attestation selection made.

As dankrad explains, it depends on how many of them (i.e. aggregate stake share) end up on either fork.

If that fork has a supermajority and finalizes, the “worst case” scenario will manifest. The problem isn’t really a specific client problem but a “spec implementation” problem, i.e. if multiple clients interpret (implement) the spec incorrectly in the same way and attest to an invalid block – even if these X clients have < 2/3 share independently – they will all end up on a “bad” fork; if collectively they have >= 2/3 share, that fork will finalize.
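A small sketch of that condition, using made-up client names and attesting shares: what matters is the combined share of all clients that share the same bug, not the share of any single one.

```python
from fractions import Fraction

# Hypothetical attesting stake share per client (not real network data).
shares = {
    "client_a": Fraction(45, 100),
    "client_b": Fraction(25, 100),
    "client_c": Fraction(30, 100),
}


def bad_fork_finalizes(attesting_share, buggy_clients):
    """True if clients sharing the same bug jointly hold >= 2/3 of the stake."""
    combined = sum((attesting_share[c] for c in buggy_clients), Fraction(0))
    return combined >= Fraction(2, 3)


# client_a alone is well below 2/3, so its bad fork cannot finalize.
print(bad_fork_finalizes(shares, {"client_a"}))

# client_a and client_b sharing the same bug reach 70% and would finalize.
print(bad_fork_finalizes(shares, {"client_a", "client_b"}))
```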

3 Likes

Hi, SenseiNode here.

We run the majority of our EL with nethermind and also run most of our CL with Nimbus.

Having said this, we plan to reduce the use of geth in our attestation services and rely on our geth nodes only as RPC nodes for our monitoring system, as geth has proven to be very reliable and robust at handling many RPS.

On our roadmap for Q2 we have planned the implementation of Erigon and Besu, which we don’t currently use. We have also tried other clients on testnet like Reth and would like to use it on mainnet once it becomes more mature.

5 Likes

:tada: This is awesome!

Allnodes have also stepped up: https://twitter.com/allnodes/status/1750519886286295117?s=46

I stand by the take that the goals could/should be dramatically more ambitious than the initially proposed ones (and get us to a stronger Ethereum network much sooner).

8 Likes

I wanted to bring up some critical points:

  1. Need for More In-Depth Research: It’s become evident that we need more comprehensive research to understand the nuances and potential impacts of our diverse node setups, particularly in multi-client environments. As we’ve discussed, different clients can react uniquely in critical scenarios, like forks, which can have widespread implications for the network.
  2. Opening Up Node Operator Data for Analysis: I strongly advocate for opening up more data regarding node operator setup metadata and metrics. By making this information accessible to teams and individuals, we can enable a broader spectrum of the community to analyze, interpret, and come up with optimal solutions. This transparency is not just about accountability; it’s about leveraging collective intelligence to strengthen the network and protocol as a whole. Importantly, this data can and should be anonymized to protect the privacy and security of Node Operators while still providing valuable insights.
  3. Demanding a Broader View of Incident Impact: We also need to shift our focus to a more holistic view of potential network incidents. This means not just looking at the majority or minority client scenarios but considering the ‘blast radius’ of incidents like forking. Understanding the full scope of potential impacts, including indirect effects on various parts of the network, is crucial for developing more resilient strategies.
  4. Collaboration and Openness: As Izzy suggested, creating an open format for infra reporting is a step in the right direction. However, we need to foster a culture where Node Operators are comfortable with sharing data. This openness will not only enhance our collective ability to respond to challenges but will also foster innovation and collaboration.

In conclusion, the complexity of both the Lido protocol’s node operator set and the Ethereum network requires a multifaceted approach. More research, greater transparency, and a broader perspective on incident impact are key to strengthening both.

Looking forward to more discussions and collaborative efforts in this direction.

6 Likes

P2Porg commits to stop using geth for attestations by the end of February.

We have already transitioned all non-Lido validators to besu, and making the same switch for our Lido validators is not a significant challenge. However, we recognize the great opportunity to conduct a clear A/B test and compare the geth vs. besu setup.

Therefore, we will collect the data and share the research findings with the community. After completing this analysis, we will proceed with transitioning to a geth-free setup.

5 Likes

Hey everyone, Chainbase Staking here! We are currently using both Erigon and Geth in our operations. Geth’s known for its stability, which is crucial for ensuring our services run smoothly for clients. On the other hand, using Erigon as a Validator’s execution layer aligns with our commitment to Ethereum’s decentralization and overall network health.

We all are balancing the immediate needs with the broader goal of supporting a diverse and resilient Ethereum ecosystem. Hence, we see this as an excellent opportunity to gather data and conduct an in-depth comparison between the Erigon and Geth setups! We will share our analysis with the group once completed, and hope that our research will contribute to a more informed discussion around these topics in the community!

3 Likes

Lodestar/ChainSafe team here.

We have already diversified 30% of our beacon nodes to run Lodestar + Nethermind. We’ve started testing Lodestar + Besu, another popular combination with solo stakers we’ve spoken to. We are aiming to have transitioned to 10% Geth, 45% Nethermind and 45% Besu for the mainnet Deneb fork. We wanted to maintain a small portion of Lodestar + Geth as a primary for continuous comparison and compatibility oversight with Lodestar. As other execution clients stabilize further, we will aim to include them as well.

7 Likes

Hi, this is Sen from HashKey Cloud.

Previously, we have been conducting relevant tests with Besu on the testnet. According to our migration plan, we will complete the migration of relevant clients in Q1:

  • EL: Geth (50%) + Besu (50%)
  • CL: Prysm (50%) + Lighthouse (50%).

Client diversity is an ongoing priority, and we will continue to work hard on it. According to the current plan, we will do more comprehensive testing with Nethermind/Nimbus and hope to be able to apply it in Q2 or Q3 to reduce risks for our users, and support the overall health of the Ethereum network.

6 Likes