Mainnet sync committee performance test

Hello Lido community,

We are a Lido NO (CryptoManufaktur) and run a Vouch/Dirk validator client connected to three diverse consensus:execution layer client pairs (Lighthouse:Erigon, Teku:Besu, Lodestar:Nethermind). In our testing, we noticed that sync committee performance was sub-optimal (around 90%) when the third consensus client was Nimbus.

A fellow NO had also seen this, specifically when using Vouch as the validator client. This does not happen when using the Nimbus VC.

We replaced Nimbus with Teku, and performance recovered. Then we replaced Teku with Lodestar, and performance stayed good.

We’ve been talking to the Nimbus team. They are willing to help find the root cause of this interop issue. We cannot do that on Goerli, because Goerli is missing too many blocks for meaningful investigation - it only has about 80% participation to begin with.

We propose to track down the issue on mainnet, so that the client teams involved (Nimbus and Attestant/Vouch) can work together on a fix that benefits everyone down the road. To that end, we’d take one of our environments with 1,000 keys where sync committee duties are at the start of their cycle, swap Lodestar for Nimbus (leaving Lighthouse and Teku in place), and run with debug logs. We would then observe performance and, if it is degraded like it was before, gather the logs, swap Nimbus back out for Lodestar, and work with the client teams. We would repeat this as necessary as fixes are proposed or more testing is required.
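For anyone who wants to follow along with the "observe performance" step, here is a minimal sketch of how per-validator sync participation could be computed from the `sync_committee_bits` field of each block's sync aggregate. It assumes you have already collected the hex-encoded bitvectors (for example from a beacon node's block endpoint) and know the validator's position in the committee. The function names are ours for illustration and are not part of Vouch, Nimbus, or any other client.

```python
def sync_bit_set(sync_committee_bits_hex: str, position: int) -> bool:
    """Return True if the committee member at `position` was included.

    SSZ bitvectors pack bits little-endian within each byte, so bit i
    lives in byte i // 8 at bit offset i % 8.
    """
    raw = bytes.fromhex(sync_committee_bits_hex.removeprefix("0x"))
    if not 0 <= position < len(raw) * 8:
        raise ValueError("position outside the bitvector")
    return bool((raw[position // 8] >> (position % 8)) & 1)


def participation_rate(bits_per_block: list[str], position: int) -> float:
    """Fraction of observed blocks whose aggregate included `position`.

    Only blocks that were actually proposed appear in the input, so
    missed slots are not counted against the validator here.
    """
    if not bits_per_block:
        raise ValueError("no blocks observed")
    hits = sum(sync_bit_set(bits, position) for bits in bits_per_block)
    return hits / len(bits_per_block)
```

Comparing this rate between a run with Nimbus in the mix and a run without it would show whether the degradation reproduces, independent of any dashboard.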

This will deliberately degrade sync performance on that environment. We usually see between one and two validators at any given time in a sync committee per environment, so that’s the blast radius: Reduced performance on two validators for one sync committee cycle, or a little less, depending on how quickly we see an issue, per testing run.
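To put a rough number on that blast radius, here is a back-of-the-envelope sketch. It uses the mainnet constants (32 slots per epoch, 256 epochs per sync committee period, 12-second slots) and assumes the roughly 90% participation we saw earlier would recur and that misses are spread evenly over the period. The function name and parameters are illustrative only.

```python
SLOTS_PER_EPOCH = 32
EPOCHS_PER_SYNC_COMMITTEE_PERIOD = 256
SECONDS_PER_SLOT = 12

SLOTS_PER_PERIOD = SLOTS_PER_EPOCH * EPOCHS_PER_SYNC_COMMITTEE_PERIOD  # 8192


def missed_sync_slots(validators_in_committee: int,
                      participation: float,
                      fraction_of_period: float = 1.0) -> float:
    """Estimate sync committee messages missed during one test run.

    `participation` is the assumed degraded rate (0.90 in our earlier
    observation); `fraction_of_period` covers runs that are cut short.
    """
    if not 0.0 <= participation <= 1.0:
        raise ValueError("participation must be between 0 and 1")
    slots = SLOTS_PER_PERIOD * fraction_of_period
    return validators_in_committee * slots * (1.0 - participation)


# One full period lasts a bit over 27 hours.
period_hours = SLOTS_PER_PERIOD * SECONDS_PER_SLOT / 3600

# Worst case described above: two validators for one full period at 90%.
worst_case = missed_sync_slots(2, 0.90)  # roughly 1,638 missed messages
```

If we abort a run early, `fraction_of_period` scales the estimate down accordingly.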

Any objections?

10 Likes

While degraded performance is obviously not ideal, I think this kind of operation, which surfaces potential improvements in cross-client interactions, is a) a worthy goal, since better performance for certain client combinations leads to increased client diversity, and b) something that is not as easily done in smaller protocols and that the Lido protocol’s size enables. The socialized nature of rewards in the Lido protocol mutes the impact somewhat, and the overall effect of 1-2 validators having reduced sync performance isn’t anywhere near catastrophic.

Personally I would support this!

6 Likes

We (RockLogic) do this continuously with a portion of our keys. There is no other way to find bugs like this, especially with a high key count. Lido NOM is aware of this and even supports us in various ways.

Love to see other NOs recognize the benefits of investing their time to push cross-client compatibility!

5 Likes

Quick update on this.

We didn’t find the sync committee issue, but we found an attestation performance issue. It may be related to a resource leak in Nimbus, which should be fixed in 23.10 or 23.11.

We will test again when that release is out, and monitor the metric that indicates the leak.

2 Likes