Infrastructure design
To put the changes section in context: we run “environments” of 1,000 keys each. They are placed in different geographic regions, mostly EMEA and APAC, with one in AMERS. Each environment runs one Vouch in k8s, five Dirk instances in a 3-of-5 threshold setup across five regions, and three CL:EL servers across three (EMEA, AMERS) or two (APAC) regions.
The three CL:EL servers run Lighthouse:Erigon, Teku:Besu, and Lighthouse:Nethermind, respectively.
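The layout above can be modelled in a few lines to reason about fault tolerance. This is a minimal sketch, not our actual Vouch or Dirk configuration: the `Environment` class, its field names, and the environment name are hypothetical, and only the counts and client pairs come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Hypothetical model of one environment; not a real config format."""
    name: str
    keys: int = 1000
    dirk_total: int = 5        # Dirk instances, one per region
    dirk_threshold: int = 3    # signers needed (3-of-5)
    cl_el_pairs: tuple = (
        ("Lighthouse", "Erigon"),
        ("Teku", "Besu"),
        ("Lighthouse", "Nethermind"),
    )

    def dirk_failures_tolerated(self) -> int:
        # Signing still works as long as `dirk_threshold` instances are up.
        return self.dirk_total - self.dirk_threshold

    def can_sign(self, dirk_online: int) -> bool:
        return dirk_online >= self.dirk_threshold

    def cl_clients(self) -> set:
        # Distinct consensus clients in this environment.
        return {cl for cl, _ in self.cl_el_pairs}


env = Environment(name="example-env")  # placeholder name
print(env.dirk_failures_tolerated())
print(env.can_sign(2), env.can_sign(3))
print(sorted(env.cl_clients()))
```

Under these assumptions an environment keeps signing with up to two Dirk instances offline, and it runs two distinct consensus clients across its three CL:EL servers.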
Infrastructure changes
Nethermind was updated to version 1.18.0
Besu was updated to version 23.4.0
An additional environment with 1,000 keys was activated in APAC
Incidents
On Sunday, April 30th, one validator key showed poor sync committee participation. The cause was pruning load from Nethermind 1.17.3 starving the Lighthouse CL on the same server. We took that server out of rotation for the Vouch holding the keys until the pruning run finished.
Client insights
Nethermind 1.18.0 was released on May 3rd. It introduces a new pruning parameter that uses RAM to reduce load. We tested pruning with it on May 4th: sync committee participation was not harmed and the prune finished in under 6 hours. This addresses the concerns raised by the April 30th incident.
Besu 23.4.0 was released on May 4th. Initial testing showed newPayloadV2 taking no more than a quarter of a second for 80% of blocks, and no more than a third of a second for 100% of blocks. This is a great improvement over the state of Besu after the Merge, and we have been happy with how it has been performing.
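Figures like the newPayloadV2 ones above are simply the share of blocks processed within a time limit. A minimal sketch of that calculation follows, assuming per-block latencies have already been collected in seconds. The `fraction_within` helper is hypothetical, and the sample values are made up for illustration; they are not our measurements.

```python
def fraction_within(latencies_s, limit_s):
    """Return the fraction of samples at or below limit_s (seconds)."""
    if not latencies_s:
        raise ValueError("no samples")
    return sum(1 for x in latencies_s if x <= limit_s) / len(latencies_s)


# Made-up per-block latencies in seconds, for illustration only.
samples = [0.12, 0.18, 0.20, 0.22, 0.24, 0.21, 0.19, 0.23, 0.29, 0.32]

print(f"within 1/4 s: {fraction_within(samples, 0.25):.0%}")
print(f"within 1/3 s: {fraction_within(samples, 1 / 3):.0%}")
```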
Testing
We have stood up a Tempo server to collect trace data from Vouch and see where it spends its time during block proposals. We have seen missed block proposals twice now and want to know whether there is something we can do about them.
We are testing Reth on Goerli. It is still pre-alpha but is making good progress.
We have been running Lodestar on Goerli. It could replace Lighthouse as the CL paired with Nethermind, giving Lodestar:Nethermind and increasing CL diversity. We are still observing for now.