CryptoManufaktur - Week 18 2023

Infrastructure design

To put the change section in context: we run “environments” of 1,000 keys each. They are placed in different geographic regions, mostly EMEA and APAC, with one in AMERS. Each environment runs one Vouch in k8s, five Dirk in a 3/5 threshold across five regions, and three CL:EL servers across three (EMEA, AMERS) or two (APAC) regions.

The CL:EL pairs run Lighthouse:Erigon, Teku:Besu, and Lighthouse:Nethermind, respectively.
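For readers unfamiliar with the Vouch/Dirk split, a minimal sketch of the Vouch side of such a setup follows. All hostnames, ports, and certificate paths are placeholders, and note that the 3/5 signing threshold itself is a property of the distributed accounts on the Dirk side, not of this file; check vouch-ex.yml in the attestantio/vouch repository for the authoritative keys for your version.

```yaml
# vouch.yml (sketch) -- account manager pointing at five Dirk instances.
# Hostnames and certificate paths below are placeholders, not our real ones.
accountmanager:
  dirk:
    endpoints:
      - 'dirk-emea-1.example.com:13141'
      - 'dirk-emea-2.example.com:13141'
      - 'dirk-apac-1.example.com:13141'
      - 'dirk-apac-2.example.com:13141'
      - 'dirk-amers-1.example.com:13141'
    # Mutual TLS between Vouch and Dirk.
    client-cert: 'file:///config/certs/vouch.crt'
    client-key: 'file:///config/certs/vouch.key'
    ca-cert: 'file:///config/certs/dirk-ca.crt'
    accounts:
      - Validators
```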

Infrastructure changes

Nethermind was updated to version 1.18.0.

Besu was updated to version 23.4.0.

An additional environment with 1,000 keys was activated in APAC.

Incidents

On Sunday, April 30th, one validator key experienced poor sync committee participation. This was caused by Nethermind 1.17.3’s pruning load starving the Lighthouse CL. The server was taken out of rotation for the Vouch holding the keys until the pruning run finished.

Client insights

Nethermind 1.18.0 was released on May 3rd. It introduces a new pruning parameter that trades RAM for reduced pruning load. We tested pruning with it on May 4th: sync committee participation was not harmed, and the prune finished in under 6 hours. This addresses the concerns raised by the April 30th incident.
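For operators who want to try this: we believe the relevant knob is the full-pruning memory budget. A docker-compose style sketch follows, using Nethermind’s NETHERMIND_&lt;SECTION&gt;CONFIG_&lt;KEY&gt; environment-variable convention; the service name and the value are illustrative assumptions, so check the Nethermind 1.18.0 release notes for the exact parameter name and units.

```yaml
# docker-compose sketch (service name illustrative).
services:
  execution:
    image: nethermind/nethermind:1.18.0
    environment:
      # Assumption: full-pruning memory budget in bytes (~16 GB here).
      # Giving the pruner more RAM reduces disk I/O contention with the CL.
      - NETHERMIND_PRUNINGCONFIG_FULLPRUNINGMEMORYBUDGET=16000000000
```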

Besu 23.4.0 was released on May 4th. Initial testing showed newPayloadV2 taking no more than 1/4 of a second for 80% of blocks, and no more than 1/3 of a second for 100% of blocks. This is a great improvement over Besu’s state right after the Merge, and we have been happy with how it has been performing.

Testing

We have stood up a Tempo server so we can collect trace data from Vouch and see where it spends its time during block proposals. We have seen missed block proposals twice now and want to know whether there is something we can do about it.
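For anyone wanting to do the same: Vouch can emit OpenTelemetry traces, and Tempo accepts OTLP. A minimal sketch of the Vouch side follows, assuming the tracing block from recent vouch-ex.yml examples; key names may differ between Vouch versions, and the address is a placeholder.

```yaml
# vouch.yml (sketch) -- ship OTLP trace data to a Tempo instance.
# 4317 is the conventional OTLP gRPC port; the hostname is a placeholder.
tracing:
  address: 'tempo.example.com:4317'
```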

We are testing Reth on Goerli. It is still pre-alpha and making good progress.

We have been running Lodestar on Goerli. It could replace Lighthouse in the Nethermind pairing, becoming Lodestar:Nethermind and increasing CL diversity. Still observing for now.


This kind of weekly rundown, and the one @stefa2k recently submitted as part of the slashing incident updates, are really useful!

I think it would be cool if the community coordinated on some shared expectations and made this available as an open resource. Not all NOs may feel comfortable sharing such details of their setups, but for those that are, it would be invaluable. It would be great to make NO operations a lot less “blackbox-y” to the outside world, and to increase inter-NO visibility into things like client hiccups or correlated issues across setups.

It could also include “status updates” for things like downtime and incidents. Perhaps some kind of open-source git repository (treated almost like a file-based database) where NOs can submit .jsons, with a frontend (e.g. netlify/vercel/etc.) that autobuilds every few minutes when there are new files. There could be “ad hoc” submissions for event-based items (e.g. incidents/outages) and change reporting on a slightly less rapid schedule (1 month?). A sketch of what such a submission might look like is below.
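To make the idea concrete, here is one possible shape for such a submission file, populated with the April 30th incident from the post above (shown as YAML for readability; the on-disk format would be the proposed .json). Every field name and the path are hypothetical, purely to illustrate the “file-based database” idea:

```yaml
# incidents/2023-04-30-cryptomanufaktur.yml (hypothetical path and schema)
operator: CryptoManufaktur
type: incident            # incident | outage | change
date: "2023-04-30"
summary: >
  Poor sync committee participation on one key; Nethermind 1.17.3
  pruning load starved the Lighthouse CL.
status: resolved
resolution: >
  Server removed from Vouch rotation until the pruning run finished.
```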

I would imagine that LEGO could help support this via a grant as well!


That could be really helpful, yeah. Change reporting monthly or “as warranted” sounds good. There might not be anything happening in a given month; or a fresh client release might warrant a mention before the month is up, because it brings something useful.


Week 23 2023

Infrastructure changes

An additional environment with 1,000 keys was activated in EMEA.

We replaced Lighthouse with Nimbus on one of the nodes per environment, making the pairings Lighthouse:Erigon, Teku:Besu, and Nimbus:Nethermind.

We have started a rolling resync of our Nethermind nodes on the new Nethermind v1.19.1. This will take 1-2 months to complete.

We rolled out lido-validator-monitor to have our own copy in addition to the one Lido maintains.

Incidents

We saw 3 missed blocks, all in APAC environments. We have been working with Jim McDonald at Attestant on getting to the bottom of it.

As a result, we removed relays in APAC that were consistently unable to reply in time due to latency.

We explicitly configured the best strategy for blindedblockproposal in Vouch; previously we only had it explicitly configured for blockproposal. A config sketch is shown below.

We are going to move Vouch in our “Lido 3” environment from Tokyo to Sydney, to reduce latency between it and two of the CL:EL nodes.
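For reference, a sketch of the two strategy stanzas in vouch.yml, assuming the strategy names and the best style as described in the Vouch documentation; consult vouch-ex.yml for the full set of per-strategy options in your version.

```yaml
# vouch.yml (sketch) -- use the "best" strategy for both regular and
# blinded (relay-built) block proposals, not just the former.
strategies:
  blockproposal:
    style: best
  blindedblockproposal:
    style: best
```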

Testing

We continue to test Reth.
