CryptoManufaktur - Week 18 2023

Infrastructure design

To put the change section in context: we run “environments” of 1,000 keys each. They are placed in different geographic regions, mostly EMEA and APAC, with one in AMERS. Each environment runs one Vouch in k8s, five Dirk instances in a 3/5 threshold setup across five regions, and three CL:EL servers across three (EMEA, AMERS) or two (APAC) regions.

The three CL:EL servers run Lighthouse:Erigon, Teku:Besu and Lighthouse:Nethermind respectively.
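To make the 3/5 threshold concrete, here is a minimal sketch that models one environment as plain data. The class and method names are hypothetical and only illustrate the numbers above; this is not Dirk's or Vouch's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Illustrative model of one environment; all names are hypothetical."""
    keys: int = 1_000
    dirk_total: int = 5
    dirk_threshold: int = 3
    cl_el_pairs: tuple = (
        "Lighthouse:Erigon",
        "Teku:Besu",
        "Lighthouse:Nethermind",
    )

    def can_sign(self, dirk_up: int) -> bool:
        # Signing needs at least `dirk_threshold` reachable Dirk instances.
        return dirk_up >= self.dirk_threshold

    def tolerated_dirk_failures(self) -> int:
        # How many Dirk instances can be lost before signing stops.
        return self.dirk_total - self.dirk_threshold


env = Environment()
print(env.can_sign(3), env.can_sign(2), env.tolerated_dirk_failures())
```

With five Dirk instances and a threshold of three, an environment keeps signing with up to two instances unreachable.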

Infrastructure changes

Nethermind was updated to version 1.18.0.

Besu was updated to version 23.4.0.

An additional environment with 1,000 keys was activated in APAC.

Incidents

On Sunday April 30th, one validator key experienced bad sync committee participation. This was caused by Nethermind 1.17.3 pruning load starving the Lighthouse CL. This server was taken out of rotation for the Vouch holding the keys, until the pruning run finished.

Client insights

Nethermind 1.18.0 was released on May 3rd. It introduces a new pruning parameter that uses RAM to reduce load. We tested pruning on it on May 4th. Sync committee participation was not harmed and the prune finished in under 6 hours. This addresses the concerns that the incident on April 30th raised.

Besu 23.4.0 was released on May 4th. Initial testing showed newPayloadV2 taking no more than a quarter of a second for 80% of blocks, and no more than a third of a second for 100% of blocks. This is a great improvement over the state of Besu after the Merge. We have been happy with how it’s been performing.
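A statement like "no more than X seconds for Y% of blocks" can be checked against collected timings with a few lines. This is a minimal sketch; the function name is hypothetical and the sample values are made up for illustration, not our measured Besu data.

```python
def fraction_within(latencies_s, limit_s):
    """Return the share of samples that completed within limit_s seconds."""
    if not latencies_s:
        raise ValueError("no samples")
    return sum(1 for x in latencies_s if x <= limit_s) / len(latencies_s)


# Hypothetical newPayloadV2 timings in seconds (NOT measured data).
samples = [0.12, 0.18, 0.21, 0.24, 0.31]

print(fraction_within(samples, 0.25))   # share within a quarter of a second
print(fraction_within(samples, 1 / 3))  # share within a third of a second
```

In practice the timings would come from the EL's logs or metrics rather than a hard-coded list.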

Testing

We have stood up a Tempo server so we can get trace data from Vouch, to see where it spends its time during block proposals. We’ve seen missed block proposals twice now and wish to know whether there is something we can do about it.

We are testing Reth on Goerli. It is still pre-alpha and making good progress.

We have been running Lodestar on Goerli. It could replace Lighthouse for the Nethermind CL:EL, becoming Lodestar:Nethermind and increasing CL diversity. Still observing for now.

This kind of weekly rundown and the one @stefa2k just recently submitted as a part of the slashing incident updates are really useful!

I think it would be cool if the community coordinated on some shared expectations for these reports and made them available as an open resource. Not all NOs may feel comfortable sharing such details of their setups, but for those that are, it would be invaluable. It would be great to make NO operations a lot less “blackbox-y” to the outside world, as well as to perhaps increase inter-NO visibility into things like client hiccups or correlated issues across setups.

It could also include “status updates” for things like downtime and incidents. Perhaps some kind of open source git repository (treated almost like a file-based database) where NOs can submit .json files, with a frontend (e.g. Netlify, Vercel, etc.) that just autobuilds every few minutes if there are new files. There could be “ad hoc” submissions for things that are event-based (e.g. incidents / outages) and change reporting on a slightly less rapid schedule (1 month?).
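As a rough illustration of what a submitted .json file and a repository-side check could look like, here is a minimal sketch. The field names (`operator`, `kind`, `date`, `summary`) and the allowed kinds are assumptions made up for this example, since no schema has been agreed on; the example report reuses the April 30th incident described earlier in the thread.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"operator": str, "kind": str, "date": str, "summary": str}
ALLOWED_KINDS = {"incident", "change"}


def validate_report(raw):
    """Return a list of problems found in one submitted report (empty if OK)."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["invalid JSON: " + exc.msg]
    if not isinstance(report, dict):
        return ["top-level value must be an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in report:
            errors.append("missing field: " + name)
        elif not isinstance(report[name], expected):
            errors.append("wrong type for field: " + name)
    kind = report.get("kind")
    if isinstance(kind, str) and kind not in ALLOWED_KINDS:
        errors.append("unknown kind: " + kind)
    return errors


example = json.dumps({
    "operator": "CryptoManufaktur",
    "kind": "incident",
    "date": "2023-04-30",
    "summary": "Bad sync committee participation on one validator key "
               "during a Nethermind pruning run.",
})
print(validate_report(example))
```

A check like this could run in CI on every pull request, so the frontend only ever builds from well-formed files.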

I would imagine that LEGO could help support this via a grant as well!

That could be really helpful, yeah. Change reporting monthly or “as warranted” sounds good. There might not be anything happening in a given month; or maybe there’s a fresh client release that warrants mentioning even though a month hasn’t passed, because it brings something useful.

Week 23 2023

Infrastructure changes

An additional environment with 1,000 keys was activated in EMEA.

We replaced Lighthouse with Nimbus on one of the nodes per environment, making it Lighthouse:Erigon, Teku:Besu and Nimbus:Nethermind.

We have started a rolling resync of our Nethermind nodes on the new Nethermind v1.19.1. This will take 1-2 months to complete.

We rolled out lido-validator-monitor to have our own copy in addition to the one Lido maintains.

Incidents

We saw 3 missed blocks, all in APAC environments. We have been working with Jim McDonald at Attestant on getting to the bottom of it.

As a result, we removed relays that were consistently unable to reply in time in APAC due to latency.

We explicitly configured the best strategy for blindedblockproposal in Vouch. We previously only had it explicitly configured for blockproposal.

We are going to move Vouch in our “Lido 3” environment from Tokyo, where it runs now, to Sydney, to reduce latency between it and two of the CL:EL nodes.

Testing

We continue to test Reth.

Week 30 2023

Infrastructure changes

An additional environment with 1,000 keys was activated in APAC.

We replaced Nimbus with Lodestar and then Teku on one of the nodes per environment, making it Lighthouse:Erigon, Teku:Besu and Teku:Nethermind.

Resync of all eth-lidox-c nodes on Nethermind >= 1.19 has been completed.

We deployed a central Loki logging server and have started to direct logs to it from all environments.

Incidents

We had a brief outage of 1,000 keys, caused by a failure of CL monitoring combined with a Vouch restart. We fixed our monitoring so it can detect when a server is down entirely and no longer sending metrics, and added a backlog item to see whether we can assist Jim in changing Vouch so it’ll start when one of its CLs is down.
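The monitoring gap here is the classic "no data is not the same as healthy" problem: a server that is down entirely sends no metrics, so threshold alerts on metric values never fire. A minimal sketch of a staleness check follows; the function name, host names and the 120-second limit are hypothetical and do not reflect our actual alerting rules.

```python
import time


def stale_servers(last_seen, now=None, max_age_s=120.0):
    """Return servers whose most recent metric is older than max_age_s.

    last_seen maps a server name to the Unix timestamp of its last sample.
    """
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age_s)


# Hypothetical hosts and timestamps, for illustration only.
last_seen = {"cl-a": 1000.0, "cl-b": 1290.0, "cl-c": 1295.0}
print(stale_servers(last_seen, now=1300.0))
```

Alerting on the age of the last sample catches a server that has gone silent, which a check on the sample's value alone cannot do.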

Client insights

What prompted our move from Nimbus to Lodestar and then from Lodestar to Teku on the eth-lidox-c nodes was sub-optimal performance. Vouch to Nimbus had bad sync participation, root cause as yet unknown. Vouch to Lodestar had bad attestation performance. Switching that server to Teku resolved both issues.

Testing

We have a Reth archive node running on Ethereum mainnet, on Reth’s alpha build.

Week 41 2023

Infrastructure changes

An additional environment with 1,000 keys was activated in EMEA.

This brings us to a limit of 11,000 keys with ~10,000 active. We expect this to be “final size” unless/until Wave 5 either gets to ~10,000 with Wave 6 not yet active, or there is a large influx that Wave 5 cannot handle. The latter is exceedingly unlikely.

We replaced Teku with Lodestar on all -c servers, and feel comfortable with the performance of Lodestar now.

We replaced Erigon with Nethermind on all -a servers, making it Lighthouse:Nethermind, Teku:Besu and Lodestar:Nethermind. This was done because we were running out of disk space on our servers with a pruned Erigon deployment.

Incidents

We saw more missed blocks than we are comfortable with, though it’s still in the single digits. We’ve been troubleshooting this with relays. Because we deploy in APAC, when the relay fails to deliver a blinded block and we have to fall back to a locally built block, we are at risk of missing the proposal.

Client insights

More Nimbus testing was done. The Nimbus team has identified a resource leak that will be fixed in 23.10 or 23.11. We will test Nimbus again with 1,000 keys once that release is available.

Nimbus does not build a local block in parallel, potentially exacerbating missed blocks when the blinded block is not delivered. As a first step, they reduced local block building on demand from ~1s to ~200ms.
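To show why building locally in parallel matters, here is a minimal asyncio sketch: the local build starts at the same time as the relay request, so if the relay times out the fallback block is already done or nearly done. The function names, delays and timeout are hypothetical stand-ins; this is not how Nimbus or Vouch actually implement proposals.

```python
import asyncio


async def propose(fetch_relay_block, build_local_block, relay_timeout_s):
    """Prefer the relay's block, but build a local one concurrently."""
    # Start the local build right away instead of waiting for the relay to fail.
    local_task = asyncio.ensure_future(build_local_block())
    try:
        block = await asyncio.wait_for(fetch_relay_block(), relay_timeout_s)
    except Exception:
        # Relay timed out or errored: use the block that was building meanwhile.
        return "local", await local_task
    local_task.cancel()
    return "relay", block


# Hypothetical stand-ins; the sleeps simulate network and build latency.
async def slow_relay():
    await asyncio.sleep(1.0)
    return "blinded-block"


async def fast_relay():
    await asyncio.sleep(0.01)
    return "blinded-block"


async def local_build():
    await asyncio.sleep(0.05)
    return "local-block"


print(asyncio.run(propose(slow_relay, local_build, 0.1)))
print(asyncio.run(propose(fast_relay, local_build, 0.1)))
```

With a sequential fallback, the local build time is added on top of the relay timeout; with the parallel version, the cost of a relay miss is bounded by whichever of the two takes longer.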

Lodestar has been doing well on the -c servers; the performance looks good.

Lighthouse has a memory leak somewhere, causing recurring OOM. This has also been seen by others. We are troubleshooting with the Lighthouse team.

Testing

Reth alpha testing continues. An issue that caused haproxy to mark Reth as down when it wasn’t was found on Goerli and fixed by the Reth team.

We deployed 20,000 keys on Holesky as part of the genesis set.
