The Certus One Ethereum validators, run as part of the Lido liquid staking protocol, failed to validate for approximately 6 hours on February 11, 2023. This reduced staking rewards on the Lido protocol.
Certus One was acquired by Jump Crypto in 2021.
Inside Google Kubernetes Engine (GKE)
- We run one Lighthouse and one Prysm beacon client each connected to their own geth instance.
- A single Vouch instance communicates with the two beacon clients and MEV-Boost.
- We have a cluster of three Dirk threshold signers that Vouch uses.
All times UTC
- 2023-02-11 14:55 `geth-lighthouse` restarted and started resyncing; it looked like it was catching up
- 2023-02-11 21:40 `geth-prysm` instance rebooted and started resyncing. The Prysm instance lost its connection and stopped syncing
- 2023-02-11 21:48 Alerted that `dirk` instances were signing fewer attestations than expected. The operator investigated the Dirk node, which seemed operational (all pods and processes were reporting up, but not progressing as expected)
- 2023-02-11 22:30 Notified in the Telegram room that Lido validators were down. Alerts for down pods did not fire, because all processes were up
- 2023-02-12 00:00 Operator started investigating, discovering issues on both nodes
- 2023-02-12 05:18 Validators back online
- The Prysm beacon client OOMed, restarted, and got stuck at block 16591518 (but reported up)
- The Lighthouse beacon client had OOMed earlier in the day and had yet to catch up
- The geth pods were repeatedly OOMing because of the `--cache` flag exceeding the pod memory limit of 16 GB
- Pointed the Lighthouse client at a geth node running on a dedicated node. Lighthouse returned to operation and the validator was able to start proposing blocks.
- Reduced the `--cache` amount from 8192 to 4096 on geth nodes and increased the pod memory limit.
- Adjusted the termination grace period for pods from 10 seconds to 5 minutes; this allows geth pods to flush sync data to persistent disk.
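The relationship between the `--cache` value and the pod memory limit can be expressed as a simple pre-deployment check. The following is only a sketch: the function name `cache_fits` and the one-third threshold are hypothetical and chosen for illustration, not a geth or Kubernetes recommendation, and the increased pod memory limit is not stated in this report, so the original 16 GB limit is used for both cases.

```python
def cache_fits(cache_mb: int, pod_limit_mb: int, max_cache_fraction: float) -> bool:
    """Return True if the geth --cache value (in MB) stays at or below
    the given fraction of the pod memory limit (in MB)."""
    return cache_mb <= pod_limit_mb * max_cache_fraction


POD_LIMIT_MB = 16 * 1024   # 16 GB pod memory limit, from the report
FRACTION = 1 / 3           # arbitrary illustrative threshold (assumption)

print(cache_fits(8192, POD_LIMIT_MB, FRACTION))  # False: original setting
print(cache_fits(4096, POD_LIMIT_MB, FRACTION))  # True: reduced setting
```

A check like this leaves headroom for memory that geth uses beyond its cache allowance; the right fraction depends on the workload and would need to be measured.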
Things to improve
- Alerting, escalation, and knowledge transfer from Telegram and Discord to the wider team (complete).
- Improve alerting on specific components in the stack, rather than final attestation (complete).
- Investigate splitting stake across two instances or moving to a hot/warm Vouch instance in multiple clusters (under consideration).
- Expand the number of beacon/consensus nodes that the Vouch instance is using, including nodes outside the GKE cluster (complete).
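During this incident every process reported up while the Prysm client sat at block 16591518, so liveness checks alone were not enough. One way to alert on a specific component is to watch whether its head keeps advancing. The sketch below assumes a hypothetical `ProgressMonitor` class and a 120-second stall threshold; both are illustrative and not the alerting that was actually deployed, and how the head value is collected from each client is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressMonitor:
    """Flags a component whose head has not advanced within max_stall_seconds."""
    max_stall_seconds: float
    last_head: Optional[int] = None
    last_advance_at: Optional[float] = None

    def observe(self, head: int, now: float) -> bool:
        """Record a sample; return True if the component looks stalled."""
        if self.last_head is None or head > self.last_head:
            # First sample, or the head moved forward: reset the stall timer.
            self.last_head = head
            self.last_advance_at = now
            return False
        # Head unchanged (or went backwards): stalled once the timer expires.
        return (now - self.last_advance_at) > self.max_stall_seconds


monitor = ProgressMonitor(max_stall_seconds=120)  # threshold is an assumption
print(monitor.observe(head=16591517, now=0))      # False: first sample
print(monitor.observe(head=16591518, now=12))     # False: head advanced
print(monitor.observe(head=16591518, now=200))    # True: no progress for 188 s
```

Because the check is driven by progress rather than process state, it would fire for a pod that is up but stuck.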
To compensate for the missed proposals and attestations during the outage, Jump Crypto reimbursed the Lido rewards vault 2.2939 ETH:
total value lost = 2 * penalties + missed rewards. Penalties are counted twice because they would have been rewards if not incurred, so the total value lost is the actual penalties plus the opportunity cost of not operating correctly.
- 0.9549 ETH in missed rewards
- 0.6695 ETH in penalties
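The reimbursement figure can be checked against the two components listed above. This is a minimal sketch; the function name `total_value_lost` is illustrative, and `Decimal` is used to avoid floating-point rounding in the ETH amounts.

```python
from decimal import Decimal


def total_value_lost(penalties: Decimal, missed_rewards: Decimal) -> Decimal:
    """Penalties count twice: once as the amount deducted, and once as the
    reward that would have been earned had the validators been online."""
    return 2 * penalties + missed_rewards


penalties = Decimal("0.6695")       # ETH, from the report
missed_rewards = Decimal("0.9549")  # ETH, from the report

print(total_value_lost(penalties, missed_rewards))  # 2.2939
```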