Post mortem: Downtime of Lido validators (Gateway.fm AS)

As a representative of Gateway.fm AS, I would like to share the following incident post-mortem with the Lido DAO. The incident caused 7h 21m of downtime on 100 validators. No slashing was involved.

Gateway.fm AS takes full responsibility for the incident, and we will reimburse the DAO for the estimated value lost during the downtime. The estimate was calculated by the Lido DAO NOM workstream, whom we would also like to thank here.

Summary

A group of Lido validators missed attestations between Dec 6, 23:42 UTC and Dec 7, 07:04 UTC because some beacon nodes were offline.

Impact

100 validators were completely offline for 7h 21m during the incident. As a result, we received attestation penalties and missed potential rewards. The total loss was 0.1752 ETH.

Root Causes

  1. Nethermind consumed most of the instance memory, which caused the AWS health check to fail and took the node offline
  2. The alert rules failed to catch validator offline events in the new region

Resolution

We allocated less memory to the Nethermind job so that the hidden system job has enough resources.
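
The idea behind this fix can be sketched as a small calculation: reserve memory for system jobs first, then give the execution client a share of what remains. This is only an illustration. The function name, the instance sizes, and the reserved amounts below are hypothetical and are not taken from our actual configuration.

```python
def client_memory_limit_mib(instance_total_mib, system_reserved_mib, client_percent):
    """Memory limit (MiB) for the execution client after reserving room for system jobs.

    All inputs are illustrative assumptions, not real Gateway.fm AS settings.
    """
    if not 0 < client_percent <= 100:
        raise ValueError("client_percent must be in (0, 100]")
    available = instance_total_mib - system_reserved_mib
    if available <= 0:
        raise ValueError("system reservation leaves no memory for the client")
    # Integer arithmetic avoids floating-point rounding surprises.
    return available * client_percent // 100


# Two hypothetical instances with the same total memory but different
# system overhead: the same percentage yields different safe limits.
limit_a = client_memory_limit_mib(32768, 1024, 90)
limit_b = client_memory_limit_mib(32768, 4096, 90)
print(limit_a, limit_b)
```

Applying the percentage to the memory left after the system reservation, rather than to the instance total, keeps the headroom explicit when the platform overhead differs.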

Timeline

2023-12-06 10:18 UTC: Four Nethermind nodes went offline, and their jobs went into the pending state. No alerts were triggered, so the on-call person wasn’t notified.

2023-12-06 23:42 UTC: 100 Lido validators were activated on-chain. They started missing attestations immediately because the validator client had no working beacon nodes to connect to.

2023-12-07 00:48 UTC: The Lido team tried to reach the Gateway.fm team on Telegram, but no one from the Gateway.fm team saw the message.

2023-12-07 05:34 UTC: The Gateway.fm AS CEO confirmed on Telegram that the team would look into the issue ASAP.

2023-12-07 06:34 UTC: The infra team was informed and mitigated the issue immediately.

2023-12-07 07:04 UTC: The validator client started signing attestations again.
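
The gaps between these events can be derived directly from the timestamps above. The following is a minimal sketch using only the times listed in this timeline; the event labels are made up for the example. The timestamps are given to the minute, so the computed downtime can differ from the reported 7h 21m by about a minute.

```python
from datetime import datetime, timezone


def utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp from the timeline as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline; the keys are illustrative labels.
events = {
    "nodes_offline": utc("2023-12-06 10:18"),
    "validators_activated": utc("2023-12-06 23:42"),
    "lido_first_contact": utc("2023-12-07 00:48"),
    "ceo_acknowledged": utc("2023-12-07 05:34"),
    "attesting_again": utc("2023-12-07 07:04"),
}

# How long the node outage went unnoticed before validators were affected.
undetected = events["validators_activated"] - events["nodes_offline"]
# How long validators missed attestations (minute granularity).
downtime = events["attesting_again"] - events["validators_activated"]
# How long it took to acknowledge the Lido team's first message.
time_to_ack = events["ceo_acknowledged"] - events["lido_first_contact"]

print("undetected node outage:", undetected)
print("validator downtime:", downtime)
print("time to acknowledge:", time_to_ack)
```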

Lessons Learned

  • The same resource allocation percentage doesn’t always work across different cloud providers

Action Items

  • Duplicate all validator alert rules to the new region
  • Add a second execution client to the batch to avoid client-specific issues
  • Ensure we have a sign-off list for on-call alerts when we set up a new region
  • Create a way for the Lido team to trigger a Gateway.fm AS on-call alert
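
The alert-rule and sign-off items amount to a parity check between regions. Below is a minimal sketch, assuming the alert rules of each region can be exported as a set of rule names. The region names and rule names are hypothetical and do not reflect our actual monitoring setup.

```python
def missing_alert_rules(rules_by_region, reference_region):
    """Return {region: sorted rules present in the reference region but absent there}."""
    reference = rules_by_region[reference_region]
    gaps = {}
    for region, rules in rules_by_region.items():
        missing = reference - rules
        if missing:
            gaps[region] = sorted(missing)
    return gaps


# Hypothetical export: the new region is missing two of the reference rules.
rules_by_region = {
    "region-a": {"validator_offline", "beacon_node_down", "execution_client_down"},
    "region-b": {"beacon_node_down"},
}

gaps = missing_alert_rules(rules_by_region, "region-a")
print(gaps)
```

An empty result for a new region could serve as one entry on the sign-off list before validators are pointed at it.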

Validator loss calculation

Penalties 0.0519 ETH
Missed rewards 0.0714 ETH
Total 0.1752 ETH


We have sent 0.1752 ETH to the Lido EL Rewards vault. The transaction can be viewed on Etherscan.
