Name: Missed Attestations Due to Validator Attachment Issue
Date: 2025/02/14
Impacted servers: All servers
Impacted services: nimbus-eth2
Summary
On 14 February 2025, from 12:15 UTC to 13:38 UTC, all validators were offline due to a combination of three major factors:
- Design flaw: the delay in loading validators by Web3Signer was not taken into account.
  - The beacon node needs to wait for Web3Signer to finish loading validators before it can start.
- Deployment performed on hosts manually without merging the relevant PR.
  - Upgrades of both execution and consensus layer nodes were required to apply new gas limits.
- Miscommunication between the old support team and the new team taking over its duties.
  - A separate PR related to the paging service was merged accidentally, resulting in a rollback of the upgrade.
Due to a gas limit increase, we needed to update Nimbus across all instances. However, because of the dependency issue, this update had to be performed manually.
On 13 February 2025, the manual deployment was completed, but the associated pull request was not merged into the main branch. The following day, a separate issue arose with the paging system, which needed an urgent resolution.
The previous team merged a fix for the paging service at 12:11 UTC, while the new team had not yet merged the manually deployed changes. As a result, the manual deployment was rolled back to the previous version, which restarted services and prevented validators from loading.
Timeline UTC
- 2025-02-13 - Manual deployment of Nimbus for gas limit update completed, but the PR was not merged.
- 2025-02-14 - Use of incorrect secret for paging service caused false alarms, which required an urgent fix.
- 12:11 - The previous team merged a change, unaware that the new team had not merged the upgrade PR.
- 12:14 - Unintended rollback restarted services, causing validators to fail to load due to Web3Signer delay.
- 12:56 - Issue detected.
- 12:57 - Beacon nodes restarted manually to load validators from Web3Signer.
- 13:38 - Lido confirmed validators were online again.
Root Causes
- Technical Dependency Issue - Web3Signer loads keys too slowly, causing Nimbus to start without validators.
- Manual Process Risks - The Nimbus update (due to the gas limit increase) had to be done manually due to unresolved dependency problems.
- Team Transition & Communication Breakdown - Misalignment between old and new teams led to unintended rollback.
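
The first root cause comes down to a readiness check: the beacon node should only start once the signer reports the expected number of keys. A minimal sketch of such a check follows. It assumes Web3Signer lists its loaded keys as a JSON array at /api/v1/eth2/publicKeys; the function names, the expected key count, and the address in the usage comment are hypothetical and not taken from the actual deployment.

```python
import json
import time
import urllib.request
from typing import Callable, List

# Assumption: the signer returns its loaded public keys as a JSON list at this path.
PUBLIC_KEYS_PATH = "/api/v1/eth2/publicKeys"


def fetch_public_keys(base_url: str, timeout: float = 5.0) -> List[str]:
    """Return the public keys the signer currently reports as loaded."""
    with urllib.request.urlopen(base_url + PUBLIC_KEYS_PATH, timeout=timeout) as resp:
        return json.load(resp)


def wait_for_keys(
    fetch: Callable[[], List[str]],
    expected: int,
    deadline_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll until at least `expected` keys are loaded, else raise TimeoutError."""
    start = clock()
    while True:
        try:
            loaded = len(fetch())
        except OSError:
            # Signer is not accepting connections yet; treat as nothing loaded.
            loaded = 0
        if loaded >= expected:
            return loaded
        if clock() - start >= deadline_s:
            raise TimeoutError(
                f"only {loaded}/{expected} keys loaded after {deadline_s}s"
            )
        sleep(interval_s)


# Hypothetical usage before starting the beacon node:
# wait_for_keys(lambda: fetch_public_keys("http://127.0.0.1:9000"), expected=100)
```

Passing the fetch, sleep, and clock functions as parameters keeps the polling logic testable without a running signer.
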
Action Items
- Reimburse Lido for 1.323 ETH in missed rewards.
  - Transaction: 0xfc23a84ea84e38dc24fc990c2ef004424a64c13fcd9658d5f9f35d663b77bfa4
- Fix the dependency issue to ensure proper service loading order.
  - Implemented service state detection using systemd-notify and the notify service type.
- Implement process changes to avoid manual deployments in similar cases.
  - Split mainnet and other hosts to auto-apply main branch changes to non-mainnet hosts.
- Improve monitoring and alerting to detect similar issues more quickly.
  - Adjusted alerting rules for validator count and missed attestations.
- Establish clear team responsibilities to prevent miscommunication during transition.
  - Clarified only the new team is responsible for merging and deploying changes.
- Implement better testing that shows in the PR what will be changed and restarted.
- Implement better CI deployments for individual hosts to avoid manual deployments.
- Configure Cachix deploy notifications indicating which host was deployed to.
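
The systemd-notify item above relies on the systemd readiness protocol: with the notify service type, systemd treats a unit as started only after the process sends READY=1 to the socket named in the NOTIFY_SOCKET environment variable, so units ordered after it wait until then. The sketch below shows that message being sent from Python. It illustrates the protocol only and is not the implementation used on the hosts; the function name sd_notify is chosen here for illustration.

```python
import os
import socket


def sd_notify(state: str = "READY=1") -> bool:
    """Send a state string to systemd's notification socket.

    Returns False when NOTIFY_SOCKET is unset (the process was not started
    by systemd with a notify-type unit), True once the datagram is sent.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

A wrapper around the signer could call sd_notify() only after its keys have finished loading, which is the point at which dependent services are safe to start.
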
Conclusion
The incident highlighted weaknesses in our dependency management, deployment process, and team transition strategy. By addressing these issues, we aim to prevent similar disruptions in the future.