Post Mortem: Develp Fleet Downtime 2025/02/14

Name: Missed Attestations Due to Validator Attachment Issue
Date: 2025/02/14
Impacted servers: All servers
Impacted services: nimbus-eth2

Summary

On 14 February 2025, from 12:15 UTC to 13:38 UTC, all validators were offline due to a combination of three major factors:

  • Design flaw: the delay in Web3Signer loading validators was not taken into account.
    • Beacon node needs to wait for Web3Signer to finish loading validators before it can start.
  • Deployment performed on hosts manually without merging the relevant PR.
    • Upgrades of both execution and consensus layer nodes were required to apply new gas limits.
  • Miscommunication between the old support team and the new team taking over its duties.
    • A separate PR related to the paging service was merged accidentally, resulting in a rollback of the upgrade.

Due to a gas limit increase, we needed to update Nimbus across all instances. However, because of the dependency issue, this update had to be performed manually.

On 13 February 2025, the manual deployment was completed, but the associated pull request was not merged into the main branch. The following day, a separate issue arose with the paging system that needed urgent resolution.

The previous team merged a fix for the paging service for alarms at 12:11 UTC, while the new team had not yet merged the manually deployed changes. As a result, the manually deployed upgrade was rolled back to the previous deployment, restarting services and preventing validators from loading.

Timeline UTC

  • 2025-02-13 - Manual deployment of Nimbus for gas limit update completed, but the PR was not merged.
  • 2025-02-14 - Use of an incorrect secret for the paging service caused false alarms, which required an urgent fix.
    • 12:11 - The previous team merged a change, unaware that the new team had not merged the upgrade PR.
    • 12:14 - Unintended rollback restarted services, causing validators to fail to load due to the Web3Signer delay.
    • 12:56 - Issue detected.
    • 12:57 - Beacon nodes restarted manually to load validators from Web3Signer.
    • 13:38 - Lido confirmed validators were online again.
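
For reference, the durations implied by this timeline can be worked out as below. This is only an arithmetic sketch: the timestamps come from the summary and timeline above, while the slot and epoch constants are the standard Ethereum mainnet values and assume the affected validators run on mainnet.

```python
from datetime import datetime

# Ethereum mainnet timing constants (assumption: validators are on mainnet).
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32


def at(hhmm: str) -> datetime:
    """Parse an HH:MM UTC time on the day of the incident."""
    return datetime.strptime(f"2025-02-14 {hhmm}", "%Y-%m-%d %H:%M")


rollback = at("12:14")      # unintended rollback restarted services
offline_from = at("12:15")  # start of downtime per the summary
detected = at("12:56")      # issue detected
restored = at("13:38")      # Lido confirmed validators online

downtime_min = (restored - offline_from).total_seconds() / 60
time_to_detect_min = (detected - rollback).total_seconds() / 60
time_to_recover_min = (restored - detected).total_seconds() / 60

# Each validator attests once per epoch, so this approximates
# the number of missed attestations per validator.
epochs_offline = downtime_min * 60 / (SECONDS_PER_SLOT * SLOTS_PER_EPOCH)

print(f"downtime: {downtime_min:.0f} min")
print(f"time to detect: {time_to_detect_min:.0f} min")
print(f"time from detection to recovery: {time_to_recover_min:.0f} min")
print(f"epochs offline: {epochs_offline:.2f}")
```

With the times above this gives 83 minutes of downtime, 42 minutes from rollback to detection, and roughly 13 epochs offline, so detection took about as long as recovery.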

Root Causes

  1. Technical Dependency Issue - Web3Signer loads keys too slowly, causing Nimbus to start without validators.
  2. Manual Process Risks - The Nimbus update (required by the gas limit increase) had to be done manually because of unresolved dependency problems.
  3. Team Transition & Communication Breakdown - Misalignment between the old and new teams led to an unintended rollback.
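
To illustrate the first root cause, here is a minimal sketch of a readiness gate that blocks until Web3Signer reports the expected number of keys, so the beacon node is only started afterwards. It is not the fix that was deployed. The address, the expected key count, and the function names are hypothetical, and it assumes Web3Signer's ETH2 public key listing endpoint (`/api/v1/eth2/publicKeys`) is reachable on the host.

```python
import json
import time
import urllib.request
from typing import Callable, List

# Hypothetical local Web3Signer address; adjust to the real deployment.
WEB3SIGNER_URL = "http://127.0.0.1:9000"


def fetch_public_keys(base_url: str = WEB3SIGNER_URL) -> List[str]:
    """Return the public keys Web3Signer reports as loaded (assumed endpoint)."""
    url = f"{base_url}/api/v1/eth2/publicKeys"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_keys(
    expected: int,
    fetch: Callable[[], List[str]] = fetch_public_keys,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until at least `expected` keys are loaded or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        try:
            if len(fetch()) >= expected:
                return True
        except (OSError, ValueError):
            # Connection refused or an incomplete response:
            # Web3Signer is still starting, so keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A start script could call `wait_for_keys(expected=N)` and exit non-zero on `False`, so that the beacon node never starts with zero validators attached without anyone noticing.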

Action Items

  • Reimburse Lido for 1.323 ETH in missed rewards.
    • Transaction: 0xfc23a84ea84e38dc24fc990c2ef004424a64c13fcd9658d5f9f35d663b77bfa4
  • Fix the dependency issue to ensure proper service loading order.
    • Implemented service state detection using systemd-notify and the notify service type.
  • Implement process changes to avoid manual deployments in similar cases.
    • Split mainnet and other hosts to auto-apply main branch changes to non-mainnet hosts.
  • Improve monitoring and alerting to detect similar issues more quickly.
    • Adjusted alerting rules for validator count and missed attestations.
  • Establish clear team responsibilities to prevent miscommunication during transition.
    • Clarified that only the new team is responsible for merging and deploying changes.
  • Implement better testing that shows in the PR what will be changed and restarted.
  • Implement better CI deployments for individual hosts to avoid manual deployments.
  • Configure Cachix deploy notifications indicating which host was deployed to.
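
The service state detection item relies on systemd's notify protocol: a unit of type notify is only considered started once it sends `READY=1` to the socket named in the `NOTIFY_SOCKET` environment variable, and units ordered after it wait until then. The sketch below shows that protocol in a few lines. It illustrates the mechanism only, not the code that was actually deployed, and the function name is hypothetical.

```python
import os
import socket


def sd_notify(state: str = "READY=1") -> bool:
    """Send a state string to systemd's notification socket, if one is set.

    Returns False when NOTIFY_SOCKET is absent, for example when the
    process was not started by systemd as a notify-type service.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

A wrapper around the signer could call `sd_notify()` only after all validator keys have finished loading, which would keep a dependent beacon node unit from starting early.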

Conclusion

The incident highlighted weaknesses in our dependency management, deployment process, and team transition strategy. By addressing these issues, we aim to prevent similar disruptions in the future.
