Post Mortem: Staking Facilities downtime 2025/01/22

With this post mortem Staking Facilities wants to inform the community and Lido DAO about an incident that impacted validators from the curated module.

We stand by our responsibility as node operators and are willing to reimburse the value lost as a result of this incident.

Incident Summary

On Wednesday, the 22nd of January 2025 at 09:22 AM (UTC+1), one of our Lido clusters experienced downtime of approximately 60 minutes.
The incident was fully resolved at 10:21 AM (UTC+1).

Root Cause

The root cause of the outage was a pipeline run in our CI/CD stack that deployed misconfigured variables from an Ansible playbook, rendering the validator client of this cluster (Vouch) unable to connect to its key manager (Dirk).
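As an illustration of the failure mode (not our actual tooling or configuration), a single wrong templated value such as a broken Dirk endpoint could be caught by a lightweight sanity check on the rendered configuration before the pipeline restarts the client. The function and endpoint values below are assumptions for the sketch:

```python
# Hypothetical pre-deploy sanity check on rendered Vouch settings:
# validate the Dirk "host:port" endpoints before restarting the client.
# Endpoint values are illustrative, not real infrastructure.
def check_dirk_endpoints(endpoints: list[str]) -> list[str]:
    """Return a list of problems found in the configured Dirk endpoints."""
    problems = []
    if not endpoints:
        problems.append("no Dirk endpoints configured")
    for ep in endpoints:
        host, sep, port = ep.partition(":")
        if not host or not sep or not port.isdigit() or not 0 < int(port) < 65536:
            problems.append(f"malformed endpoint: {ep!r}")
    return problems

# A templating accident that drops the port is flagged immediately:
print(check_dirk_endpoints(["dirk1.example.internal:13141",
                            "dirk2.example.internal"]))
# → ["malformed endpoint: 'dirk2.example.internal'"]
```

Failing the pipeline on a non-empty problem list would stop a bad render from ever reaching the validator client.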

Impact

The cluster downtime affected 4000 validators from the curated set and lasted for 60 minutes.

Resolution

As soon as we became aware of the situation, our first responder inspected the systems and quickly brought in two other nodeOps colleagues to investigate the issue together.
We quickly narrowed the issue down to the connection between Vouch and Dirk and checked the configuration files.
Once we found that the pipeline had accidentally altered the Vouch configuration with a single wrong value, we restored the previous configuration file and got the cluster back up and running.
The resolution itself was straightforward, but we were slower than usual to react because our monitoring/alerting stack failed us on the initial critical attestation alert: that alert was resolved in PagerDuty by the integration API within seconds, before it could page anyone.
As a result, the issue went unnoticed until our sync committee alert went off.

Timeline

  • 2025/01/22 09:20 UTC+1: CI/CD pipeline triggered
  • 2025/01/22 09:22 UTC+1: Incident started
  • 2025/01/22 09:37 UTC+1: Critical attestation alert received, but wrongly auto-resolved before being noticed
  • 2025/01/22 09:51 UTC+1: Critical SyncCommittee alert received
  • 2025/01/22 09:53 UTC+1: First responder from the nodeOps team investigating the issue
  • 2025/01/22 09:57 UTC+1: Incident escalated to second nodeOps colleague for further debugging
  • 2025/01/22 10:05 UTC+1: Root cause identified
  • 2025/01/22 10:08 UTC+1: Further escalation internally - 3 nodeOps professionals on the issue
  • 2025/01/22 10:18 UTC+1: Fix applied to the configuration
  • 2025/01/22 10:21 UTC+1: Incident resolved

Takeaways

After triggering CI/CD pipelines we will check the stability of the affected systems more carefully to prevent accidents like this.

Action items

  • we fixed the misconfiguration in the playbook
  • we implemented additional safeguards to prevent accidental configuration changes
  • we are reviewing and improving our monitoring and alerting
  • we raised awareness for working with automation pipelines
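The last two action items could be backed by a post-pipeline smoke test, for example verifying that the key manager is still reachable before a deployment is declared healthy. This is a minimal sketch; the hostname and port are placeholder assumptions:

```python
import socket

# Illustrative post-deploy smoke test: after a CI/CD run, confirm that
# the key manager (Dirk) is still reachable over TCP before calling the
# deploy healthy. Hostname and port below are placeholder assumptions.
def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# In a pipeline this would gate the deploy, e.g.:
#   if not can_connect("dirk1.example.internal", 13141):
#       roll_back_and_alert()  # hypothetical rollback hook
```

A reachability check like this does not prove the client is attesting, but it would have caught the broken Vouch-to-Dirk connection minutes earlier than the alerting stack did.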

Hey, Remus from the Lido Node Operator Mechanisms workstream here

Thank you @dev0_sik and the Staking Facilities team for the detailed report. For reimbursement, we have calculated that the protocol was impacted by missed rewards and penalties amounting to 0.6842 ETH.

You can send the reimbursement in ETH (not stETH) to the Execution Layer Rewards vault at 0x388C818CA8B9251b393131C08a736A67ccB19297 - please verify you have the correct address at Mainnet | Lido Docs or the locator at https://etherscan.io/address/0xC1d0b3DE6792Bf6b4b37EccdcC24e45978Cfd2Eb#readProxyContract#F5


The missed rewards have been reimbursed on 2025/02/04 in the transaction 0xf73f0e91abb60969e7e28d63abfbffdbf80b44a6b5707030e159f7a0d63f5d09
