With this post mortem, Staking Facilities wants to inform the community and the Lido DAO about an incident that impacted validators from the curated module.
We stand by our responsibilities as node operators and are willing to reimburse the value lost as a result of this incident.
Incident Summary
On Wednesday, 22 January 2025, at 09:22 AM (UTC+1), one of our Lido clusters experienced a downtime of approximately 60 minutes.
The incident was fully resolved at 10:21 AM (UTC+1).
Root Cause
The root cause of the outage was a pipeline run in our CI/CD stack that deployed misconfigured variables from an Ansible playbook, which rendered the validator client of this cluster (Vouch) unable to connect properly to the key manager (Dirk).
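To illustrate the failure mode, here is a minimal sketch of a pre-restart guard that would catch a mis-templated Dirk endpoint before Vouch is restarted. The config path and the accountmanager.dirk.endpoints layout used below are illustrative assumptions, not necessarily our exact configuration:

```python
#!/usr/bin/env python3
"""Illustrative pre-restart check for a rendered Vouch config.

The config path and key layout (accountmanager.dirk.endpoints) are
assumptions for this sketch and may differ from a real deployment.
"""
import re
import sys

import yaml  # PyYAML

CONFIG_PATH = "/etc/vouch/vouch.yml"             # assumed path
ENDPOINT_RE = re.compile(r"^[\w.\-]+:\d{1,5}$")  # expect host:port

def main() -> int:
    with open(CONFIG_PATH) as fh:
        cfg = yaml.safe_load(fh)

    # Walk to the assumed Dirk endpoint list; fail loudly if the
    # templating step dropped or mangled it.
    endpoints = (
        cfg.get("accountmanager", {}).get("dirk", {}).get("endpoints", [])
    )
    if not endpoints:
        print("no Dirk endpoints found in rendered config", file=sys.stderr)
        return 1

    bad = [e for e in endpoints if not ENDPOINT_RE.match(str(e))]
    if bad:
        print(f"malformed Dirk endpoint(s): {bad}", file=sys.stderr)
        return 1

    print(f"config OK: {len(endpoints)} Dirk endpoint(s)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```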
Impact
The cluster downtime affected 4000 validators from the curated set and lasted for 60 minutes.
Resolution
As soon as we became aware of the situation, our first responder looked at the systems and quickly brought in two other nodeOps colleagues to investigate the issue together.
We quickly realized that the issue was located between Vouch and Dirk and checked the configuration files.
Once we found out that the pipeline had accidentally altered the Vouch configuration with a single wrong value, we restored the previous configuration file and got the cluster back up and running.
The resolution itself was straightforward, but we were slower than usual to react because our monitoring/alerting stack failed us on the initial critical attestation alert. That alert was resolved in PagerDuty by the integration API within seconds, before it could page anyone.
As a result, the issue went unnoticed until our sync committee alert went off.
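For context, a sketch of the mechanism behind this: with PagerDuty's Events API v2, a resolve event that carries the same dedup_key as the trigger closes the alert, and if it lands before a page goes out, nobody is notified. The routing key and dedup_key below are placeholders, not our actual integration:

```python
"""Sketch of how an integration can silently close a PagerDuty alert.

A resolve event carrying the same dedup_key as the trigger closes the
alert; if it arrives before notifications fire, nobody is paged.
ROUTING_KEY and the dedup_key are placeholders.
"""
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder

def send_event(action: str, dedup_key: str, summary: str = "") -> None:
    body = {
        "routing_key": ROUTING_KEY,
        "event_action": action,   # "trigger", "acknowledge" or "resolve"
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        body["payload"] = {
            "summary": summary,
            "source": "monitoring",
            "severity": "critical",
        }
    requests.post(EVENTS_API, json=body, timeout=10).raise_for_status()

# The critical attestation alert is opened ...
send_event("trigger", "attestation-miss", "Missed attestations on cluster")
# ... and immediately resolved under the same dedup_key, so the page
# never reaches the on-call responder.
send_event("resolve", "attestation-miss")
```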
Timeline
- 2025/01/22 09:20 UTC+1: CI/CD pipeline triggered
- 2025/01/22 09:22 UTC+1: Incident started
- 2025/01/22 09:37 UTC+1: Critical attestation alert received; it wrongly auto-resolved before being noticed
- 2025/01/22 09:51 UTC+1: Critical SyncCommittee alert received
- 2025/01/22 09:53 UTC+1: First responder from the nodeOps team investigating the issue
- 2025/01/22 09:57 UTC+1: Incident escalated to second nodeOps colleague for further debugging
- 2025/01/22 10:05 UTC+1: Root cause identified
- 2025/01/22 10:08 UTC+1: Further escalation internally - 3 nodeOps professionals on the issue
- 2025/01/22 10:18 UTC+1: Fix applied to the configuration
- 2025/01/22 10:21 UTC+1: Incident resolved
Takeaways
After triggering CI/CD pipelines we will be more careful about checking the stability of the systems, to prevent incidents like this.
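As an example of what such a post-deployment check could look like, the sketch below polls the validator client's metrics endpoint until duty activity is visible again; the URL and metric name are placeholders rather than our actual setup:

```python
"""Sketch of a post-deployment smoke check for a validator cluster.

METRICS_URL and METRIC_NAME are placeholders; the idea is simply to
block the pipeline until the validator client is demonstrably doing
work again, or fail loudly if it is not.
"""
import sys
import time

import requests

METRICS_URL = "http://vouch.internal:8081/metrics"          # placeholder
METRIC_NAME = "vouch_attestation_process_requests_total"    # placeholder
WAIT_SECONDS = 600    # roughly 1.5 epochs of headroom
POLL_INTERVAL = 30

def read_counter() -> float | None:
    """Return the summed value of METRIC_NAME, or None if unreachable."""
    try:
        text = requests.get(METRICS_URL, timeout=5).text
    except requests.RequestException:
        return None
    total, found = 0.0, False
    for line in text.splitlines():
        if line.startswith(METRIC_NAME):
            total += float(line.rsplit(" ", 1)[-1])
            found = True
    return total if found else None

def main() -> int:
    baseline = read_counter()
    deadline = time.monotonic() + WAIT_SECONDS
    while time.monotonic() < deadline:
        time.sleep(POLL_INTERVAL)
        current = read_counter()
        if current is not None and (baseline is None or current > baseline):
            print("validator client is processing duties again")
            return 0
    print("no duty activity observed after deployment", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```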
Action items
- we fixed the misconfiguration in the playbook
- we implemented additional safeguards to prevent accidental configuration changes
- we are reviewing and improving our monitoring and alerting
- we raised awareness around working with automation pipelines