With this post mortem, Staking Facilities wants to inform the community and the Lido DAO about an incident that impacted validators from the curated module.
We stand by our responsibilities as node operators and are willing to reimburse the value lost as a result of this incident.
Incident Summary
On Wednesday, 22 January 2025, at 09:22 AM (UTC+1), one of our Lido clusters experienced a downtime of approximately 60 minutes.
The incident was fully resolved at 10:21 AM (UTC+1).
Root Cause
The root cause of the outage was a pipeline run in our CI/CD stack that deployed misconfigured variables from an Ansible playbook, which rendered the validator client of this cluster (Vouch) unable to connect properly to the key manager (Dirk).
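To illustrate the failure mode, here is a minimal sketch of a pre-restart guard that would catch a mis-templated Dirk endpoint before Vouch is restarted. The config path and the accountmanager.dirk.endpoints layout used below are illustrative assumptions, not necessarily our exact configuration:

```python
#!/usr/bin/env python3
"""Illustrative pre-restart check for a rendered Vouch config.

The config path and key layout (accountmanager.dirk.endpoints) are
assumptions for this sketch and may differ from a real deployment.
"""
import re
import sys

import yaml  # PyYAML

CONFIG_PATH = "/etc/vouch/vouch.yml"             # assumed path
ENDPOINT_RE = re.compile(r"^[\w.\-]+:\d{1,5}$")  # expect host:port

def main() -> int:
    with open(CONFIG_PATH) as fh:
        cfg = yaml.safe_load(fh)

    # Walk to the assumed Dirk endpoint list; fail loudly if the
    # templating step dropped or mangled it.
    endpoints = (
        cfg.get("accountmanager", {}).get("dirk", {}).get("endpoints", [])
    )
    if not endpoints:
        print("no Dirk endpoints found in rendered config", file=sys.stderr)
        return 1

    bad = [e for e in endpoints if not ENDPOINT_RE.match(str(e))]
    if bad:
        print(f"malformed Dirk endpoint(s): {bad}", file=sys.stderr)
        return 1

    print(f"config OK: {len(endpoints)} Dirk endpoint(s)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```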
Impact
The cluster downtime affected 4000 validators from the curated set and lasted for 60 minutes.
Resolution
As soon as we became aware of the situation, our first responder looked at the systems and quickly brought in two other nodeOps colleagues to investigate the issue together.
We quickly realized that the issue was located between Vouch and Dirk and checked the configuration files.
Once we found out that the pipeline had accidentally altered the Vouch configuration with a single wrong value, we restored the previous configuration file and got the cluster back up and running.
The resolution itself was straightforward, but we were slower than usual to react because our monitoring/alerting stack failed us on the initial critical attestation alert. That alert was resolved in PagerDuty by the integration API within seconds, before it could page anyone.
As a result, the issue went unnoticed until our sync committee alert went off.
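For context, a sketch of the mechanism behind this: with PagerDuty's Events API v2, a resolve event that carries the same dedup_key as the trigger closes the alert, and if it lands before a page goes out, nobody is notified. The routing key and dedup_key below are placeholders, not our actual integration:

```python
"""Sketch of how an integration can silently close a PagerDuty alert.

A resolve event carrying the same dedup_key as the trigger closes the
alert; if it arrives before notifications fire, nobody is paged.
ROUTING_KEY and the dedup_key are placeholders.
"""
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder

def send_event(action: str, dedup_key: str, summary: str = "") -> None:
    body = {
        "routing_key": ROUTING_KEY,
        "event_action": action,   # "trigger", "acknowledge" or "resolve"
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        body["payload"] = {
            "summary": summary,
            "source": "monitoring",
            "severity": "critical",
        }
    requests.post(EVENTS_API, json=body, timeout=10).raise_for_status()

# The critical attestation alert is opened ...
send_event("trigger", "attestation-miss", "Missed attestations on cluster")
# ... and immediately resolved under the same dedup_key, so the page
# never reaches the on-call responder.
send_event("resolve", "attestation-miss")
```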
Timeline
- 2025/01/22 09:20 UTC+1: CI/CD pipeline triggered
- 2025/01/22 09:22 UTC+1: Incident started
- 2025/01/22 09:37 UTC+1: Critical attestation alert received; it wrongly auto-resolved before being noticed
- 2025/01/22 09:51 UTC+1: Critical SyncCommittee alert received
- 2025/01/22 09:53 UTC+1: First responder from the nodeOps team investigating the issue
- 2025/01/22 09:57 UTC+1: Incident escalated to second nodeOps colleague for further debugging
- 2025/01/22 10:05 UTC+1: Root cause identified
- 2025/01/22 10:08 UTC+1: Further escalation internally - 3 nodeOps professionals on the issue
- 2025/01/22 10:18 UTC+1: Fix applied to the configuration
- 2025/01/22 10:21 UTC+1: Incident resolved
Takeaways
After triggering CI/CD pipelines we will be more careful about checking the stability of the systems, to prevent incidents like this.
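As an example of what such a post-deployment check could look like, the sketch below polls the validator client's metrics endpoint until duty activity is visible again; the URL and metric name are placeholders rather than our actual setup:

```python
"""Sketch of a post-deployment smoke check for a validator cluster.

METRICS_URL and METRIC_NAME are placeholders; the idea is simply to
block the pipeline until the validator client is demonstrably doing
work again, or fail loudly if it is not.
"""
import sys
import time

import requests

METRICS_URL = "http://vouch.internal:8081/metrics"          # placeholder
METRIC_NAME = "vouch_attestation_process_requests_total"    # placeholder
WAIT_SECONDS = 600    # roughly 1.5 epochs of headroom
POLL_INTERVAL = 30

def read_counter() -> float | None:
    """Return the summed value of METRIC_NAME, or None if unreachable."""
    try:
        text = requests.get(METRICS_URL, timeout=5).text
    except requests.RequestException:
        return None
    total, found = 0.0, False
    for line in text.splitlines():
        if line.startswith(METRIC_NAME):
            total += float(line.rsplit(" ", 1)[-1])
            found = True
    return total if found else None

def main() -> int:
    baseline = read_counter()
    deadline = time.monotonic() + WAIT_SECONDS
    while time.monotonic() < deadline:
        time.sleep(POLL_INTERVAL)
        current = read_counter()
        if current is not None and (baseline is None or current > baseline):
            print("validator client is processing duties again")
            return 0
    print("no duty activity observed after deployment", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```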
Action items
- we fixed the misconfiguration in the playbook
- we implemented additional safeguards to prevent accidental configuration changes
- we are reviewing and improving our monitoring and alerting
- we raised awareness around working with automation pipelines