Post Mortem: Staking Facilities downtime 2025/06/08

With this post mortem Staking Facilities wants to inform the community and Lido DAO about an incident that impacted validators from the curated module.

We take responsibility as a node operator and are willing to reimburse the value lost as a result of this incident.

Incident Summary

On Sunday, the 8th of June 2025 at 11:15 AM (UTC), one of our Lido clusters experienced a downtime of approximately 50 minutes.
The incident was resolved at 12:05 PM (UTC).

Root Cause

The root cause of the outage was a PSOD (Purple Screen of Death) on one of our VMware ESXi hypervisors.
This hypervisor was hosting the Vouch instance (validator client) for that cluster.

Impact

The cluster downtime affected roughly 3200 validators from the curated set and lasted for 50 minutes.
During that time the validators were assigned 3 proposal duties that could not be performed.
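
For scale, the 50-minute window can be converted into consensus-layer slots and epochs. The following is a minimal sketch, assuming Ethereum mainnet timing (12-second slots, 32 slots per epoch) and one attestation duty per validator per epoch. The function name and the resulting attestation count are illustrative estimates, not figures taken from our monitoring.

```python
# Assumed Ethereum mainnet timing constants.
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32


def downtime_in_slots_and_epochs(minutes):
    """Convert a downtime in minutes into whole slots and fractional epochs."""
    seconds = minutes * 60
    slots = seconds // SECONDS_PER_SLOT
    epochs = slots / SLOTS_PER_EPOCH
    return slots, epochs


slots, epochs = downtime_in_slots_and_epochs(50)

# Rough estimate: each validator attests once per epoch.
validators = 3200  # "roughly 3200" per the Impact section
estimated_missed_attestations = int(validators * epochs)

print(slots, epochs, estimated_missed_attestations)
```

The estimate ignores that the outage did not start or end on an epoch boundary, so the real count can differ slightly.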

Resolution

Once our monitoring stack alerted the on-call engineer, the issue was quickly escalated to a team of three engineers to resolve the critical incident as quickly as possible.
The PSOD was identified, and the hypervisor was rebooted and brought back into a working state.
The engineers then restored functionality on the systems and checked the integrity of the whole cluster.

Timeline

  • 2025/06/08 11:15 UTC: Hypervisor crashed with a PSOD
  • 2025/06/08 11:19 UTC: PagerDuty alert escalated to Infra team
  • 2025/06/08 11:20 UTC: Infra team acknowledged the alert and started to investigate with 2 engineers
  • 2025/06/08 11:30 UTC: PagerDuty alert escalated to NodeOps team
  • 2025/06/08 11:34 UTC: NodeOps team acknowledged the alert
  • 2025/06/08 11:40 UTC: Additional NodeOps engineer joined the incident response team
  • 2025/06/08 11:45 UTC: Hypervisor being rebooted
  • 2025/06/08 11:50 UTC: Hypervisor operational again
  • 2025/06/08 11:55 UTC: All instances started
  • 2025/06/08 12:00 UTC: Node integrity checked and operations restored
  • 2025/06/08 12:05 UTC: Validators attesting again
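
The response intervals implied by the timeline can be derived directly from the timestamps above. This is a small sketch using only the times listed in this post; the helper name and the choice of which intervals to report are our own illustration.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"  # timestamp format used in the timeline (all UTC)


def minutes_between(start, end):
    """Return the whole minutes elapsed between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


crash = "2025/06/08 11:15"

time_to_first_page = minutes_between(crash, "2025/06/08 11:19")  # first PagerDuty escalation
time_to_ack = minutes_between(crash, "2025/06/08 11:20")         # Infra team acknowledgement
time_to_reboot = minutes_between(crash, "2025/06/08 11:45")      # hypervisor reboot
time_to_recovery = minutes_between(crash, "2025/06/08 12:05")    # validators attesting again

print(time_to_first_page, time_to_ack, time_to_reboot, time_to_recovery)
```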

Takeaways

A VMware PSOD is a threat to the cluster when the validator client running on that hypervisor is a single point of failure.

Action items

  • We are migrating all Vouch clients off VMware and will operate them in a hyperconverged Nutanix cluster
  • We plan to switch to a multi-Vouch setup, since this is now supported

Hey, Marc from the Lido Node Operator Mechanisms workstream here.

Thanks @dev0_sik and the Staking Facilities team for sharing this detailed and transparent post-mortem. After reviewing the incident, we’ve calculated that the total impact to the protocol from missed proposals and penalties amounts to 0.7171 ETH.

To reimburse, please send the amount in ETH (not stETH) to the Execution Layer Rewards Vault:
0x388C818CA8B9251b393131C08a736A67ccB19297

You can confirm the address via the Mainnet page of the Lido Docs or through the vault locator on Etherscan:
https://etherscan.io/address/0xC1d0b3DE6792Bf6b4b37EccdcC24e45978Cfd2Eb#readProxyContract#F5