As CTO and representative of Staking Facilities, I would like to share the following incident with the Lido DAO. The incident caused prolonged downtime for ~half of our validators. There was no risk of slashing involved.
We take this kind of incident seriously and accept full responsibility. We have already communicated our intention to reimburse the DAO for the estimated value lost during the downtime, as calculated by the Lido Node Operator Management team, whom I would like to thank at this point as well.
Incident summary
Between November 19, 17:50 UTC and November 21, 12:00 UTC, Staking Facilities’ Lido validators experienced unplanned downtime in one of SF’s data centers due to network connectivity issues.
Root Cause
Loss of network connectivity, suspected to be caused by a physical disconnection or a hardware fault in the affected router.
Impact
Operational Impact: Validators in one of SF’s data centers experienced unplanned downtime and did not perform their duties for the duration of the outage.
Resolution
Initial Misdiagnosis: The issue was initially believed to be caused by the physical port on the router.
First Resolution: The Ethernet cable was unplugged and replugged, which temporarily restored connectivity, and the connection was moved to a different port on the device.
Recurrence and Further Action: After the issue reappeared, the problematic router was replaced with a new device and connectivity was restored.
Future Preventive Measures: Redundancy for the router will be implemented in the data center to improve reliability (see Action items below).
Timeline
- November 19, 17:50 UTC: Notification via e-mail from the data center provider about a physical port changing status to “DOWN”. Multiple alerts are received at the same time, indicating that the data center is completely unreachable for SF.
- November 19, 17:58 UTC: After verifying the issue, the SF on-call engineer contacts the data center to investigate on their side.
- November 19, 18:09 UTC: Lido is notified that the validators in the DC are offline.
- November 19, 18:20 UTC: The DC opens a ticket to investigate the root cause.
- November 19, 18:26 UTC: A DC technician discusses the issue with the SF on-call engineer; the decision is made to restart the router.
- November 19, 18:53 UTC: The router restart does not fix the issue; the DC technician contacts the emergency team.
- November 19, 19:17 UTC: The SF on-call engineer heads to the data center for an on-site inspection.
- November 19, 19:53 UTC: The SF on-call engineer arrives at the rack.
- November 19, 20:18 UTC: After further analysis, the Ethernet cable is unplugged and replugged, resolving the issue and restoring connectivity.
- November 19, 20:34 UTC: The SF on-call engineer moves the connection to a different port on the device.
- November 21, 00:25 UTC: The issue reappears; the SF on-call engineer contacts the data center again.
- November 21, 00:41 UTC: Lido is notified about the recurrence of the incident.
- November 21, 00:59 UTC: A ticket is opened to repeat the fix from the first occurrence.
- November 21, 01:05 UTC: Connectivity is restored following the same procedure; the decision is made to replace the affected Ethernet cable as a precaution.
- November 21, 12:00 UTC: SF replaces the affected router in the DC with a new device.
Lessons learned
A thorough initial diagnosis matters: the fault was first attributed to the port rather than the router itself, which delayed the permanent fix. A fast on-site response is equally important to keep downtime short.
Action items
SF will cluster two routers in the affected data center to provide a failover setup and remove the single point of failure exposed by this incident.
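The failover itself will live at the network layer between the two clustered routers. Purely as a hypothetical illustration of how such a redundant pair of uplinks could be monitored, the minimal Python watchdog below probes both paths and raises an alert after repeated failures; the addresses, check interval, failure threshold and alert hook are placeholders and not part of SF’s actual tooling.

```python
#!/usr/bin/env python3
"""Minimal uplink watchdog sketch (illustrative only, not production code)."""
import subprocess
import time

# Placeholder addresses for the two redundant routers; not real SF infrastructure.
UPLINKS = ["192.0.2.1", "192.0.2.2"]
CHECK_INTERVAL_S = 30   # seconds between probe rounds (placeholder)
FAILURE_THRESHOLD = 3   # consecutive failures before alerting (placeholder)


def is_reachable(host: str) -> bool:
    """Send a single ICMP echo to `host` (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def alert(message: str) -> None:
    """Placeholder alert hook; a real setup would page the on-call engineer."""
    print(f"ALERT: {message}")


def main() -> None:
    consecutive_failures = {host: 0 for host in UPLINKS}
    while True:
        for host in UPLINKS:
            if is_reachable(host):
                consecutive_failures[host] = 0
            else:
                consecutive_failures[host] += 1
                if consecutive_failures[host] == FAILURE_THRESHOLD:
                    alert(f"uplink {host} unreachable for "
                          f"{FAILURE_THRESHOLD} consecutive checks")
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()
```

Probing both paths independently is what makes the redundancy actionable: a silent failure of the standby uplink would otherwise only be discovered when the primary fails as well.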