As CTO and representative of Staking Facilities, I would like to share the following incident with the Lido DAO. The incident caused prolonged downtime for ~half of our validators. There was no risk of slashing involved.
We take this kind of incident seriously and accept full responsibility. We have already communicated our intention to reimburse the DAO for the estimated value lost during the downtime, as calculated by the Lido Node Operator Management team, whom I would like to thank at this point as well.
Incident summary
Between November 19, 17:50 UTC and November 21, 12:00 UTC, Staking Facilities’ Lido validators experienced unplanned downtime in one of SF’s data centers due to network connectivity issues.
Root Cause
Loss of network connectivity, suspected to be caused by a physical disconnection or a hardware fault in the affected router.
Impact
Operational Impact: Validators in one of SF’s data centers experienced unplanned downtime and did not perform their duties for the duration of the outage.
Resolution
Initial Misdiagnosis: The issue was initially believed to be caused by the physical port on the router.
First Resolution: The Ethernet cable was unplugged and replugged, which temporarily restored connectivity, and the connection was moved to a different port on the device.
Recurrence and Further Action: After the issue reappeared, the problematic router was replaced with a new device and connectivity was restored.
Future Preventive Measures: Redundancy for the router will be implemented in the data center to improve reliability (see Action items below).
Timeline
- November 19, 17:50 UTC: Notification via e-mail from the data center provider about a physical port changing status to “DOWN”. Multiple alerts are received at the same time, indicating that the data center is completely unreachable for SF.
- November 19, 17:58 UTC: After verifying the issue, the SF on-call engineer contacts the data center to investigate on their side.
- November 19, 18:09 UTC: Lido is notified that the validators in the DC are offline.
- November 19, 18:20 UTC: The DC opens a ticket to investigate the root cause.
- November 19, 18:26 UTC: A DC technician discusses the issue with the SF on-call engineer; the decision is made to restart the router.
- November 19, 18:53 UTC: The router restart does not fix the issue; the DC technician contacts the emergency team.
- November 19, 19:17 UTC: The SF on-call engineer heads to the data center for an on-site inspection.
- November 19, 19:53 UTC: The SF on-call engineer arrives at the rack.
- November 19, 20:18 UTC: After further analysis, the Ethernet cable is unplugged and replugged, resolving the issue and restoring connectivity.
- November 19, 20:34 UTC: The SF on-call engineer moves the connection to a different port on the device.
- November 21, 00:25 UTC: The issue reappears; the SF on-call engineer contacts the data center again.
- November 21, 00:41 UTC: Lido is notified about the recurrence of the incident.
- November 21, 00:59 UTC: A ticket is opened to repeat the fix from the first occurrence.
- November 21, 01:05 UTC: Connectivity is restored following the same procedure; the decision is made to replace the affected Ethernet cable as a precaution.
- November 21, 12:00 UTC: SF replaces the affected router in the DC with a new device.
Lessons learned
A thorough initial diagnosis matters: the fault was first attributed to the port rather than the router itself, which delayed the permanent fix. A fast on-site response is equally important to keep downtime short.
Action items
SF will cluster two routers in the affected data center to provide a failover setup and remove the single point of failure exposed by this incident.
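The failover itself will live at the network layer between the two clustered routers. Purely as a hypothetical illustration of how such a redundant pair of uplinks could be monitored, the minimal Python watchdog below probes both paths and raises an alert after repeated failures; the addresses, check interval, failure threshold and alert hook are placeholders and not part of SF’s actual tooling.

```python
#!/usr/bin/env python3
"""Minimal uplink watchdog sketch (illustrative only, not production code)."""
import subprocess
import time

# Placeholder addresses for the two redundant routers; not real SF infrastructure.
UPLINKS = ["192.0.2.1", "192.0.2.2"]
CHECK_INTERVAL_S = 30   # seconds between probe rounds (placeholder)
FAILURE_THRESHOLD = 3   # consecutive failures before alerting (placeholder)


def is_reachable(host: str) -> bool:
    """Send a single ICMP echo to `host` (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def alert(message: str) -> None:
    """Placeholder alert hook; a real setup would page the on-call engineer."""
    print(f"ALERT: {message}")


def main() -> None:
    consecutive_failures = {host: 0 for host in UPLINKS}
    while True:
        for host in UPLINKS:
            if is_reachable(host):
                consecutive_failures[host] = 0
            else:
                consecutive_failures[host] += 1
                if consecutive_failures[host] == FAILURE_THRESHOLD:
                    alert(f"uplink {host} unreachable for "
                          f"{FAILURE_THRESHOLD} consecutive checks")
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()
```

Probing both paths independently is what makes the redundancy actionable: a silent failure of the standby uplink would otherwise only be discovered when the primary fails as well.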