Incident Post Mortem
Date and Time of Incident: 31st December 2024 23:00 IST
Duration of Incident: Approximately 20 hours 40 minutes
Incident Summary:
On 31st December 2024, at 23:00 IST, one of Launchnodes’ servers in a ‘bare metal’ Data Center in South Africa became unreachable. Launchnodes’ remote monitoring system raised alerts via Telegram and email. Initial attempts to reach the server over SSH and ping were unsuccessful. A ticket was promptly raised with the Data Center’s on-site team, and the issue was escalated. During the on-site team’s investigation, connectivity to the server was observed to be intermittent.
Incident Timeline:
- 31st December 2024, 23:00 IST: Alerts received via Telegram and email.
- 1st January 2025, 00:15 IST: Connection restored temporarily.
- 1st January 2025, 00:19 IST: Connection lost again.
- 1st January 2025, 00:35 IST: Connection restored.
- 1st January 2025, 01:06 IST: Connection lost.
- 1st January 2025, 16:28 IST: Connection restored.
- 1st January 2025, 21:41 IST: Connection lost.
- 1st January 2025, 23:42 IST: Connection restored.
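The length of each connectivity-loss window in the timeline above can be tallied with a short script. This is a minimal sketch, not the method used to prepare this report: it assumes the 23:00 IST alert marks the start of the first loss window, and the names `OUTAGES`, `window_lengths` and `total_unreachable` are hypothetical. The result reflects network reachability only, so it need not equal the cumulative key downtime reported elsewhere in this report, which may be measured differently.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# (connection lost, connection restored) pairs transcribed from the
# timeline above, all in IST. Assumption: the 23:00 alert marks the
# start of the first loss window.
OUTAGES = [
    ("2024-12-31 23:00", "2025-01-01 00:15"),
    ("2025-01-01 00:19", "2025-01-01 00:35"),
    ("2025-01-01 01:06", "2025-01-01 16:28"),
    ("2025-01-01 21:41", "2025-01-01 23:42"),
]


def window_lengths(outages):
    """Return the duration of each (lost, restored) window as a timedelta."""
    return [
        datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        for start, end in outages
    ]


def total_unreachable(outages):
    """Sum the durations of all loss windows."""
    return sum(window_lengths(outages), timedelta())


if __name__ == "__main__":
    for (start, end), length in zip(OUTAGES, window_lengths(OUTAGES)):
        print(f"{start} -> {end}: {length}")
    print("Total time unreachable:", total_unreachable(OUTAGES))
```

Listing the windows individually makes it easy to see which one dominates the total and to re-run the tally if any timestamp is later corrected.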
Root Cause Analysis:
- According to the Data Center team, the root cause of the issue was a faulty network cable at the Data Center, which required replacement. This was not a component over which Launchnodes had direct control.
- This hardware failure caused intermittent connectivity issues during the specified period. Launchnodes has ‘VIP’ status with this provider and multiple servers in this Data Center. However, the issue took a significant period of time to identify and resolve, in part because it occurred overnight on New Year’s Eve and into New Year’s Day.
Actions Taken:
- A ticket was raised with the Data Center provider immediately after the incident was identified.
- Continuous monitoring and communication between Launchnodes’ staff and the DC team were maintained throughout the troubleshooting and resolution process.
- As it became apparent that the issue was intermittent, Launchnodes provided regular updates via its private Telegram group.
- Tests were conducted in one of Launchnodes’ development environments to confirm the approach for safely migrating services to an alternative DC.
- A plan was developed and shared with key Lido community members, ready to be enacted if the service remained down or if other servers at this DC began to experience similar issues.
Impact and Mitigation:
- Impact: Approximately 595 keys went offline for a cumulative duration of approximately 20 hours and 40 minutes.
- Mitigation: The team worked closely with the Data Center team to restore services, and conducted internal tests to help identify similar incidents in future.
- Financial Impact: The total financial impact to the protocol has been calculated to be 2.657 ETH.
This amount has been transferred by Launchnodes to the Lido Execution Layer Rewards Vault:
Follow-up Actions:
- Launchnodes to continue to review, refine and test its resiliency and cutover plans.
Conclusion:
After on-site staff at the Data Center replaced the faulty network cable, the intermittent connectivity issues did not recur and all validators began to attest successfully.
Report Prepared By:
Launchnodes Engineering Team
9th Jan 2025