Network Outage Report
Date: 19th March 2025
Affected Service: Bare Metal Server Hosting Validators
Impact: 5,000 Validators Not Attesting
Duration: 1:16 AM IST - 1:41 AM IST
Incident Summary
On 19th March 2025, an outage occurred in one of our bare metal server provider’s data centers, impacting 5,000 validators. The downtime lasted approximately 25 minutes, from 1:16 AM IST to 1:41 AM IST. The issue was caused by an upstream network failure at the provider’s data center, which led to an incorrect automatic routing switch during a Distributed Denial of Service (DDoS) mitigation process.
Timeline of Events
- 1:16 AM IST: Automated monitoring systems detected that the validators were unreachable.
- 1:22 AM IST: Launchnodes’ engineering team received alerts and began investigating the issue.
- 1:30 AM IST: Initial diagnostics confirmed that the affected servers were unresponsive and could not be reached via standard network protocols.
- 1:36 AM IST: The team contacted the server provider for further investigation and assistance.
- 1:36 AM IST: The server provider confirmed a network issue within their infrastructure.
- 1:41 AM IST: The network issue was resolved, and connectivity was restored to the affected servers.
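For reference, the intervals between the timeline entries can be worked out with a short script. This is a minimal sketch: the event labels are illustrative names rather than official terms, and the resolution time is taken as 1:41 AM IST to match the Duration field at the top of this report.

```python
from datetime import datetime

# Timestamps (IST) taken from the timeline; the keys are illustrative labels.
events = {
    "detected": datetime(2025, 3, 19, 1, 16),
    "team_alerted": datetime(2025, 3, 19, 1, 22),
    "diagnostics_confirmed": datetime(2025, 3, 19, 1, 30),
    "provider_contacted": datetime(2025, 3, 19, 1, 36),
    "resolved": datetime(2025, 3, 19, 1, 41),  # assumed from the Duration field
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    delta = events[end] - events[start]
    return int(delta.total_seconds() // 60)


time_to_alert = minutes_between("detected", "team_alerted")
time_to_escalate = minutes_between("detected", "provider_contacted")
total_downtime = minutes_between("detected", "resolved")

print(time_to_alert, time_to_escalate, total_downtime)  # 6 20 25
```

On these figures, the team was alerted 6 minutes after detection, the provider was contacted 20 minutes after detection, and total downtime was 25 minutes.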
Root Cause Analysis
According to the data center provider, the outage was caused by an automatic network routing issue triggered during a DDoS mitigation process. Their standard procedure involves rerouting traffic away from affected subnets using a combination of four upstreams, two of which have DDoS protection. However, during this event:
- One of the DDoS-protected upstream links failed.
- The automatic switch from four upstreams to two was executed incorrectly, leaving only one functional upstream.
- This resulted in connectivity loss for servers within the affected /24 subnet.
The data center provider identified the issue through their monitoring system and promptly corrected the misconfiguration. They have also updated their routing configurations to prevent such issues from recurring.
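The capacity reduction described above can be illustrated with a small model. This is only a sketch under stated assumptions: the upstream names, the `Upstream` type, and the `safe_to_switch` guard are hypothetical and do not represent the provider's actual routing configuration or tooling. It shows how restricting traffic to the two protected upstreams, while one of them is down, leaves a single functional path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upstream:
    name: str             # hypothetical label, not the provider's naming
    ddos_protected: bool  # True for the two upstreams with DDoS protection
    link_up: bool         # False for the link that failed


def mitigation_candidates(upstreams):
    """Upstreams eligible to carry traffic during DDoS mitigation."""
    return [u for u in upstreams if u.ddos_protected]


def functional(upstreams):
    """Upstreams whose link is currently up."""
    return [u for u in upstreams if u.link_up]


def safe_to_switch(upstreams, min_functional=2):
    """Hypothetical pre-switch guard: only switch to the protected set
    if enough protected upstreams are actually up."""
    return len(functional(mitigation_candidates(upstreams))) >= min_functional


# Four upstreams, two protected, one protected link down (as in this incident).
upstreams = [
    Upstream("upstream-1", ddos_protected=True, link_up=False),
    Upstream("upstream-2", ddos_protected=True, link_up=True),
    Upstream("upstream-3", ddos_protected=False, link_up=True),
    Upstream("upstream-4", ddos_protected=False, link_up=True),
]

remaining = functional(mitigation_candidates(upstreams))
print([u.name for u in remaining])  # ['upstream-2']
print(safe_to_switch(upstreams))    # False
```

A guard of this kind is one possible way to block an automatic failover that would reduce the active set below a chosen minimum; the provider's actual fix may differ.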
Impact Assessment
- 5,000 validators experienced downtime between 1:16 AM IST and 1:41 AM IST.
- All services resumed normal operations after connectivity was restored.
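As a rough illustration of scale, the downtime can be converted into Ethereum epochs. This is a back-of-the-envelope sketch, assuming the standard mainnet constants of 12 seconds per slot and 32 slots per epoch, and one attestation duty per validator per epoch. It is not the method used to calculate any compensation amount, and the actual number of missed duties depends on where the epoch boundaries fell during the outage.

```python
# Ethereum mainnet consensus constants.
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32

# Figures from this report.
validators = 5_000
downtime_minutes = 25  # 1:16 AM IST to 1:41 AM IST

seconds_per_epoch = SECONDS_PER_SLOT * SLOTS_PER_EPOCH   # 384 seconds
epochs_affected = downtime_minutes * 60 / seconds_per_epoch

# Each active validator is assigned one attestation per epoch,
# so this is an approximation, not an exact count.
approx_missed_attestations = round(validators * epochs_affected)

print(f"{epochs_affected:.2f} epochs")  # 3.91 epochs
print(approx_missed_attestations)       # 19531
```

On these assumptions, the outage spanned just under four epochs, or roughly four missed attestations per validator.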
Preventive Measures & Next Steps
- Data Center Provider Actions:
  - Updated routing configurations to prevent incorrect automatic failovers.
  - Improved monitoring and alerting mechanisms for upstream failures.
Conclusion
The outage was caused by an incorrect automatic switch in the data center’s routing process during a DDoS mitigation event. The issue has been addressed, and measures have been put in place to prevent similar incidents in the future. We will continue to monitor performance and collaborate with the provider to ensure network reliability.
A payment of 0.5271 ETH has been made by Launchnodes to the Lido Execution Layer Rewards vault, to cover rewards missed due to this data center outage: