Incident Post Mortem
Date and Time of Incident: March 10, 2025, 09:52 KST
Duration of Incident: Approximately 1 hour and 10 minutes
Incident Summary
On March 10, 2025, at 09:52 KST, the firewall equipment at one of A41’s data centers in South Korea developed a memory leak following an automatic software upgrade. The leak caused the firewall to enter conserve mode, in which it could no longer maintain sessions for its firewall rules. As a result, network latency increased, and Ethereum nodes on physical machines behind this firewall were unable to participate normally in the network. The issue was resolved by flushing the firewall’s memory and downgrading the firewall software to a previous stable version. Going forward, automatic upgrades have been disabled, and upgrades will be performed manually using verified stable versions.
Incident Timeline
- March 10, 2025, 09:52 KST: Alerts received from the monitoring system
- March 10, 2025, 09:57 KST: Ethereum engineering team engaged
- March 10, 2025, 10:30 KST: Identified that the firewall was in conserve mode due to a memory issue
- March 10, 2025, 10:40 KST: IDC network engineer engaged
- March 10, 2025, 11:00 KST: Firewall memory flushed and firewall software downgraded
- March 10, 2025, 11:05 KST: Incident resolved
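The elapsed time at each step can be derived directly from the timeline above. The following is a minimal sketch that uses only the timestamps listed in this report; the event labels are shortened and the function name is illustrative, not part of any A41 tooling.

```python
from datetime import datetime

# Timestamps copied from the incident timeline (all KST, March 10, 2025).
TIMELINE = [
    ("alerts received", "09:52"),
    ("Ethereum engineering team engaged", "09:57"),
    ("conserve mode identified", "10:30"),
    ("IDC network engineer engaged", "10:40"),
    ("memory flushed and software downgraded", "11:00"),
    ("incident resolved", "11:05"),
]


def minutes_since_start(timeline):
    """Return (event, whole minutes elapsed since the first event) pairs."""
    start = datetime.strptime(timeline[0][1], "%H:%M")
    result = []
    for event, hhmm in timeline:
        delta = datetime.strptime(hhmm, "%H:%M") - start
        result.append((event, int(delta.total_seconds() // 60)))
    return result


if __name__ == "__main__":
    for event, elapsed in minutes_since_start(TIMELINE):
        print(f"+{elapsed:3d} min  {event}")
```

By this calculation the memory issue was identified 38 minutes after the first alert and the incident was resolved after 73 minutes, consistent with the reported duration of approximately 1 hour and 10 minutes.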
Root Cause Analysis
The incident was caused by an automatic software upgrade of the firewall equipment in one of A41’s South Korean data centers. The automatic upgrade feature was enabled, and the software version it installed contained a memory leak. The leak drove the firewall into conserve mode, which disrupted session maintenance for firewall rules and increased network latency. Consequently, Ethereum nodes on physical machines dependent on this firewall could not participate normally in the network.
Actions Taken
- Flushed the firewall equipment’s memory to address the immediate issue
- Downgraded the problematic software version to a previous stable release
- Disabled automatic upgrades and switched to a manual upgrade process in which each version is verified as stable before it is installed
Impact
Impact: Approximately 3,500 keys were offline for a cumulative duration of 1 hour and 10 minutes.
Financial Impact: The total financial impact to the protocol was calculated to be 0.5979 ETH, which has been reimbursed by A41.
Follow-up Actions
- Extend alerting beyond the existing service and node metrics to cover IDC infrastructure, including firewall equipment
- Keep automatic software upgrades disabled and perform upgrades manually, only after confirming version stability
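As an illustration of the firewall alerting follow-up, the sketch below shows one way memory utilization samples could be classified and checked for leak-like growth. It is not A41's monitoring implementation: the thresholds, the MemorySample type, and both function names are hypothetical, and how the samples are collected from the firewall (for example via SNMP or a vendor API) is left out because it depends on the equipment.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values depend on the firewall vendor and model.
WARN_PCT = 75.0
CRITICAL_PCT = 85.0


@dataclass
class MemorySample:
    minute: int       # minutes since sampling began
    used_pct: float   # memory utilization, 0-100


def classify(sample, warn=WARN_PCT, critical=CRITICAL_PCT):
    """Map a single sample to an alert level."""
    if sample.used_pct >= critical:
        return "critical"
    if sample.used_pct >= warn:
        return "warning"
    return "ok"


def looks_like_leak(samples, min_samples=4, min_growth_pct=5.0):
    """Heuristic: over the last min_samples samples, utilization never
    decreases and grows by at least min_growth_pct points in total."""
    if len(samples) < min_samples:
        return False
    values = [s.used_pct for s in samples[-min_samples:]]
    non_decreasing = all(b >= a for a, b in zip(values, values[1:]))
    return non_decreasing and (values[-1] - values[0]) >= min_growth_pct


if __name__ == "__main__":
    # Made-up samples for demonstration only, not data from the incident.
    samples = [MemorySample(i, 60.0 + 4.0 * i) for i in range(5)]
    print(classify(samples[-1]), looks_like_leak(samples))
```

A trend check of this kind is intended to fire before a fixed threshold is crossed, which would give engineers time to act before the firewall reaches conserve mode.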
Report Prepared By:
A41 Engineering Team
March 19, 2025