[Post Mortem] Solstice Incident - December 3, 2025 (Fusaka update)

Marcus_Maute · January 29, 2026, 12:18pm

Incident Post Mortem

Date and Time of Incident: December 3, 2025, 23:15 CET
Duration of Incident: Approximately 6 hours and 35 minutes

Incident Summary

On December 3, 2025, at 23:15 CET, one validator client (VC#18) running Lighthouse v8.0.1 began rejecting attestations with “invalid signature” errors across all three connected node pairs (Nethermind, Nimbus, MEV-Boost). The issue was caused by a failed Docker Compose restart during a prior automated Ansible update on December 1, which left the validator client running an outdated software version. This created a version mismatch between the validator client and the rest of the updated stack. The incident was resolved on December 4 at 05:50 CET by restarting the affected Docker stack. The automated update process has since been hardened with additional restart validation and version checks to prevent undetected failed restarts.

Incident Timeline

December 3, 2025, 22:49 CET: Fusaka update deployed
December 3, 2025, 23:15 CET: First error messages appeared in the central monitoring solution; incident reported with 100-500 validators offline, requiring urgent action
December 4, 2025, 00:47 CET: Solstice team responded and began making phone calls to reach the Zurich team
December 4, 2025, 05:54 CET: Docker container restarted, service restored
December 4, 2025, 06:00 CET: Recovery confirmed

Root Cause Analysis

The incident was caused by a failure in the automated restart sequence of validator VC#18 following a scheduled update cycle. The infrastructure uses an Ansible-based automation process that performs rolling updates, restarting one validator client every 30 minutes to avoid fleet-wide impact. During the update run on December 1, the Docker Compose stack for VC#18 failed to restart correctly, resulting in:

The intended updated container image not being loaded
An older version of the validator client continuing to run
A version mismatch between the validator client and the rest of the updated stack (Nethermind v1.35.3, Nimbus v25.11.1, MEV-Boost v1.10.1)

This state was not detected by existing health checks, allowing the outdated process to remain in production until it began causing protocol-level errors (“invalid signature” and HTTP 400 errors). This was the first observed failure of this automation process in the 24 months since its introduction.
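To illustrate the kind of check that would have caught this state, here is a minimal sketch (not Solstice's actual tooling) that compares the image each container is running against the image the update was supposed to deploy. The container name and image tags in the usage note are hypothetical placeholders; the only external call is `docker inspect` with a Go-template `--format` string.

```python
from __future__ import annotations

import subprocess


def running_image(container: str) -> str:
    """Return the image reference the given container was created from."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.Config.Image}}", container],
        capture_output=True,
        text=True,
        check=True,  # raise if the container does not exist
    )
    return result.stdout.strip()


def find_mismatches(
    expected: dict[str, str],
    actual: dict[str, str],
) -> dict[str, tuple[str, str | None]]:
    """Map each container whose running image differs from the expected one
    to a (expected, actual) pair. A container missing from `actual` is
    reported with None as its running image."""
    mismatches: dict[str, tuple[str, str | None]] = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got != want:
            mismatches[name] = (want, got)
    return mismatches
```

In an update workflow, `actual` would be built by calling `running_image` for every container after the restart step, for example `{"vc18": running_image("vc18")}`, and a non-empty result from `find_mismatches` would fail the run and raise an alert instead of letting the old process keep running.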

Actions Taken

Restarted the affected validator’s Docker Compose stack to restore the correct software version and normal attestation processing
Hardened the automated update process with additional restart validation and version checks
Increased DevOps team coverage to ensure 24×7 availability during travel, illness, and other unplanned team outages

Impact

Impact: One validator client (VC#18) experienced attestation failures, resulting in missed or rejected attestations for approximately 100-500 validators during the incident window of 6 hours and 35 minutes.

Financial Impact: The estimated impact to the protocol was calculated to be 1.0772 ETH.

This amount will be reimbursed by Solstice.

Follow-up Actions

Implement stricter health checks, version verification, and alerting for failed container restarts
Add immediate post-deployment restart and version checks to all validator update workflows
Strengthen automation resilience by introducing staged rollouts with automatic rollback and enhanced observability
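
As a rough sketch of what a staged rollout with automatic rollback could look like (an illustration under assumptions, not the workflow Solstice implemented), the loop below updates one target at a time, verifies it, and rolls back and halts on the first failed verification. The `update`, `verify`, and `rollback` callables and the target names are hypothetical; in practice they would wrap the Ansible and Docker Compose steps and a post-restart version check.

```python
from __future__ import annotations

import time
from typing import Callable, Optional, Sequence


def staged_rollout(
    targets: Sequence[str],
    update: Callable[[str], None],
    verify: Callable[[str], bool],
    rollback: Callable[[str], None],
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[list[str], Optional[str]]:
    """Update targets one by one.

    Returns (successfully updated targets, first failed target or None).
    On the first failed verification the target is rolled back and the
    remaining targets are left untouched.
    """
    updated: list[str] = []
    for index, target in enumerate(targets):
        update(target)
        if not verify(target):
            rollback(target)
            return updated, target
        updated.append(target)
        # Pause between stages, but not after the last one.
        if pause_seconds > 0 and index < len(targets) - 1:
            sleep(pause_seconds)
    return updated, None
```

With `pause_seconds=1800` this mirrors the 30-minute spacing between validator client restarts described in the root cause analysis, while stopping the run as soon as one client fails its check.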

Report Prepared By: Solstice Staking AG
Date: January 2026

Izzy · January 30, 2026, 2:23pm

Appreciate the transparency here, thank you Marcus!
