[Post-Mortem] Stakefish Validator and Remote Signer Connectivity Loss - 17 May 2026

seven · June 23, 2026, 7:20am

Status: Resolved

Incident Date: 17 May 2026

Duration: Approx. 1 hour (06:01-07:10 UTC)

Service recovery began: 07:10 UTC

Published: 25 Jun 2026

Executive Summary

On 17 May 2026, our validation infrastructure experienced an unexpected downtime incident affecting a portion of our Ethereum validators. The incident was triggered by an automated system package update (unattended-upgrades) on our Debian-based server housing the validator client.

This automated process caused a transient restart of the core operating system networking service. Due to a systemd dependency and interface handling misalignment, the WireGuard VPN tunnel connecting the validator client to our remote signer failed to properly re-initialize after the interface cycle. As a result, the validator client was unable to request blocks/attestations signatures from the remote signer, leading to missed duties.

The issue has been successfully resolved, with full recovery confirmed at 07:30 UTC. To prevent any recurrence, automated updates have been completely decoupled from our consensus-layer infrastructure. Additionally, as a concrete preventative measure, we have developed and deployed an automated WireGuard Watchdog script to provide fallback self-healing capabilities for the network tunnel.

About the Infrastructure

Our validation setup follows institutional-grade security practices to safeguard signing keys:

Validator Node: Hosted on dedicated Debian bare-metal/instances, running the consensus and execution layer clients.
Remote Signer Node: A hardened environment containing the validator private keys, accessible exclusively via a secure WireGuard VPN tunnel.
Service Management: All critical networking interfaces and software processes are managed locally as systemd daemons.

Timeline of Events

All timestamps are listed in UTC on 17 May.

06:01 UTC – The unattended-upgrades daemon on the Debian validator server automatically triggered a routine, background system package security update.
06:01 UTC – The package installations necessitated an immediate, automated reload of the networking.service components to apply system-level changes.
06:01 UTC – Core network interfaces were cycled. The systemd service [email protected] remained in an active (running) state according to systemd tracking, but the kernel-level virtual network interface (wg0) collapsed and failed to bind back to the underlying physical route.
06:01 UTC – Internal monitoring systems triggered alerts indicating a high rate of missed attestations. Validator logs reported: Failed to fetch signature from remote signer: connection timed out.
06:16 UTC – On-call engineers triaged the incoming alert, immediately initiated the incident response protocol, and mapped out the blast radius to confirm the precise scope of affected validator keys. Initial diagnostics and routing analysis were conducted to pinpoint the root failure within the network layer.
07:00 UTC – Engineers manually reloaded the WireGuard configuration via systemd, instantly restoring the VPN tunnel and rescuing validator duty performance.
07:30 UTC – Following a comprehensive period of observation, engineers officially confirmed full recovery and stabilization across all affected validator keys.

Root Cause Analysis (RCA)

Our technical investigation identified two overlapping failures within the operating system’s default behavior:

Systemd Dependency Gaps: By default, the WireGuard systemd unit file ([email protected]) hooks into network.target. In Debian’s systemd lifecycle, network.target indicates that the networking management stack has booted up initially; it does not automatically track or propagate mid-lifecycle restarts of the underlying networking.service. When the package manager restarted the networking subsystem, systemd did not cascade a restart down to the dependent WireGuard virtual interfaces.
“Zombie” State Phenomenon: When the network interface abruptly dropped during the upgrade, the routing table entries for the WireGuard endpoint were purged. Because the service was stopped and started rapidly by the package script, WireGuard became trapped in a “zombie” state. Systemd registered the daemon process as running, but the actual network interface (wg0) had vanished from the kernel space. At the time of the incident, the cluster lacked an automated script to detect this interface-level discrepancy and force a reload.

Impact

Validator Performance: 6000 of Lido keys missed attestations and block proposals between 06:01 UTC and 07:10 UTC, with tracking concluded upon full confirmation at 07:30 UTC.
Financial Impact: Calculated to be 0.8781 ETH, based on missed attestations during the outage window. Stakefish has committed to cover this loss of earnings on behalf of Lido users.

Actions Taken & Follow-up Actions

Immediate Actions Taken:

Unattended Upgrades Uninstalled: To eliminate any element of unpredictability and un-orchestrated restarts on our validator fleet, unattended-upgrades has been completely purged from all production machines.
```
sudo apt-get remove --purge unattended-upgrades
```
APT Timers Disabled & Masked: To guarantee that no automated background apt-upgrade tasks trigger under any condition, the corresponding systemd daily timers have been disabled and masked.
```
sudo systemctl stop apt-daily.timer apt-daily-upgrade.timer
sudo systemctl disable apt-daily.timer apt-daily-upgrade.timer
sudo systemctl mask apt-daily.timer apt-daily-upgrade.timer
```

Follow-up & Prevention Plan (Post-Incident Mitigations):

Implementation of WireGuard Watchdog Script: As a direct remedy for the “zombie interface” vulnerability, we have written and deployed a lightweight bash watchdog script running on a short-interval systemd timer. The script actively checks the status of the wg0 interface and performs ICMP ping tests to the remote signer’s internal IP. If the link is unresponsive or the interface is missing, the script automatically triggers a programmatic restart of [email protected] to recover connectivity without human intervention.
Systemd Hardening: We are updating our infrastructure-as-code templates (Ansible/Terraform) to append explicit unit overrides for the WireGuard systemd configuration, forcing it to bind directly to the state of the primary network service:

[Unit]
BindsTo=networking.service
After=networking.service

Manual Patching Windows: All OS and security updates for the validator infrastructure will now strictly occur during scheduled, manually-supervised maintenance windows with active failover procedures in place.
Enhanced Monitoring Alerts: Improved our Prometheus/Grafana alerting rules to specifically monitor network tunnel interface uptime (node_network_up metric for wg0) rather than just relying on generic client duty metrics, allowing for instantaneous alerting in the event of interface silent failure.

KimonSh · June 23, 2026, 11:14am

Thank you @seven for the transparency and committed cover following the incident - it’s very much appreciated.

Nerixmachine12 · June 23, 2026, 1:34pm

Thank you for the detailed post-mortem. Purging automated updates from live production environments and implementing explicit systemd unit overrides (BindsTo=networking.service) are the correct technical steps to mitigate virtual network interface silent failures. graciass

Topic		Replies	Views
[Post Mortem] Solstice Incident - December 3, 2025 (fusaka update) Node Operators	1	107	January 30, 2026
Post Mortem: Develp Fleet Downtime 2025/02/14 Node Operators	0	67	February 28, 2025
Post-Mortem Stakely June 16, 2025 Incident Node Operators	1	139	July 15, 2025
CryptoManufaktur 2022-12-21 Ethereum validators outage Node Operators	4	4476	December 29, 2022
Post Mortem: Staking Facilities downtime 2025/01/22 Node Operators	2	128	February 10, 2025