[Post-Mortem] Stakefish Validator and Remote Signer Connectivity Loss - 17 May 2026

Status: Resolved

Incident Date: 17 May 2026

Duration: Approx. 1 hour (06:01-07:10 UTC)

Service recovery began: 07:10 UTC

Published: 25 Jun 2026

Executive Summary

On 17 May 2026, our validation infrastructure experienced an unexpected downtime incident affecting a portion of our Ethereum validators. The incident was triggered by an automated system package update (unattended-upgrades) on our Debian-based server housing the validator client.

This automated process caused a transient restart of the core operating system networking service. Due to a systemd dependency and interface handling misalignment, the WireGuard VPN tunnel connecting the validator client to our remote signer failed to properly re-initialize after the interface cycle. As a result, the validator client was unable to request blocks/attestations signatures from the remote signer, leading to missed duties.

The issue has been successfully resolved, with full recovery confirmed at 07:30 UTC. To prevent any recurrence, automated updates have been completely decoupled from our consensus-layer infrastructure. Additionally, as a concrete preventative measure, we have developed and deployed an automated WireGuard Watchdog script to provide fallback self-healing capabilities for the network tunnel.


About the Infrastructure

Our validation setup follows institutional-grade security practices to safeguard signing keys:

  • Validator Node: Hosted on dedicated Debian bare-metal/instances, running the consensus and execution layer clients.
  • Remote Signer Node: A hardened environment containing the validator private keys, accessible exclusively via a secure WireGuard VPN tunnel.
  • Service Management: All critical networking interfaces and software processes are managed locally as systemd daemons.

Timeline of Events

All timestamps are listed in UTC on 17 May.

  • 06:01 UTC – The unattended-upgrades daemon on the Debian validator server automatically triggered a routine, background system package security update.
  • 06:01 UTC – The package installations necessitated an immediate, automated reload of the networking.service components to apply system-level changes.
  • 06:01 UTC – Core network interfaces were cycled. The systemd service [email protected] remained in an active (running) state according to systemd tracking, but the kernel-level virtual network interface (wg0) collapsed and failed to bind back to the underlying physical route.
  • 06:01 UTC – Internal monitoring systems triggered alerts indicating a high rate of missed attestations. Validator logs reported: Failed to fetch signature from remote signer: connection timed out.
  • 06:16 UTC – On-call engineers triaged the incoming alert, immediately initiated the incident response protocol, and mapped out the blast radius to confirm the precise scope of affected validator keys. Initial diagnostics and routing analysis were conducted to pinpoint the root failure within the network layer.
  • 07:00 UTC – Engineers manually reloaded the WireGuard configuration via systemd, instantly restoring the VPN tunnel and rescuing validator duty performance.
  • 07:30 UTC – Following a comprehensive period of observation, engineers officially confirmed full recovery and stabilization across all affected validator keys.

Root Cause Analysis (RCA)

Our technical investigation identified two overlapping failures within the operating system’s default behavior:

  1. Systemd Dependency Gaps: By default, the WireGuard systemd unit file ([email protected]) hooks into network.target. In Debian’s systemd lifecycle, network.target indicates that the networking management stack has booted up initially; it does not automatically track or propagate mid-lifecycle restarts of the underlying networking.service. When the package manager restarted the networking subsystem, systemd did not cascade a restart down to the dependent WireGuard virtual interfaces.
  2. “Zombie” State Phenomenon: When the network interface abruptly dropped during the upgrade, the routing table entries for the WireGuard endpoint were purged. Because the service was stopped and started rapidly by the package script, WireGuard became trapped in a “zombie” state. Systemd registered the daemon process as running, but the actual network interface (wg0) had vanished from the kernel space. At the time of the incident, the cluster lacked an automated script to detect this interface-level discrepancy and force a reload.

Impact

  • Validator Performance: 6000 of Lido keys missed attestations and block proposals between 06:01 UTC and 07:10 UTC, with tracking concluded upon full confirmation at 07:30 UTC.
  • Financial Impact: Calculated to be 0.8781 ETH, based on missed attestations during the outage window. Stakefish has committed to cover this loss of earnings on behalf of Lido users.

Actions Taken & Follow-up Actions

Immediate Actions Taken:

  • Unattended Upgrades Uninstalled: To eliminate any element of unpredictability and un-orchestrated restarts on our validator fleet, unattended-upgrades has been completely purged from all production machines.
    sudo apt-get remove --purge unattended-upgrades
    
  • APT Timers Disabled & Masked: To guarantee that no automated background apt-upgrade tasks trigger under any condition, the corresponding systemd daily timers have been disabled and masked.
    sudo systemctl stop apt-daily.timer apt-daily-upgrade.timer
    sudo systemctl disable apt-daily.timer apt-daily-upgrade.timer
    sudo systemctl mask apt-daily.timer apt-daily-upgrade.timer
    

Follow-up & Prevention Plan (Post-Incident Mitigations):

  1. Implementation of WireGuard Watchdog Script: As a direct remedy for the “zombie interface” vulnerability, we have written and deployed a lightweight bash watchdog script running on a short-interval systemd timer. The script actively checks the status of the wg0 interface and performs ICMP ping tests to the remote signer’s internal IP. If the link is unresponsive or the interface is missing, the script automatically triggers a programmatic restart of [email protected] to recover connectivity without human intervention.
  2. Systemd Hardening: We are updating our infrastructure-as-code templates (Ansible/Terraform) to append explicit unit overrides for the WireGuard systemd configuration, forcing it to bind directly to the state of the primary network service:
[Unit]
BindsTo=networking.service
After=networking.service
  1. Manual Patching Windows: All OS and security updates for the validator infrastructure will now strictly occur during scheduled, manually-supervised maintenance windows with active failover procedures in place.
  2. Enhanced Monitoring Alerts: Improved our Prometheus/Grafana alerting rules to specifically monitor network tunnel interface uptime (node_network_up metric for wg0) rather than just relying on generic client duty metrics, allowing for instantaneous alerting in the event of interface silent failure.