Post Mortem: Develp Fleet DNS Resolution Failure and Missed Proposal on 2026/05/07

Name: DNS Resolution Failure and Missed Proposal
Impacted Servers: geth-06 and node-06
Impacted Services: nimbus-beacon-node, nimbus-validator-client, web3signer

Summary

On 2026/05/07, after a Linux kernel security upgrade, one of the Lido servers experienced a network issue that led to attestation delays and a missed proposal.
The event happened a few days after a hardware incident where the Cloud Provider had to replace the network switch used by one of the servers running Validators.

In addition, one proposal was missed during that window.

Glossary

  • execution layer node - Nethermind or Geth client used, further referred to as EL.
  • beacon node - Nimbus client running, further referred to as BN.
  • validator client - Nimbus client providing signatures to BN, further referred to as VC.
  • web3signer - Remote signing service that loads keys, further referred to as Signer.
  • CopyFail - Security exploit detected in the Linux kernel.
  • proposal - The validator proposes a block to the network.
  • attestation - Action done by the validator to approve the proposed block.

Timeline (UTC)

  • 2026/05/01
    • 00:55 - Network incident at the Cloud Provider datacenter.
    • 06:00 - Switch replacement completed, end of the hardware incident.
  • 2026/05/06
    • 11:00 - Host system security update.
    • 17:40 - Host reboot to apply kernel update.
    • 18:00 - DevOps team notices increased attestation delays. As the issue was not critical, the investigation was delayed until the next day.
  • 2026/05/07
    • 07:10 - Manual update in DNS configuration to mitigate latency.
    • 07:30 - Enabled systemd-resolved service. Partial recovery.
    • 12:22 - Missed proposal at slot 14277711.
    • 13:15 - VC config changed to use IP address instead of DNS entries.
    • 14:50 - Signer config changed to use IP address instead of DNS entry.
    • 19:28 - Proper DNS fix deployed to less impacted node-06 for testing.

Details

The incident was a combination of multiple events over multiple days:

Networking Switch Incident

During the night of 2026/05/01, the cloud provider had a hardware issue and replaced the network switch of the server geth-06, which ran some of the validators.
During the switch replacement, the provider's technical team misconfigured the internal DNS.

CopyFail Kernel Security Patch

CVE-2026-31431, also known as CopyFail, was detected at the end of April and patched in the Linux kernel on 2026/04/29. This security vulnerability allowed a user to gain administrator privileges.
To fix this security issue, the kernel of the server had to be upgraded and the host restarted.
After fixing the most exposed servers of our infrastructure, it was decided to deploy the patch on the validator fleet. Each host was restarted sequentially to reduce missed attestations.

DNS server misconfiguration

After the restart, the hosts geth-06 and node-06 started to exhibit high block attestation delays and missed attestations.

Our layout of Ethereum nodes in the fleet uses servers in pairs to provide redundancy at the level of the BN.

Because the VCs on geth-06 and node-06 are linked to both BNs, the DNS resolution delays on geth-06 affected the metrics of both BNs.

The DevOps team identified four separate issues which together caused the problem:

  1. The cloud provider DNS misconfiguration that included an unreachable DNS server.
  2. Sequential nature of the libc DNS resolver used by the Nimbus client, which gets stuck on a bad server.
  3. Unexpected lower priority of /etc/hosts file in Name Service Switch configuration.
  4. Usage of DNS entries for BN and Signer in VC config even for local ones.
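
To illustrate how issues 1 and 2 compound, the sketch below models a resolver that tries name servers strictly in order and waits for a full timeout on each unreachable one. It is a simplified model, not the glibc implementation: the `NameServer` type, the addresses and the 20 ms latency are hypothetical, and the 5 second timeout with 2 attempts are the glibc defaults described in resolv.conf(5), assumed here rather than read from the affected hosts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NameServer:
    address: str             # hypothetical address from a documentation range
    reachable: bool
    latency_s: float = 0.02  # assumed round trip of a healthy server


def lookup_delay(servers: List[NameServer],
                 timeout_s: float = 5.0,
                 attempts: int = 2) -> Optional[float]:
    """Return seconds until the first answer, or None if every try times out."""
    elapsed = 0.0
    for _ in range(attempts):
        for server in servers:        # strictly in resolv.conf order
            if server.reachable:
                return elapsed + server.latency_s
            elapsed += timeout_s      # wait out the dead server first
    return None


broken = [NameServer("192.0.2.1", reachable=False),
          NameServer("192.0.2.2", reachable=True)]
healthy = [NameServer("192.0.2.2", reachable=True)]

print("unreachable server listed first:", lookup_delay(broken))
print("only the healthy server:", lookup_delay(healthy))
```

In this model every uncached lookup pays the full timeout before the second server is asked, which is a large part of a 12 second slot.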

Initially, the faulty cloud provider DNS server was removed manually from the host configuration, which provided full recovery.
As a second step, systemd-resolved was activated on the host, intended to address the 4th issue, but this undid the manual DNS server removal.

The DNS server removal happened around 07:10 and quickly resulted in a notable drop in the attestation delay, but some of the delay reappeared after systemd-resolved was enabled.
The situation started to stabilise, and the DevOps team continued to investigate while the attestation delay was settling down.

Missed proposal

At 12:22 UTC (slot 14277711), one validator missed its proposal because of a delay in network submission.
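
The wall-clock start of a slot follows directly from its number, which helps when lining up the timeline with beacon chain data. The sketch below converts the slot number given in the timeline, assuming the validator runs on Ethereum mainnet (genesis time 1606824023, 12 second slots); the function name `slot_start` is illustrative.

```python
from datetime import datetime, timezone

# Assumption: Ethereum mainnet parameters.
MAINNET_GENESIS_TIME = 1606824023  # Unix time of slot 0 (2020-12-01 12:00:23 UTC)
SECONDS_PER_SLOT = 12


def slot_start(slot: int) -> datetime:
    """Return the UTC time at which the given slot begins."""
    timestamp = MAINNET_GENESIS_TIME + slot * SECONDS_PER_SLOT
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


print(slot_start(14277711).isoformat())
```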

As a temporary measure, the loopback interface IP address was used in the VC configuration to mitigate any impact of DNS issues. This resulted in a return to low attestation delays and confirmed DNS as the cause of the issue.

After all DNS entries (including localhost) were replaced with IP addresses in every service config, the delay stopped completely, confirming that DNS was the root cause.
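
A check like the following can flag configured endpoints that still depend on name resolution. It is only a sketch: the endpoint names, URLs and ports are hypothetical examples and do not reflect the fleet's actual configuration.

```python
import ipaddress
from urllib.parse import urlsplit


def needs_name_resolution(url: str) -> bool:
    """Return True if the URL's host is a name rather than an IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True   # a name such as "localhost" goes through the resolver
    return False      # IPv4 or IPv6 literal, no lookup required


# Hypothetical endpoints, for illustration only.
endpoints = {
    "beacon-node": "http://localhost:5052",
    "web3signer": "http://127.0.0.1:9000",
}

for name, url in endpoints.items():
    status = "needs resolution" if needs_name_resolution(url) else "IP literal"
    print(f"{name}: {url} -> {status}")
```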

The permanent solution was to make /etc/hosts first in the libc resolution order by modifying the host's Name Service Switch configuration. systemd-resolved was kept in use for record caching.

 > grep hosts: /etc/nsswitch.conf
hosts:     files mymachines myhostname resolve [!UNAVAIL=return] dns
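
The same ordering can be verified programmatically, for example in a configuration test. The sketch below parses the `hosts:` line and checks that `files` comes first; the function names are illustrative, and the `resolve_first` sample is a hypothetical ordering modelled on the example in the nss-resolve documentation, not the configuration found on the affected hosts.

```python
import re
from typing import List


def hosts_sources(nsswitch_text: str) -> List[str]:
    """Return the lookup sources of the 'hosts:' line, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if line.startswith("hosts:"):
            rest = line[len("hosts:"):]
            # Remove action items such as [!UNAVAIL=return].
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []


def files_first(nsswitch_text: str) -> bool:
    """Return True if /etc/hosts ('files') is consulted before anything else."""
    sources = hosts_sources(nsswitch_text)
    return bool(sources) and sources[0] == "files"


fixed = "hosts:     files mymachines myhostname resolve [!UNAVAIL=return] dns"
# Hypothetical ordering where 'files' is consulted after 'resolve'.
resolve_first = "hosts: mymachines resolve [!UNAVAIL=return] files myhostname dns"

print(hosts_sources(fixed), files_first(fixed))
print(hosts_sources(resolve_first), files_first(resolve_first))
```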

Conclusions

Key lessons are:

  • Order of DNS servers in /etc/resolv.conf matters since libc querying is sequential.
  • Priority of /etc/hosts is not the default on all systems and shouldn’t be assumed to be.
  • Using 127.0.0.1 instead of localhost is safer since it avoids DNS resolution entirely.
  • The Nimbus VC behavior of querying DNS on every request exacerbates the impact of DNS issues.
  • Monitoring DNS server availability can help detect hard-to-debug issues more easily.

The issue has been resolved but additional DNS availability checks will be implemented.
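
One possible shape for such a check is sketched below. It times a lookup through `socket.getaddrinfo`, which uses the same libc resolution path as the services, so it detects slow or failing resolution on the host rather than testing one specific DNS server. The 0.5 second threshold and the function name are assumptions, not the checks that will actually be deployed.

```python
import socket
import time
from typing import Callable, Tuple


def check_resolution(hostname: str,
                     max_seconds: float = 0.5,
                     resolver: Callable = socket.getaddrinfo) -> Tuple[bool, float]:
    """Resolve hostname and report (healthy, elapsed_seconds).

    healthy is False when the lookup fails or takes longer than max_seconds.
    The resolver argument exists so the check can be exercised without DNS.
    """
    start = time.monotonic()
    try:
        resolver(hostname, None)
        resolved = True
    except socket.gaierror:
        resolved = False
    elapsed = time.monotonic() - start
    return resolved and elapsed <= max_seconds, elapsed


if __name__ == "__main__":
    for name in ("localhost",):
        healthy, elapsed = check_resolution(name)
        print(f"{name}: healthy={healthy} elapsed={elapsed:.3f}s")
```

The result can be exported as a metric or used as an alert condition, so that a resolver stuck on a dead server shows up before attestation delays do.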
