Post Mortem: Develp Fleet DNS Resolution Failure and Missed Proposal on 2026/05/07

Develp.Devops · May 29, 2026, 10:52am

Name: DNS Resolution Failure and Missed Proposal
Impacted Servers: geth-06 and node-06
Impacted Services: nimbus-beacon-node, nimbus-validator-client, web3signer

Summary

On 2026/05/07, after a Linux kernel security upgrade, one of the Lido servers experienced a network issue that led to attestation delays and a missed proposal.
The event happened a few days after a hardware incident where the Cloud Provider had to replace the network switch used by one of the servers running Validators.

In addition, one proposal was missed during that window:

https://beaconcha.in/slot/14277711

Glossary

execution layer node - Nethermind or Geth client used, further referred to as EL.
beacon node - Nimbus client running, further referred to as BN.
validator client - Nimbus client providing signatures to BN, further referred to as VC.
web3signer - Remote signing service that loads keys, further referred to as Signer.
CopyFail - Security exploit detected in the Linux kernel.
proposal - The validator proposes a block to the network.
attestation - Action done by the validator to approve the proposed block.

Timeline (UTC)

2026/05/01
- 00:55 - Network incident at the Cloud Provider datacenter.
- 06:00 - Switch replacement completed, end of the hardware incident.
2026/05/06
- 11:00 - Host system security update.
- 17:40 - Host reboot to apply kernel update.
- 18:00 - DevOps team notices increased attestation delays. As the issue was not critical, the investigation was deferred to the next day.
2026/05/07
- 07:10 - Manual update in DNS configuration to mitigate latency.
- 07:30 - Enabled systemd-resolved service. Partial recovery.
- 12:22 - Missed proposal at slot 14277711.
- 13:15 - VC config changed to use IP address instead of DNS entries.
- 14:50 - Signer config changed to use IP address instead of DNS entry.
- 19:28 - Proper DNS fix deployed to less impacted node-06 for testing.

Details

The incident is a combination of multiple events over multiple days:

Networking switch incident in Teraswitch data center.
CopyFail kernel security patch and subsequent host reboot.
DNS Server misconfiguration in Teraswitch data center.
Missed proposal caused by increased delays.

Networking Switch Incident

During the night of 2026/05/01, the cloud provider had a hardware issue and replaced the network switch serving geth-06, a server that runs some of the validators.
During the switch replacement, the provider's technical team misconfigured the internal DNS.

CopyFail Kernel Security Patch

CVE-2026-31431, also known as CopyFail, was detected at the end of April and patched in the Linux kernel on 2026/04/29. This security vulnerability allowed a user to gain administrative privileges.
To fix this security issue, the kernel of each server had to be upgraded and the host restarted.
After patching the most exposed servers of our infrastructure, it was decided to deploy the patch on the Validators fleet. The hosts were restarted one at a time to reduce missed attestations.

DNS server misconfiguration

After the restart, the hosts geth-06 and node-06 started to exhibit high attestation delays and missed attestations.

Our layout of Ethereum nodes in the fleet uses servers in pairs to provide redundancy at the level of the BN.

Because the VCs on geth-06 and node-06 are linked to both BNs, the DNS resolution delays on geth-06 affected the metrics of both BNs.

The DevOps team identified four separate issues which together caused the problem:

1. The cloud provider DNS misconfiguration, which included an unreachable DNS server.
2. The sequential nature of the libc DNS resolution used by the Nimbus client, which gets stuck on a bad server.
3. The unexpectedly lower priority of the /etc/hosts file in the Name Service Switch configuration.
4. The use of DNS entries for BN and Signer in the VC config, even for local ones.
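The first two issues compound: when servers are tried strictly in order, an unreachable server listed first delays every lookup by at least one resolver timeout. The following is a deliberately simplified model of that effect, not the actual libc algorithm. The function name is hypothetical, and the 5-second default is the documented resolv.conf `timeout` default, not a value measured during this incident.

```python
def delay_before_answer(servers_reachable, timeout_s=5.0):
    """Seconds spent waiting before the first reachable server is queried.

    Simplified model (assumption): servers are tried in list order, each
    unreachable server costs one full timeout, and the response time of a
    healthy server is ignored. Returns None if no server is reachable.
    """
    waited = 0.0
    for reachable in servers_reachable:
        if reachable:
            return waited
        waited += timeout_s
    return None


# Same two servers, different order in /etc/resolv.conf.
bad_first = delay_before_answer([False, True])
good_first = delay_before_answer([True, False])
print(f"unreachable server first: {bad_first:.0f}s per lookup")
print(f"reachable server first:   {good_first:.0f}s per lookup")
```

Under these assumptions a single lookup can wait longer than the few seconds a validator has to attest within a slot, which is consistent with the delays observed here.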

Initially, the faulty cloud provider DNS server was removed manually from the hosts' configuration, which provided full recovery.
In a second step, systemd-resolved was activated on the host to address the 4th issue, but this undid the manual removal of the faulty DNS server.

The DNS server removal happened around 07:10 and quickly resulted in a notable drop in attestation delay, but some of the delay reappeared after systemd-resolved was enabled.
The situation started to stabilise, and the DevOps team continued to investigate while the attestation delay was settling down.

Missed proposal

At 12:22 UTC, one validator missed its proposal at slot 14277711 due to a delay in submitting the block to the network.
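The wall-clock time of a slot can be cross-checked from its slot number. This is a minimal sketch, assuming the Ethereum mainnet beacon chain genesis time (Unix timestamp 1606824023) and 12-second slots. Neither constant is stated in this report, and the function name is hypothetical.

```python
from datetime import datetime, timezone

# Assumptions: Ethereum mainnet beacon chain constants.
GENESIS_TIME = 1606824023  # 2020-12-01 12:00:23 UTC
SECONDS_PER_SLOT = 12


def slot_start_utc(slot):
    """Return the UTC datetime at which the given slot starts."""
    return datetime.fromtimestamp(
        GENESIS_TIME + slot * SECONDS_PER_SLOT, tz=timezone.utc
    )


# Slot of the missed proposal, taken from the beaconcha.in link in the Summary.
print(slot_start_utc(14277711).isoformat())
```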

As a temporary measure, the loopback interface IP address was used in the VC configuration to mitigate any impact of DNS issues. This brought attestation delays back down and pointed to DNS as the cause of the issue.

Once all DNS entries (including localhost) were replaced with IP addresses in every service config, the delay stopped completely, confirming DNS as the root cause.
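A check of this kind can be automated by scanning service endpoint URLs for hosts that are not IP literals. This is a minimal sketch; the function name, URLs and ports are hypothetical examples, not the fleet's real configuration.

```python
import ipaddress
from urllib.parse import urlsplit


def needs_name_resolution(url):
    """True if the URL's host is a name (e.g. localhost) rather than an IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host in {url!r}")
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals only
    except ValueError:
        return True
    return False


# Hypothetical endpoints, for illustration only.
for endpoint in ("http://localhost:5052", "http://127.0.0.1:5052", "http://[::1]:9000"):
    print(endpoint, "->", "name lookup" if needs_name_resolution(endpoint) else "IP literal")
```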

The permanent solution was to put /etc/hosts first in the libc resolution order by modifying the host Name Service Switch configuration. systemd-resolved was kept for record caching.

 > grep hosts: /etc/nsswitch.conf
hosts:     files mymachines myhostname resolve [!UNAVAIL=return] dns
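The source order on that line can also be verified programmatically, for example in a fleet-wide configuration check. This is a minimal sketch with hypothetical function names. It only inspects the `hosts:` line and treats bracketed items such as `[!UNAVAIL=return]` as actions, not sources.

```python
import re


def hosts_sources(nsswitch_text):
    """Return the lookup sources on the `hosts:` line, in priority order."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if line.startswith("hosts:"):
            rest = line[len("hosts:"):]
            # Drop action items such as [!UNAVAIL=return]; they are not sources.
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []


def files_has_priority(sources):
    """True when /etc/hosts (`files`) is consulted before any other source."""
    return bool(sources) and sources[0] == "files"


sample = "hosts:     files mymachines myhostname resolve [!UNAVAIL=return] dns"
print(hosts_sources(sample))
print(files_has_priority(hosts_sources(sample)))
```

On a real host the text would come from /etc/nsswitch.conf instead of the sample string.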

Conclusions

Key lessons are:

Order of DNS servers in /etc/resolv.conf matters since libc querying is sequential.
Priority of /etc/hosts is not the default on all systems and shouldn’t be assumed to be.
Using 127.0.0.1 instead of localhost is safer since it avoids DNS resolution entirely.
Nimbus VC behavior of querying DNS at every request exacerbates impact of DNS issues.
Monitoring DNS server availability can make hard-to-debug issues like this one easier to detect.

The issue has been resolved but additional DNS availability checks will be implemented.
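One possible shape for such a check is to send a small query to every nameserver listed in /etc/resolv.conf and flag the ones that do not answer in time. This is a minimal sketch using only the Python standard library, not the check the team will deploy. The function names, the probe domain `example.com` and the 1-second timeout are assumptions.

```python
import socket
import struct


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order they are listed."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(name, txid):
    """Build a minimal DNS query packet for an A record of `name`."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)  # RD flag, 1 question
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(server, name="example.com", timeout_s=1.0, port=53):
    """True if `server` returns any DNS response to the query within the timeout."""
    txid = 0x1234
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(build_query(name, txid), (server, port))
            data, _ = sock.recvfrom(512)
        except OSError:  # timeout, unreachable network, refused port, ...
            return False
    return len(data) >= 2 and struct.unpack("!H", data[:2])[0] == txid


if __name__ == "__main__":
    with open("/etc/resolv.conf") as fh:
        for server in parse_nameservers(fh.read()):
            print(server, "ok" if probe(server) else "UNREACHABLE")
```

Probing each server individually matters here: a lookup through the normal resolver can still succeed via a later server and hide the unreachable one.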
