Launchnodes - Server Restart – Incident Report
Date and Time of Initial Incident: 5th September 2024 12:35 PM IST
Subsequent, Related Incidents Occurred on 9th and 10th September 2024.
Incident Summary
5th September 2024 - Unattended Upgrades
Initially, on 5th September 2024 at 12:35 PM IST, automatic alerts notified the Launchnodes team that pods were restarting because automatic updates of microk8s were taking place. One of Launchnodes’ signer servers had restarted due to the microk8s update, causing downtime for validators. At no point was there a risk of slashing.
This situation had not previously occurred in Launchnodes’ development or testing environments. We also transitioned largely from Docker to Kubernetes this year, and this situation was never encountered in our Docker environments.
As a temporary measure, Launchnodes evaluated and implemented several intermediate changes to prevent future updates from causing a restart, including:
- Stopping Unattended Upgrades
- Setting the Snap Refresh Hold to the year 2040
- Disabling Snapd and Snapd.socket
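The three interim measures above can be sketched as a small dry-run script. This is an illustrative sketch only, not the exact commands Launchnodes ran: the unit name `unattended-upgrades.service`, the `refresh.hold` timestamp, and the helper names `interim_commands` and `run_commands` are assumptions. Whether snapd honours a hold date that far in the future depends on the snapd version, so it should be verified on the target host.

```python
import shlex
import subprocess

# Assumed timestamp: the report only says "the year 2040".
HOLD_UNTIL = "2040-01-01T00:00:00Z"


def interim_commands(hold_until: str = HOLD_UNTIL) -> list[list[str]]:
    """Return the interim measures as argv lists, in a safe order."""
    return [
        # 1. Stop unattended upgrades (assumed systemd unit name).
        ["systemctl", "disable", "--now", "unattended-upgrades.service"],
        # 2. Set the snap refresh hold while snapd is still running.
        ["snap", "set", "system", f"refresh.hold={hold_until}"],
        # 3. Disable snapd and its socket last, since step 2 needs snapd.
        ["systemctl", "disable", "--now", "snapd.service", "snapd.socket"],
    ]


def run_commands(commands: list[list[str]], dry_run: bool = True) -> list[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for argv in commands:
        line = shlex.join(argv)
        rendered.append(line)
        print(line)
        if not dry_run:
            # Requires root privileges on the target server.
            subprocess.run(argv, check=True)
    return rendered


if __name__ == "__main__":
    run_commands(interim_commands(), dry_run=True)
```

Running the script as written only prints the commands, which allows them to be reviewed before anything is changed on a server.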
Root Cause
These incidents occurred primarily because microk8s was able to update automatically, causing pods to restart without manual intervention. Similarly, automatic microk8s certificate renewals were found to be able to trigger restarts.
Resolution
Launchnodes has now tested and implemented an approach across its test, dev and production servers to ensure that upcoming microk8s certificate renewals and updates are monitored and alerted to the team via email, Telegram and other methods. Automatic updates to microk8s (that could result in pods restarting) are prevented using the ‘snap refresh --hold’ command.
Certificates are now also manually (rather than automatically) renewed, with certificate renewal dates monitored by the team. Lock files are now used within all microk8s environments, to prevent automatic updates of certificates.
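Monitoring of certificate renewal dates could look something like the following sketch. It is not Launchnodes’ actual tooling: the certificate path, the 30-day threshold, and the function names are assumptions made for illustration. It reads the expiry date with the `openssl x509 -enddate` command and parses it with the standard library’s `ssl.cert_time_to_seconds`.

```python
import ssl
import subprocess
import time

# Assumed location of the MicroK8s server certificate; verify on your host.
CERT_PATH = "/var/snap/microk8s/current/certs/server.crt"
SECONDS_PER_DAY = 86400


def read_not_after(cert_path: str) -> str:
    """Return the certificate's notAfter string using the openssl CLI."""
    out = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # openssl prints e.g. "notAfter=Sep  5 12:00:00 2025 GMT"
    return out.split("=", 1)[1]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and the certificate expiry."""
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def needs_renewal(not_after: str, now: float, threshold_days: int = 30) -> bool:
    """True when the certificate expires within the (assumed) threshold."""
    return days_until_expiry(not_after, now) <= threshold_days


if __name__ == "__main__":
    expiry = read_not_after(CERT_PATH)
    if needs_renewal(expiry, time.time()):
        # Placeholder: a real setup would send email or Telegram alerts here.
        print(f"ALERT: {CERT_PATH} expires on {expiry}")
```

A check like this could run on a schedule, giving the team enough lead time to plan a manual renewal in a maintenance window.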
Launchnodes receives GitHub and other notifications for microk8s updates, with the ability to schedule maintenance in the event of general patches and updates, critical vulnerabilities or other issues. Changes are tested in dedicated test environments before being applied to production servers.
Impact
Some Launchnodes validators went offline at different times over a five-day period when pods restarted. The engineering team were immediately alerted, keys were reloaded and validators were restarted in a controlled manner whenever these actions were required. Doppelganger protection and other anti-slashing measures remain in place across Launchnodes’ infrastructure.
- 5th September 2024 12:35 PM IST: MicroK8s restarted on one of the bare metal servers.
- 9th September 2024 04:25 PM IST: MicroK8s restarted on 3 cloud servers.
- 10th September 2024 08:25 PM IST: MicroK8s restarted on bare metal servers.
- 10th September 2024 10:00 PM IST: MicroK8s restarted on bare metal servers.
The Lido Node Operator Management team were kept updated by Launchnodes during this incident, and calculated that missed rewards and penalties totalled 3.2182 ETH over the five-day period. This amount was reimbursed to the Execution Layer Rewards vault by Launchnodes on October 8th 2024: