Slashing Incident involving RockLogic GmbH Validators - April 13, 2023

On April 13, 2023, DAO contributors identified 11 slashings of validators operated by RockLogic GmbH as part of the Lido protocol and notified the node operator to shut down the impacted node(s) and investigate the cause.

Contributors are currently working with the node operator to assess the full impact, whether any other validators may be affected, and the root cause.

More information and a detailed incident report will follow.

A post mortem with updates has been posted to the Lido blog, including identification of the root cause and an outline for possible next steps.

Great post mortem, thanks for sharing.

I’d suggest that the DAO use the cover fund for reimbursement, as that’s what it’s there for. I don’t like the precedent or assumption of operators reimbursing for losses. The cover fund exists for this purpose and by using it we exemplify its importance, which I believe has been overlooked.

Since lowering key limits cannot be done via EasyTrack, I would like to share that NOM contributors are proposing to lower RockLogic’s current key limit (9000) to a number close to their currently active amount (5456), e.g. something like 5800 (given that this on-chain vote would take 3 days to finalize and be enacted). The exact number will be finalized as close to the omnibus vote launch as possible.

This would effectively limit new stake distribution to RockLogic until a) internal processes and setups have been remediated and the changes communicated, and b) further action is discussed and voted on by the DAO.

EDIT:
The omnibus vote including the above proposal is live here: Lido DAO Voting UI

We sincerely apologize for the incident and the damage it has caused to all those directly involved.

We outline more of the details of the bug, as well as a way to reproduce it & our learnings from it in the following incident report:

If you have any further questions concerning this incident please feel free to contact us at any time.

The Aragon vote has started: Lido DAO Voting UI
The main phase will last for 48 hours. Please cast your votes.

Thanks for providing additional details and, importantly, the reproduction instructions. As a follow-up, it seems that the Prysmatic Labs team has already been working hard on this confirmed issue, and it appears as closed on their GitHub (keystore: Deleting keys via keymanager API may not fully delete keys · Issue #12281 · prysmaticlabs/prysm · GitHub), so I hope the fix is included in the next release.
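
For anyone who wants to guard against this class of issue regardless of client version, here is a minimal sketch of a post-deletion check: after asking a validator client to delete keys, list what it still reports as loaded and compare. The helper names are hypothetical, and the response shape (`{"data": [{"validating_pubkey": ...}]}`) is an assumption based on the standard keymanager API keystore listing. It only inspects parsed JSON and makes no network calls, and it is not RockLogic's or Prysm's actual tooling.

```python
from typing import Iterable


def normalize(pubkey: str) -> str:
    """Lower-case a hex public key and ensure a 0x prefix so comparisons are stable."""
    key = pubkey.strip().lower()
    return key if key.startswith("0x") else "0x" + key


def keys_still_loaded(deleted: Iterable[str], list_response: dict) -> list[str]:
    """Return the keys we asked to delete that the node still reports as loaded.

    `list_response` is the parsed JSON body of a keystore listing, ASSUMED to
    look like {"data": [{"validating_pubkey": "0x..."}, ...]}.
    """
    loaded = {normalize(item["validating_pubkey"]) for item in list_response.get("data", [])}
    requested = {normalize(key) for key in deleted}
    return sorted(requested & loaded)


# Hypothetical listing fetched after a delete request for keys 0xaa and 0xcc.
listing = {"data": [{"validating_pubkey": "0xAA"}, {"validating_pubkey": "0xbb"}]}
leftover = keys_still_loaded(["0xaa", "cc"], listing)
if leftover:
    print("still loaded after delete:", leftover)
```

A non-empty result would be a reason to stop a migration before importing the same keys anywhere else.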

We would like to request that the cover fund be used for reimbursement of the 11 slashed validators. We think that this incident falls within the core purpose of this fund, underscoring how important it is to have an instrument like this to mitigate risk.

Below is a summary for the upcoming snapshot vote to point to.


Incident Summary

On April 13, 2023, 11 validators operated by RockLogic GmbH were slashed due to the duplication of validator keys in two different active clusters, causing a double vote.
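
To illustrate why duplicated keys lead to a double vote, here is a minimal sketch of the slashing condition. When two clusters sign with the same key, each can produce its own attestation for the same target epoch, and two different attestations with the same target epoch are slashable. The `Vote` type is a simplified, hypothetical stand-in for real attestation data (which carries more fields), so treat this as a sketch of the rule rather than client code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """Simplified stand-in for attestation data signed by one validator key."""
    source_epoch: int
    target_epoch: int
    block_root: str


def is_double_vote(a: Vote, b: Vote) -> bool:
    """Two distinct votes for the same target epoch form a slashable double vote."""
    return a != b and a.target_epoch == b.target_epoch


# Hypothetical example: the same key is active in two clusters that see different heads.
vote_cluster_a = Vote(source_epoch=99, target_epoch=100, block_root="0xabc")
vote_cluster_b = Vote(source_epoch=99, target_epoch=100, block_root="0xdef")
print(is_double_vote(vote_cluster_a, vote_cluster_b))
```

Signing the identical vote twice is not slashable under this rule, which is why the problem only surfaces when the two clusters disagree on what to attest to.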

Lido DAO contributors alerted the RockLogic team to the slashings and RockLogic subsequently brought the affected cluster offline to mitigate potential further risk. The root cause was identified by the RockLogic team and the remaining validators were steadily brought back online.

The impact on stETH holders, from the time of the incident until the validators fully exit and are automatically withdrawn from the network on May 20th, is estimated to be ~13.77 ETH.

For more detail regarding the incident’s root cause, timeline, impact, and response, please see the Lido post mortem and RockLogic post mortem.

Staker Compensation

On April 19, 2023, RockLogic GmbH requested that the Lido DAO utilize the slashing cover fund to provide compensation to stakers (more detail regarding the cover fund).

The actual amount of compensation will be determined following the full exit and withdrawal of the 11 validators from the network, but has been estimated at approximately 13.77 ETH.

Next Steps

If there are no objections, I propose that a snapshot vote commence on Apr 20 (and run until Apr 27). DAO participants will vote on whether the slashing cover fund should be utilized to source compensation for this incident.

If the snapshot vote passes, an on-chain Aragon vote will be held after May 20th (the date the validators will be withdrawn) to proceed with on-chain execution in order to compensate stETH holders via usage of the slashing cover fund.

Lido DAO and community members are encouraged to voice their opinions regarding this matter in this thread.

Would fully support the issue going to the snapshot as soon as possible, thank you for posting!

Slashing Incident involving RockLogic GmbH Validators - April 13, 2023

Measures Already Implemented and Further Action Points

As of April 20, 2023

Following the incident of April 13, 2023, we took several actions to understand its causes and to prevent it from happening again.

Here is an overview of 1) what has been carried out so far, 2) what we are currently working on and will finalise in the coming days, and 3) what we plan for the coming weeks.

We want to share this with the community and welcome your feedback!

  1. ALREADY DONE
  • Reproduced the bug causing the failure and shared this information with the community (see earlier statement)
  • Expanded internal monitoring
  • Security checks of client configurations (e.g. whether doppelgänger protection is enabled)
  • Documented clear instructions for the migration of keys
  2. ONGOING ACTIVITIES
  • Prepared key handling guides (import keys/delete keys/move keys), which will be shared as GitHub gist markdown for node operators and Lido
  • Prepared a node update guide, which will be shared as GitHub gist markdown for node operators and Lido
  • Doppelgänger protection is active on new and unassigned standby servers; on the production nodes it is being rolled out today.
  • A slashing alerts tool (fast and reliable alerts to all devs) is in preparation
  3. FURTHER ACTION POINTS
  • Make all guides public to everyone on our website
  • Review other processes to see if bugs like this could cause similar outcomes
  • Schedule additional automated tests for such cases
  • Make use of lessons learned: tighten up the process of moving keys and configs to eliminate the risk of running into a bug
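
One of the automated tests mentioned above could be as simple as a pre-flight check that no validator key is configured in more than one cluster before anything is started. The sketch below is only an illustration under assumptions: the function name, the cluster names, and the idea of feeding it a per-cluster key inventory are hypothetical and do not describe RockLogic's actual setup.

```python
from collections import defaultdict


def find_duplicate_keys(clusters: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each public key found in more than one cluster to the clusters holding it."""
    holders: dict[str, set[str]] = defaultdict(set)
    for cluster_name, pubkeys in clusters.items():
        for pubkey in pubkeys:
            # Compare case-insensitively so 0xAB and 0xab count as the same key.
            holders[pubkey.lower()].add(cluster_name)
    return {key: sorted(names) for key, names in holders.items() if len(names) > 1}


# Hypothetical inventory gathered from each cluster before a migration.
inventory = {
    "cluster-a": ["0x01", "0x02"],
    "cluster-b": ["0x02", "0x03"],
}
duplicates = find_duplicate_keys(inventory)
if duplicates:
    print("refusing to start, keys present in multiple clusters:", duplicates)
```

A check like this complements doppelgänger protection rather than replacing it, since it works from configuration and does not observe what is actually signing on the network.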

If you have further questions or ideas, please contact us at any time.

Snapshot vote started

We’re starting the RockLogic Slashing Incident Staker Compensation Snapshot, active until Thu, 27 Apr 2023 17:00:00 GMT. Please don’t forget to cast your vote!

The relevant bug identified in Prysm’s keymanager has been addressed by the Prysmatic Labs team in the newest release (Release v4.0.3 · prysmaticlabs/prysm · GitHub), as part of PR 12284.

You can now find our key handling guides, node update guide and guide to doppelgänger protection of nodes as GitHub gists.
Please give us feedback on the processes we propose!

Import keys on node

Delete keys on node

Move keys (migration)

Update node

Doppelgänger detection for clients

The Aragon vote to set RockLogic’s key limit to 5800 was enacted: Ethereum Transaction Hash (Txhash) Details | Etherscan

To hopefully add to the discussion beyond the operational side:

The node operator in question is not ‘faultless’, as these errors and slower response-time (i.e. contributors informing the operator and not the other way around) feel like symptoms of a larger issue.

If you provide another entity that ‘specializes’ in Devops an opportunity to make ~decent $$ a year, wouldn’t you expect them to have:

  • strong monitoring,
  • & the ability to proportionally reimburse, when fault is admitted?

There are plenty of candidates eagerly waiting to join the validator set, and the standards for being a part of the curated professional validator set should be set to a professional level.

Lowering the operator’s key limit seems to be the best middle ground for the parties involved.

As I see it, Lido’s mission is to deliver on decentralization and make staking simple, not to provide a profitable business model to operators who are operating shakily.
For every $ we spend in the latter direction, we take away from the DAO’s ability to advance its overarching goals.

Following the incident of April 13, 2023, and the measures we have taken since to prevent something like this from happening again, we want to give you a status update as of today.

This also refers to the public call today which you can see here: Node Operator Community Call #6 - YouTube

SLASHING INCIDENT OVERVIEW

On April 13, 2023, 11 validators operated by RockLogic GmbH were slashed due to the duplication of validator keys in two different active clusters, causing a double vote. This was caused by an unforeseeable bug, which has since been eliminated. However, at the time of the incident, not all available measures to prevent key duplication had been taken, which led to the slashing.

INCIDENT RESPONSE

Once the slashing event was confirmed, within 15 minutes of its first occurrence, we prevented further slashings by bringing the relevant clusters offline. The failover cluster was brought back online and the remaining keys were all incrementally activated within the next three hours. Both our and Prysmatic Labs’ technical investigations confirmed the bug the next day and we made it reproducible for further analysis. Lido released a post mortem that day.

REMEDIATION ACTIVITIES

The bug has been identified and fixed by Prysmatic Labs (GH Issue #12281) as of Prysm v4.0.3.
RockLogic GmbH updated the configuration of nodes (doppelgänger protection is now used uniformly throughout) and the key handling guides, and published them on GitHub.

We are updating monitoring & alerting and procedures & guides, which will be made publicly available soon (ongoing).

We also plan to create automated tests for such cases (not started yet).

INCIDENT FOLLOW-UP

As a consequence and further precaution following the slashing, Lido DAO took an on-chain Aragon vote to limit RockLogic keys until full remediation has taken place, which passed. There is also a Lido DAO Snapshot vote for staker compensation using the cover fund still in progress.

FUTURE PROCEDURES

We plan a community discussion about which policies to implement for responding to incidents like this, and related ones (e.g. large outages), from a governance and node operator set management perspective. This will start later this week.

Snapshot vote ended

Thank you to all who participated in the RockLogic Slashing Incident Staker Compensation Snapshot, the proposal passed! :pray:
The results are:
For: 55.2M LDO
Against: 27 LDO

The proposal to use the cover fund for compensation has passed. Once the validators have fully exited (~May 20th) and the actual amount of penalties can be finalized, another vote will be set up so that the compensation can occur.

Lido on Ethereum RockLogic Weekly News Post

“Making things better by the week”

First Update 03.05.2023

INTRODUCTION

This is a new format that Lido DAO contributors and RockLogic set up to communicate in a more transparent manner with the Ethereum blockchain community.

What happened until this week:

MINOR BUG

1,000 keys were moved following a Lighthouse bug; no harm was caused.

INFRASTRUCTURE EXTENSION

We moved 500 keys in accordance with an extension of our infrastructure.

SLASHING ALERT SERVICE

As announced previously, we will establish a slashing alert service - this is due to be released May 4th, 2023.

LIDO V2 COMPLIANCE

RockLogic will be fully compliant with Lido V2 on mainnet by May 8th, 2023. So far, it has been fully tested on the testnet.

COMMUNITY DISCUSSION ON INCIDENT TREATMENT

We are still keen on input from the community in our thread on how to deal with future incidents and whether and how to set up a universal procedure for handling them:
