Slashing Incident involving RockLogic GmbH Validators - April 13, 2023

Great post mortem, thanks for sharing.

I’d suggest that the DAO use the cover fund for reimbursement, as that’s what it’s there for. I don’t like the precedent or assumption of operators reimbursing for losses. The cover fund exists for this purpose and by using it we exemplify its importance, which I believe has been overlooked.

5 Likes

Since lowering key limits cannot be done via EasyTrack, I would like to share that NOM contributors are proposing to lower RockLogic’s current key limit (9000) to a number close to their currently active amount (5456), e.g. something like 5800 (given that this on-chain vote would take 3 days to finalize and be enacted). The exact number will be finalized as close to the omnibus vote launch as possible.

This would effectively limit new stake distribution to RockLogic until a) remediation of internal processes and setups can occur and be communicated, and b) a further action is discussed and voted on by the DAO.
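To make the effect of the proposed limit concrete, here is a small worked example. It is only a sketch: it assumes that new stake can be allocated only to keys below the operator's limit, and the helper name `remaining_headroom` is made up for illustration. The figures are the ones quoted above, and the final limit may differ.

```python
def remaining_headroom(key_limit: int, active_keys: int) -> int:
    """Keys that could still receive new stake under the given limit."""
    return max(key_limit - active_keys, 0)


ACTIVE_KEYS = 5456     # RockLogic's currently active keys (from the post above)
CURRENT_LIMIT = 9000   # key limit before the vote
PROPOSED_LIMIT = 5800  # example value from the post; the final number may differ

headroom_before = remaining_headroom(CURRENT_LIMIT, ACTIVE_KEYS)   # 3544
headroom_after = remaining_headroom(PROPOSED_LIMIT, ACTIVE_KEYS)   # 344

print(f"headroom before vote: {headroom_before} keys")
print(f"headroom after vote:  {headroom_after} keys")
```

The small remaining buffer covers keys that may still be activated during the 3 days the vote needs to be enacted.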

EDIT:
The omnibus vote including the above proposal is live here: Lido DAO Voting UI

7 Likes

We sincerely apologize for the incident and the damage it has caused to all those directly involved.

We outline more of the details of the bug, as well as a way to reproduce it & our learnings from it in the following incident report:

If you have any further questions concerning this incident please feel free to contact us at any time.

7 Likes

The Aragon vote has started: Lido DAO Voting UI
The main phase will last for 48 hours. Please cast your votes.

3 Likes

Thanks for providing additional details and, importantly, the reproduction instructions. As a follow-up, it seems that the Prysmatic Labs team has already been working hard on this confirmed issue, and it appears as closed on their GitHub (keystore: Deleting keys via keymanager API may not fully delete keys · Issue #12281 · prysmaticlabs/prysm · GitHub), so I hope the fix is included in the next release.

5 Likes

We would like to request that the losses from the 11 slashed validators be reimbursed from the cover fund. We think that this incident falls within the core purpose of this fund, which underlines how important it is to have an instrument like this to mitigate risk.

8 Likes

Below is a summary that the upcoming snapshot vote can point to.


Incident Summary

On April 13, 2023, 11 validators operated by RockLogic GmbH were slashed due to the duplication of validator keys in two different active clusters, causing a double vote.
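For readers unfamiliar with why duplicated keys lead to slashing, the sketch below shows the attestation slashing conditions in simplified form. It assumes the double vote here refers to attestations, and it uses a trimmed-down `AttestationData` (the real consensus-spec container has more fields). The roots and epochs in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: str


@dataclass(frozen=True)
class AttestationData:
    # Simplified: the real container also carries a committee index.
    slot: int
    beacon_block_root: str
    source: Checkpoint
    target: Checkpoint


def is_double_vote(a: AttestationData, b: AttestationData) -> bool:
    """Two different attestations for the same target epoch."""
    return a != b and a.target.epoch == b.target.epoch


def is_surround_vote(a: AttestationData, b: AttestationData) -> bool:
    """Attestation a surrounds attestation b."""
    return a.source.epoch < b.source.epoch and b.target.epoch < a.target.epoch


def is_slashable(a: AttestationData, b: AttestationData) -> bool:
    return is_double_vote(a, b) or is_surround_vote(a, b)


# Hypothetical example: the same key signs in two clusters, each seeing a
# different head block, so the two attestations differ for the same target epoch.
source = Checkpoint(epoch=99, root="0xsrc")
from_cluster_a = AttestationData(3200, "0xhead_a", source, Checkpoint(100, "0xtgt"))
from_cluster_b = AttestationData(3200, "0xhead_b", source, Checkpoint(100, "0xtgt"))

print(is_slashable(from_cluster_a, from_cluster_b))
```

Two clusters that sign identical data would not trigger the condition, but in practice independent nodes often see slightly different chain views, which is why running the same key twice is so dangerous.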

Lido DAO contributors alerted the RockLogic team to the slashings and RockLogic subsequently brought the affected cluster offline to mitigate potential further risk. The root cause was identified by the RockLogic team and the remaining validators were steadily brought back online.

The impact on stETH holders, from the time of the incident until the validators fully exit and are automatically withdrawn from the network on May 20th, is estimated at ~13.77 ETH.

For more detail regarding the incident’s root cause, timeline, impact, and response, please see the Lido post mortem and RockLogic post mortem.

Staker Compensation

On April 19, 2023, RockLogic GmbH requested that the Lido DAO utilize the slashing cover fund to provide compensation to stakers (more detail regarding the cover fund).

The actual amount of compensation will be determined following the full exit and withdrawal of the 11 validators from the network, but has been estimated at ~13.77 ETH.

Next Steps

If there are no objections, I propose that a snapshot vote commence on Apr 20th and run until Apr 27th. DAO participants will vote on whether the slashing cover fund should be utilized to source compensation for this incident.

If the snapshot vote passes, an on-chain Aragon vote will be held after May 20th (the date the validators will be withdrawn) to proceed with on-chain execution in order to compensate stETH holders via usage of the slashing cover fund.

Lido DAO and community members are encouraged to voice their opinions regarding this matter in this thread.

6 Likes

Would fully support the issue going to the snapshot as soon as possible, thank you for posting!

3 Likes

Slashing Incident involving RockLogic GmbH Validators - April 13, 2023

Measures Already Implemented and Further Action Points

As of April 20, 2023

Following the incident on April 13, 2023, we took several actions to understand its causes and to prevent it from happening again.

Here is an overview of 1) what has been carried out so far, 2) what we are currently working on and will finalise in the coming days and 3) what we plan for the coming weeks.

We want to share this with the community and are happy for your feedback!

  1. ALREADY DONE
  • Reproduced the bug causing the failure and shared this information with the community (see earlier statement)
  • Expanded internal monitoring
  • Ran security checks of client configurations (e.g. whether doppelgänger protection is enabled)
  • Documented clear instructions for the migration of keys
  2. ONGOING ACTIVITIES
  • Prepared key handling guides (import keys/delete keys/move keys), which will be shared as GitHub gist Markdown for node operators (NOs) and Lido
  • Prepared a node update guide, which will be shared as GitHub gist Markdown for NOs and Lido
  • Doppelgänger protection is active on new and unassigned standby servers; on the production nodes it is being rolled out today.
  • A slashing alerts tool (fast and reliable alerts to all devs) is in preparation
  3. FURTHER ACTION POINTS
  • Make all guides publicly available on our website
  • Review other processes to see whether bugs like this could cause similar outcomes.
  • Schedule additional automated tests for such cases
  • Make use of lessons learned: tighten up the process of moving keys and configs to eliminate the risk of running into a bug
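
Since doppelgänger protection comes up several times in the list above, here is a minimal sketch of the general idea: at startup, a validator client stays silent for a few epochs and checks whether any of its own keys are already attesting elsewhere. This is a conceptual illustration only. The function names and input shape are hypothetical, and each client implements the feature in its own way.

```python
from typing import Iterable, Set


def doppelganger_detected(
    own_pubkeys: Set[str],
    observed_attesters: Iterable[Set[str]],
) -> Set[str]:
    """Return own keys that were seen attesting during the watch window.

    observed_attesters holds one set of attesting pubkeys per watched epoch
    (hypothetical input; a real client derives this from its beacon node).
    """
    seen: Set[str] = set()
    for epoch_attesters in observed_attesters:
        seen |= own_pubkeys & epoch_attesters
    return seen


def safe_to_start(own_pubkeys: Set[str], observed_attesters: Iterable[Set[str]]) -> bool:
    """Only begin signing if none of our keys appear active elsewhere."""
    return not doppelganger_detected(own_pubkeys, observed_attesters)


# Invented example: key 0xbb is still live on another machine.
own = {"0xaa", "0xbb"}
window = [{"0xcc"}, {"0xbb", "0xdd"}]

print(doppelganger_detected(own, window))
print(safe_to_start(own, window))
```

The trade-off is a few missed attestations at every startup in exchange for catching a still-running duplicate before it can sign conflicting messages.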

If you have further questions or ideas, please contact us at any time.

4 Likes

Snapshot vote started

We’re starting the RockLogic Slashing Incident Staker Compensation Snapshot, active until Thu, 27 Apr 2023 17:00:00 GMT. Please don’t forget to cast your vote!

The relevant bug identified in Prysm’s keymanager has been addressed by the Prysmatic Labs team in the newest release (Release v4.0.3 · prysmaticlabs/prysm · GitHub), as part of PR 12284.

1 Like

You will now find our key handling guides, our node update guide and our guide to Doppelgänger protection of nodes as GitHub gists.
Please give us feedback on the processes we propose!

Import keys on node

Delete keys on node

Move keys (migration)

Update node

Doppelgänger detection for clients

3 Likes

The Aragon vote to set RockLogic’s key limit to 5800 was enacted: Ethereum Transaction Hash (Txhash) Details | Etherscan

2 Likes

To hopefully add to the discussion beyond the operational~~

The node operator in question is not ‘faultless’, as these errors and slower response-time (i.e. contributors informing the operator and not the other way around) feel like symptoms of a larger issue.

If you provide another entity that ‘specializes’ in Devops an opportunity to make ~decent $$ a year, wouldn’t you expect them to have:

  • strong monitoring,
  • & the ability to proportionally reimburse, when fault is admitted?

There are plenty of candidates eagerly waiting to join the validator set, and the standards for being a part of the curated professional validator set should be set to a professional level.

Lowering the operator’s key limit seems to be the best middle ground for the parties involved.

As I see it, Lido’s mission is to deliver on decentralization and make staking simple, not to provide a profitable business model to operators who are operating shakily.
For every $ we spend in the latter direction, we take away from the DAO’s ability to advance its overarching goals.

5 Likes

Following the incident on April 13, 2023 and the measures we have taken since to prevent something like this from happening again, we want to give you a status update as of today.

This also relates to today’s public call, which you can watch here: Node Operator Community Call #6 - YouTube

SLASHING INCIDENT OVERVIEW

On April 13, 2023, 11 validators operated by RockLogic GmbH were slashed due to the duplication of validator keys in two different active clusters, causing a double vote. This was caused by an unforeseeable bug, which has since been fixed. However, at the time of the incident, the full extent of measures to prevent key duplication had not been taken, which led to the slashing.
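
One kind of safeguard against key duplication is a pre-flight check that compares the keys loaded in each cluster before any of them is allowed to sign. The sketch below is a hypothetical illustration, not RockLogic's actual procedure: the function name, cluster names and pubkeys are invented. It also assumes the inventory is read from what each validator client really has loaded, since the Prysm issue referenced in this thread concerned keys that were not fully deleted.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def find_duplicate_keys(clusters: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of clusters to the pubkeys they both have loaded."""
    duplicates: Dict[Tuple[str, str], Set[str]] = {}
    for name_a, name_b in combinations(sorted(clusters), 2):
        shared = clusters[name_a] & clusters[name_b]
        if shared:
            duplicates[(name_a, name_b)] = shared
    return duplicates


# Invented inventory: key 0x02 is loaded in two clusters at once.
loaded_keys = {
    "primary": {"0x01", "0x02"},
    "failover": {"0x02", "0x03"},
    "standby": set(),
}

for pair, keys in find_duplicate_keys(loaded_keys).items():
    print(f"DUPLICATE KEYS in {pair}: {sorted(keys)}")
```

A migration would only proceed when the result is empty, with doppelgänger protection as a second line of defence in case the inventory itself is wrong.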

INCIDENT RESPONSE

Once the slashing event was confirmed, within 15 minutes of its first occurrence, we prevented further slashings by bringing the relevant clusters offline. The failover cluster was brought back online and the remaining keys were all incrementally activated within the next three hours. Both our and Prysmatic Labs’ technical investigations confirmed the bug the next day, and we made it reproducible for further analysis. Lido released a post mortem that day.

REMEDIATION ACTIVITIES

The bug has been identified and fixed by Prysmatic Labs (GH Issue #12281) as of Prysm v4.0.3.
RockLogic GmbH updated the node configuration (doppelgänger protection is now used uniformly throughout) and the key handling guides, and published the guides on GitHub.

We are updating monitoring & alerting and procedures & guides, which will be made publicly available soon (ongoing).

We also plan to create automated tests for such cases (not started yet).

INCIDENT FOLLOW-UP

As a consequence and further precaution following the slashing, Lido DAO took an on-chain Aragon vote to limit RockLogic keys until full remediation has taken place, which passed. There is also a Lido DAO Snapshot vote for staker compensation using the cover fund still in progress.

FUTURE PROCEDURES

We plan a community discussion about which policies to implement for responding to future incidents like this, and related ones (e.g. large outages), from a governance and NO set management perspective. This will start later this week.

Snapshot vote ended

Thank you all who participated in the RockLogic Slashing Incident Staker Compensation Snapshot, the proposal passed! :pray:
The results are:
For: 55.2M LDO
Against: 27 LDO

1 Like

The proposal to use the cover fund for compensation has passed. Once the validators have fully exited (~May 20th) and the actual amount of penalties can be finalized, another vote will be set up so that the compensation can occur.

4 Likes

Lido on Ethereum RockLogic Weekly News Post

“Making things better by the week”

First Update 03.05.2023

INTRODUCTION

This is a new format that Lido DAO contributors and RockLogic set up to communicate in a more transparent manner with the Ethereum blockchain community.

What happened until this week:

MINOR BUG

1,000 keys were moved following a Lighthouse bug; no harm was done.

INFRASTRUCTURE EXTENSION

We moved 500 keys in accordance with an extension of our infrastructure.

SLASHING ALERT SERVICE

As announced previously, we will establish a slashing alert service - this is due to be released May 4th, 2023.

LIDO V2 COMPLIANCE

RockLogic will be fully compliant with Lido V2 on mainnet by May 8th, 2023. So far, it has been fully tested on the testnet.

COMMUNITY DISCUSSION ON INCIDENT TREATMENT

We are still keen on community input in our thread on how to deal with future incidents and whether and how to set up a universal procedure for handling them:

3 Likes

Lido on Ethereum RockLogic Weekly News Post

“Making things better by the week”

Second Update 10.05.2023

INTRODUCTION

This is a new format that Lido DAO contributors and RockLogic set up to communicate in a more transparent manner with the Ethereum blockchain community.

What happened since last week:

LIDO V2 COMPLIANCE

As announced previously, RockLogic has been fully compliant with Lido V2 on mainnet since May 8th, 2023. 500 keys have been signed for withdrawals.

SLASHING ALERT SERVICE

The first release of the slashing guard has now been configured for RockLogic and is in use.

This is of course open source, please give us your feedback here:

FALLBACK FULL NODES FOR VALIDATOR CLIENTS

The fallback feature of validator clients to other beacon clients will be rolled out for Teku and Nimbus this & next week.
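
The fallback idea can be pictured as an ordered list of beacon node endpoints, where the validator client uses the first one that is healthy. The sketch below only illustrates that concept. The function name, URLs and health check are hypothetical, and the actual failover logic in Teku and Nimbus is configured and implemented by the clients themselves.

```python
from typing import Callable, Optional, Sequence


def pick_beacon_node(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None


# Invented endpoints; the stub below pretends the primary node is down.
ENDPOINTS = ["http://primary-bn:5052", "http://fallback-bn:5052"]
DOWN = {"http://primary-bn:5052"}


def stub_health_check(url: str) -> bool:
    # Stand-in for a real health probe against the beacon node.
    return url not in DOWN


print(pick_beacon_node(ENDPOINTS, stub_health_check))
```

Note that this kind of fallback is about beacon node availability only: the keys still live in a single validator client, so it does not reintroduce the duplication risk discussed earlier in the thread.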

COMMUNICATION

We are looking forward to your contribution in this thread! Please, share your ideas, opinions, suggestions for improvement here:

2 Likes

ACTION STATUS REPORT

14.05.2023

SCOPE

Today, we want to provide you all with an overview of the activities we have completed in recent weeks, as well as their status.

WEEKLY INFORMATION UPDATES (ongoing activity)

Established a new weekly Lido Stereum News Post “Making things better by the week” to inform the community on a regular basis.

LIDO V2 COMPLIANCE (done)

RockLogic has been fully compliant with Lido V2 on mainnet since May 8th, 2023.
500 keys have been signed for withdrawals.

SLASHING ALERT SERVICE (done)

The first release of the slashing guard has been configured for RockLogic and is in use.

FALLBACK FULL NODES FOR VALIDATOR CLIENTS (in progress)

The fallback feature of validator clients to other beacon clients will be rolled out.

PUBLICATION OF GUIDES (open)

Make key handling and node update guides available for a broad audience.

PREPARATION OF GUIDES (done)

Prepared key handling and node update guides and shared them with the community.

DOPPELGÄNGER PROTECTION (done)

Doppelgänger protection of nodes is active and in use.

RECEIVED SUPPORT BY THE COMMUNITY (ongoing)

Apart from broad support for our activities, there was a vote to use the cover fund to reimburse the affected stakers, which passed.

COMMUNITY DISCUSSION ON INCIDENT TREATMENT (ongoing)

We opened up a community discussion on how to deal with future potentially critical incidents and how to set up a universal procedure to deal with them.

REVIEW INTERNAL PROCESSES (done)

We reviewed our internal processes and monitoring carefully and adjusted them where necessary.

ACTIVITIES TO FIX ORIGINAL SLASHING BUG (all done)

Reviewed other processes to see if bugs like this could possibly cause similar outcomes.

Tightened up the process of moving keys and configurations.

Ran security checks of client configurations (e.g. whether doppelgänger was enabled).

Reproduced the bug causing the failure and shared this information with the community.

3 Likes