Slashing Incident involving RockLogic GmbH Validators - April 13, 2023

Lido on Ethereum RockLogic Weekly News Post

“Making things better by the week”

Second Update 10.05.2023

INTRODUCTION

This is a new format that Lido DAO contributors and RockLogic set up to communicate in a more transparent manner with the Ethereum blockchain community.

What happened since last week:

LIDO V2 COMPLIANCE

As announced previously, RockLogic has been fully compliant with Lido V2 on mainnet since May 8th, 2023. Exit messages for 500 keys have been pre-signed for withdrawals.

SLASHING ALERT SERVICE

The first release of the slashing guard has now been configured for RockLogic and is in use.

This is, of course, open source; please give us your feedback here:

FALLBACK FULL NODES FOR VALIDATOR CLIENTS

The validator clients' fallback to alternate beacon nodes will be rolled out for Teku and Nimbus over the course of this week and next.
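
For illustration, here is a minimal sketch of the principle behind the failover (not our production configuration), assuming the standard beacon node REST API (GET /eth/v1/node/health) and hypothetical endpoint URLs:

```python
import requests

# Hypothetical endpoints for illustration; real hosts and ports are not published.
BEACON_NODES = [
    "http://beacon-primary:5052",
    "http://beacon-fallback:5052",
]

def first_healthy_node(endpoints, timeout=2):
    """Return the first beacon node that reports itself synced and healthy.

    The standard beacon API answers GET /eth/v1/node/health with HTTP 200
    when the node is synced, 206 while it is still syncing, and an error
    status otherwise. Only a fully synced node is accepted here.
    """
    for url in endpoints:
        try:
            resp = requests.get(f"{url}/eth/v1/node/health", timeout=timeout)
            if resp.status_code == 200:
                return url
        except requests.RequestException:
            continue  # node unreachable, try the next candidate
    return None

if __name__ == "__main__":
    node = first_healthy_node(BEACON_NODES)
    print(f"Using beacon node: {node}" if node else "No healthy beacon node available")
```

In practice the validator clients handle this natively once several beacon node endpoints are configured; the script above only illustrates the selection logic.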

COMMUNICATION

We are looking forward to your contributions in this thread! Please share your ideas, opinions, and suggestions for improvement here:

ACTION STATUS REPORT

14.05.2023

SCOPE

Today, we want to provide you all with an overview of the activities we have carried out in recent weeks, as well as their current status.

WEEKLY INFORMATION UPDATES (ongoing activity)

Established a new weekly Lido Stereum News Post “Making things better by the week” to inform the community on a regular basis.

LIDO V2 COMPLIANCE (done)

RockLogic has been fully compliant with Lido V2 on mainnet since May 8th, 2023.
500 keys have been signed for withdrawals.

SLASHING ALERT SERVICE (done)

The first release of the slashing guard has been configured for RockLogic and is in use.

FALLBACK FULL NODES FOR VALIDATOR CLIENTS (in progress)

The fallback feature of validator clients to other beacon clients will be rolled out.

PUBLICATION OF GUIDES (open)

Make key handling and node update guides available for a broad audience.

PREPARATION OF GUIDES (done)

Prepared key handling and node update guides and shared them with the community.

DOPPELGÄNGER PROTECTION (done)

Doppelgänger protection of nodes is active and in use.

RECEIVED SUPPORT BY THE COMMUNITY (ongoing)

In addition to broad support for our activities, a vote to use the cover fund to reimburse stakers affected by the slashing has passed.

COMMUNITY DISCUSSION ON INCIDENT TREATMENT (ongoing)

We opened up a community discussion on how to handle potentially critical incidents in the future and how to set up a universal procedure for dealing with them.

REVIEW INTERNAL PROCESSES (done)

We reviewed our internal processes and monitoring carefully and adjusted them where necessary.

ACTIVITIES TO FIX ORIGINAL SLASHING BUG (all done)

Reviewed other processes to see if bugs like this could possibly cause similar outcomes.

Tightened up the process of moving keys and configurations (a sketch of the kind of pre-migration check this implies follows below).

Performed security checks of client configurations (e.g. whether doppelgänger protection was enabled).

Reproduced the bug causing the failure and shared this information with the community.
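
To make the tightened key-migration process more concrete, here is an illustrative sketch (not our internal tooling) of a pre-flight check that refuses to proceed until the old validator client reports no remaining keystores, using the standard keymanager API; the host and token path are placeholders:

```python
import sys

import requests

# Placeholder values for illustration; real hosts and token locations differ.
OLD_VC_KEYMANAGER = "https://old-validator-host:7500"
API_TOKEN_FILE = "/var/lib/validator/api-token.txt"

def remaining_keys(keymanager_url: str, token_file: str) -> list:
    """List validator public keys still loaded on the old validator client.

    Uses the standard keymanager API (GET /eth/v1/keystores), which the major
    validator clients expose behind a bearer token.
    """
    with open(token_file) as fh:
        token = fh.read().strip()
    resp = requests.get(
        f"{keymanager_url}/eth/v1/keystores",
        headers={"Authorization": f"Bearer {token}"},
        timeout=5,
    )
    resp.raise_for_status()
    return [entry["validating_pubkey"] for entry in resp.json()["data"]]

if __name__ == "__main__":
    keys = remaining_keys(OLD_VC_KEYMANAGER, API_TOKEN_FILE)
    if keys:
        print(f"ABORT: {len(keys)} keys still present on the old host")
        sys.exit(1)
    print("Old host reports no loaded keys; migration may proceed")
```

Since the original bug was precisely a key manager reporting a deletion that had not happened, a check like this is only one layer of defence; taking the original host fully offline (or nuking it) remains the decisive safeguard.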

Lido on Ethereum RockLogic Weekly News Post

“Making things better by the week”

Third Update 17.05.2023

INTRODUCTION

This is a new format that Lido DAO contributors and RockLogic set up to communicate in a more transparent manner with the Ethereum blockchain community.

What happened in the past 3 weeks:

ETHEREUM NETWORK ISSUES

We analyzed the impact on our infrastructure of the Ethereum network issues of May 11th and 12th, 2023, which were related to a Prysm bug.

FALLBACK FULL NODES FOR VALIDATOR CLIENTS

We moved additional keys to validator client failover infrastructure, bringing the total to 3 300 keys on multiple Lighthouse, Nimbus & Teku validator clients.

Our investigation of the Teku consensus client port forwarding issue is still ongoing.

LIDO V2 COMPLIANCE

No errors occurred during the Lido V2 mainnet launch.

The dry-run mode was turned off successfully. Confirmed as ready for V2 mainnet are:

  • Multiple access nodes for EJECTOR and KAPI up and running
  • EJECTOR up and running in LIVE mode (dry_run = false); see the sketch below
  • KAPI up and running
  • 500 exit messages pre-signed

In addition, outdated KAPI docs and issues with Nimbus were reported to the Lido tooling team.
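
For readers unfamiliar with the ejector, the following is a rough conceptual sketch of what running it live (dry_run = false) implies. It is a simplification of the Lido validator-ejector, with placeholder paths and endpoints rather than its actual code; in the real tool, exit requests arrive via the on-chain Validator Exit Bus Oracle, which is elided here:

```python
import json
from pathlib import Path

import requests

# Placeholder configuration; real deployments set these per operator.
BEACON_NODE = "http://beacon-node:5052"
MESSAGES_DIR = Path("/var/lib/ejector/exit-messages")  # pre-signed exit messages
DRY_RUN = False  # live mode, as described above

def load_presigned_exit(pubkey: str):
    """Look up a pre-signed voluntary exit message for the given validator key."""
    path = MESSAGES_DIR / f"{pubkey}.json"
    return json.loads(path.read_text()) if path.exists() else None

def submit_exit(message: dict) -> None:
    """Broadcast a signed voluntary exit via the standard beacon API."""
    resp = requests.post(
        f"{BEACON_NODE}/eth/v1/beacon/pool/voluntary_exits",
        json=message,
        timeout=10,
    )
    resp.raise_for_status()

def handle_exit_request(pubkey: str) -> None:
    """React to an exit request for one of the operator's validators."""
    message = load_presigned_exit(pubkey)
    if message is None:
        print(f"No pre-signed exit message found for {pubkey}")
        return
    if DRY_RUN:
        print(f"[dry run] would submit exit for {pubkey}")
    else:
        submit_exit(message)
        print(f"Submitted voluntary exit for {pubkey}")
```

KAPI plays the complementary role of serving validator key and status information to the rest of the tooling.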

An update on the slashing incident: as of May 20, 2023 (specifically Epoch 202374 or 202375, depending on the validator), the 11 validators in question are now withdrawable (and have thus stopped accumulating penalties). The remaining balance for these validators will be withdrawn when the withdrawals sweep cycles around to those indices, which should be within the next 1-2 days.

The breakdown of the total penalties and missed rewards can be found in the below-attached image. Actual total penalties have been calculated as the change in balance of the relevant validators from the slashing epoch for each validator up until the epoch at which each became withdrawable. Missed rewards have been calculated in the same manner as in the post mortem. Note that there was an unexpected network event that caused a brief period of inactivity leak, which introduced a minor amount of penalties that could not have been calculated earlier. However, the actual amount of penalties was still less than the amount initially projected in the post mortem.
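
For anyone who wants to re-derive the actual penalty figure, the calculation boils down to two balance lookups per validator. A minimal sketch, assuming access to an archival beacon node that exposes the standard API (the node URL is a placeholder, and the real validator indices and epochs are those listed in the post mortem):

```python
import requests

# Placeholder; historical balances require a beacon node with archival state access.
BEACON_NODE = "http://archive-beacon-node:5052"
SLOTS_PER_EPOCH = 32
GWEI_PER_ETH = 10**9

def balance_at_epoch(validator_index: int, epoch: int) -> int:
    """Fetch a validator's balance (in Gwei) at the first slot of an epoch."""
    slot = epoch * SLOTS_PER_EPOCH
    resp = requests.get(
        f"{BEACON_NODE}/eth/v1/beacon/states/{slot}/validator_balances",
        params={"id": validator_index},
        timeout=30,
    )
    resp.raise_for_status()
    return int(resp.json()["data"][0]["balance"])

def actual_penalty_eth(validator_index: int, slashing_epoch: int, withdrawable_epoch: int) -> float:
    """Actual penalty = balance at the slashing epoch minus balance once withdrawable."""
    before = balance_at_epoch(validator_index, slashing_epoch)
    after = balance_at_epoch(validator_index, withdrawable_epoch)
    return (before - after) / GWEI_PER_ETH

# Usage (indices and epochs per validator come from the post mortem):
# total = sum(actual_penalty_eth(idx, slashed_at, wd_at) for idx, slashed_at, wd_at in validators)
```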

For those who wish to check or re-perform the calculations, there are two options:

Note that penalties for specific duties (attestations, proposals, etc.) are based on calculation and cannot be queried directly from chain data.

As voted upon by the DAO earlier, the amount will be covered via the slashing cover fund. This will occur via the burning of stETH shares representing 13.45978634 stETH, thereby increasing the value of the remaining stETH tokens. An on-chain Aragon vote will follow in which the funds will be transferred to the burn contract, to be burned with the following on-chain accounting oracle report.
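
To illustrate why burning shares compensates stakers: stETH balances are derived from shares via the protocol's share rate (total pooled ether divided by total shares), so removing shares while the pooled ether stays unchanged raises every remaining holder's balance. A toy calculation with made-up totals (the real figures live on-chain):

```python
# Toy numbers purely for illustration; the real totals come from the Lido contracts.
total_pooled_ether = 7_000_000.0   # ETH backing all stETH
total_shares = 6_500_000.0         # total stETH shares before the burn
burned_steth = 13.45978634         # amount covered from the slashing cover fund

# A holder's stETH balance = holder_shares * total_pooled_ether / total_shares
rate_before = total_pooled_ether / total_shares

# Burning removes the shares that currently represent 13.45978634 stETH,
# while the ether backing the pool stays the same.
burned_shares = burned_steth / rate_before
rate_after = total_pooled_ether / (total_shares - burned_shares)

print(f"share rate before: {rate_before:.9f} ETH per share")
print(f"share rate after:  {rate_after:.9f} ETH per share")
# Every remaining holder's stETH balance rises slightly; in aggregate the
# increase offsets the slashed amount that the cover fund absorbed.
```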

As with any on-chain motion, operational safety is the first priority. The total amount to be compensated is now known (13.45978634 stETH), so the on-chain vote, along with a detailed test suite, will be prepared in the near future (ETA is about two weeks from now), but not as part of the next omnibus vote. Sorry for the delay in execution.

Lido on Ethereum RockLogic News Post

Fourth Update 31.05.2023

Lately, things have been running smoothly again, which gives us confidence that we are on the right path. Therefore, we will continue this news post on a bi-weekly basis.

FALLBACK FULL NODES FOR VALIDATOR CLIENTS

We moved additional keys to validator clients with failovers - we now have a total of 4 300 keys with failover beacon nodes.

FEATURES & REQUIREMENTS

With fewer new features and requirements from Lido/RockLogic to develop and introduce, we are changing the frequency of this post to a bi-weekly update.

Lido on Ethereum RockLogic News Post

Fifth Update 14.06.2023

Welcome to our fifth update, in which we can happily report that things are going according to plan!

NODES FOR VALIDATOR CLIENTS

We successfully integrated 2 full nodes running Erigon for use by validator clients. Some validator clients have been connected to them since June 5th. We observed some minor issues, which have already been brought to the attention of the affected client teams.

OUTAGE & REIMBURSEMENT

There was a downtime of roughly 8 hours affecting 1 500 active validator keys. We plan to reimburse the losses and have already asked the Lido NOM team about calculating them.

FALLBACK FULL NODES FOR VALIDATOR CLIENTS

We moved additional keys to validator clients with failovers - we now have a total of 5 300 keys with failover beacon nodes.

CHANGE OF NODE OPERATOR ADDRESS

A request to change our node operator address to a multi-sig wallet has been posted to: Node Operator Registry - Name & Reward address change - #14 by stefa2k. The Lido NOM team has already confirmed it, and we are looking forward to a successful vote on this topic.

Lido on Ethereum RockLogic News Post

Sixth Update 29.06.2023

FALLBACK FULL NODES FOR VALIDATOR CLIENTS

All 5 800 keys are now on validator clients with failover beacon nodes configured.

OPTIMIZATIONS

Our main focus was on optimizing setups to enhance performance and stability.

CLIENT ISSUES

In the course of our work as a Lido NO, we discovered 3 new issues with a client (relating to doppelgänger protection, the failover mechanism, and the block proposal algorithm). Work with the client team to identify the cause and resolve the issues is ongoing.

OUTAGE & REIMBURSEMENT

Reimbursement for the outage affecting 1 500 validators is currently under review on our side. We expect to finish the internal review by the middle of next week and will communicate here in the forums right after.

HARDWARE EXTENSIONS

In the past 2 weeks we significantly extended the hardware used for Lido node operation, resulting in better performance and more resources for additional nodes.

The on-chain vote for the Slashing Incident Staker Compensation (items 1-5) passed and was executed on the 30th of June.

As a final part of this thread, and following from the ongoing discussions about establishing policy around large incidents (Discussion - Treatment of Potentially Harmful Incidents), I propose that we proceed with a discussion and a vote (if deemed necessary) in order to bring this topic to a conclusion. I will present my thoughts using a loose framework for ad hoc consideration of the slashing event (linked below), but obviously any thoughts, regardless of whether this same framework is used or not, are useful.

Using the framework outlined in my post in the referenced thread, my view is the below:


Factors around the incident itself

  • was the act malicious or not

    No

  • the proximate cause of the event

    1. Peculiarity of node setup + lack of fallback node pair, which led to keys being moved for liveness following a node failure
    2. A technical bug (i.e. the key manager saying it did something when it did not) providing false information / confirmation that keys had been deleted from the original node
    3. Lack of additional preventive measures (e.g. keeping original node offline, or nuking it)
  • whether any best practices, infrastructure setups or configurations, common safety measures, or reasonable processes / mechanisms could have prevented the slashing from happening

    • Yes. A different type of setup could have prevented this (e.g. use of fallback node pairs or threshold signing); regardless of setup, additional measures (e.g. nuking the original host) would have prevented this from happening.
  • how quickly was the issue identified and by whom

    • Quickly but not immediately (i.e. within ~10-12 minutes), and not by the node operator (although there is reason to believe that it would have been caught eventually, i.e. relevant monitoring was in place but not optimal)
  • how quickly was the issue resolved

    • Impact was mitigated quickly once the issue was observed; resolution, namely updating the infrastructure and modifying processes to be more resilient, took place relatively quickly and in a transparent manner; see the weekly updates in this thread (an example)

Consequence considerations:

  • Which “module” did this happen in; or, more broadly, what are the trust assumptions associated with the affected validators (e.g. are the validators unbonded)
    • The event happened in the curated operator module where Node Operators are entrusted with running bond-less validators.
  • impact of the event (in case of slashing, finance impact can be small but damage to trust can be high, etc. for example: how does the event affect the trust assumptions between stakers, the DAO, and the NO)
    • While the financial impact was relatively low (e.g. compared to daily rewards), in this module trust assumptions are very high (basically implicit) and stakers do not have any kind of assurance or recourse against these events (e.g. insurance, unless they opt in individually). As such, an event such as a slashing, which is seen by the network as an attack, can be considered something that may necessitate a reset of trust assumptions.
  • extenuating circumstances (what are the pros/cons of this decision, and what other substantial things may need to be taken into account – e.g. does the NO somehow bring key value to the protocol?)
    • RockLogic GmbH is an NO that has been doing a lot in terms of client diversity (on both the EL and CL (including validator client) side) and infrastructure diversity/resilience (bare metal only as of the most recent quarter), and is a very active participant in the Ethereum staking ecosystem, creating tools for stakers such as Stereum, Synclink, SlashingGuard, etc.
  • what other options there are for the node operator to participate
    • Currently none, as no other modules are live.
  • what is the status of remediation of the issue
    • As per the updates provided by RockLogic GmbH, the issue has been remediated.
      • The technical bug in the Prysm key manager has been patched.
      • They have moved to an infra model with explicit multiple EL+CL node pairs in fallback.
      • Their processes for key migration have been updated to explicitly nuke original nodes/hosts to prevent possible slashing.
      • All internal processes were reviewed, and some were made publicly available on the Lido research forums for the benefit of others.
  • is there a way to gauge likelihood of something like this happening again, and if so what is the assessed likelihood
    • Based on above remediation, if procedures are followed as per the updates implemented by the Operator, then likelihood is very low (note: qualification here doesn’t mean I have reason to believe they won’t be, just that it’s important to clarify that there is a non-ensurable element to the likelihood assessment).
  • how can remediation be assessed and, if it can, is the remediation deemed satisfactory
    • Through review of relevant documentation and code (e.g. in case of technical bugs). With regards to execution, only assurance from the operator can be provided. In the future things like tests of internal controls and certifications of this (e.g. via SOC IIs) may be measures that can add additional comfort, however even this does not provide assurance of execution.

Consequence options

  • Do nothing
  • Warning (do nothing w/ the condition that the next time the consequence is one/any of the below)
  • Limit the Node Operator’s key count for a certain period of time
  • Decrease the Node Operator’s key count (by prioritizing those keys for exit)
  • Offboard the operator (with the ability to rejoin the permissioned set at a later time)
  • Offboard the operator (without the ability to rejoin the permissioned set at a later time)

Based on the above evaluation, and the described possible outcomes for slashings in the curated operator set, I believe that the appropriate course of governance action in this case would be something between “Do nothing” and “Offboard the operator (with the ability to rejoin)”. “Do nothing” and “limit the node operator’s key count” have effectively already occurred, as this was part of the response following the incident. So the question at hand is what the next step would be (does the limit stay, is another option chosen, or is the limit lifted). At this juncture, I believe that there should be at least rough consensus on how to move forward, and discussion and input from relevant stakeholders should take place before any specific action is taken (if at all). A vote isn’t strictly necessary for some of the outcomes, but may be useful for signaling nonetheless.

Personal opinion
Given the lack of clarity around this topic up until now (for example, no already-laid-out consequences for slashings / serious incidents), the length of time since the slashing, the operator’s overall contribution to the ecosystem, and the successful and timely implementation of remediative actions, I personally would lean towards “Limit the Node Operator’s key count for a certain period of time”. Since the key count has already been limited shortly following the slashing event (discussion, vote), and no issues have been noted with the NO’s validators apart from performance issues which the operator has compensated stakers for, I would propose that the NO’s key count rise together with the coming (pending DAO votes) Ethereum Wave 5 cohort, i.e. once new node operators are added to the set (which has a timeframe of roughly mid-August to mid-September). When the new cohort reaches the level of RockLogic (new node operators are always prioritized for deposits), RockLogic would then also re-commence adding validators to the validator pool.

EDIT:
The post has been updated to add links to Stereum, SyncLink, and SlashingGuard, which had mistakenly been left with only an initial “add link” note rather than a fully functioning link.

As usual with @Izzy, I find his analysis and description very good and concise, and I agree both with the assessment (from what I have read) and with the suggested consequence.

Dear Lido Community!

Following the latest developments concerning the slashing incident of April 13th, 2023, and particularly Izzy's post from July 27th, 2023, we would like to take the opportunity to lay out our position on the matter.

As you probably know, eleven validators associated with our node were slashed on that day due to a bug affecting our operation. The losses have since been reimbursed, and we, with the support of the community, have made considerable efforts to prevent similar incidents from happening to us or to other NOs in the future.

To do this, we have actively engaged in the forums over the past months, providing transparent explanations of our actions and their progress, and setting in motion a discussion about what we and others could generally improve when running nodes. We value the feedback and support received during this period, which you can see here:

We also greatly appreciate the work of Lido’s Node Operator Management, and the efforts NOM has put forth to support us before, during and after the incident. Also, we want to express our gratitude towards NOM, and Izzy in particular, for their positive appraisal of our remediation actions. We are happy to see that our efforts to repair the damage and to further the security of the Ethereum network have been fruitful and are very much appreciated by the community.

For everyone who does not know yet, we also want to mention that, in addition to our role as a Lido NO, our team is actively involved in the Stereum project. Stereum contributes to the decentralization of the Ethereum ecosystem by providing Ethereum node setup software that facilitates staking. We have successfully integrated various Lido services, including KAPI and Ejector, into Stereum. By doing so, we are actively working towards building a stronger, more decentralized Ethereum network. This goes beyond our responsibilities as an NO and contributes to building towards Lido’s own decentralized future.

Moreover, we are one of the operators most intensively driving minority client usage (consensus and execution).

We are proud of these contributions, and again want to thank Izzy for explicitly mentioning the value they add to Lido’s operation.

Of course, we are not happy that, notwithstanding our best efforts, this slashing incident happened to us - but we like to see it as a major investment in a better future, creating a precedent to which we all jointly had to react by now setting up a common process for dealing with incidents like these and the parties concerned.

We fully support the set of consequences proposed by Izzy, because we believe that open communication, collaboration, and a common set of rules are essential in addressing and resolving any kind of incident effectively. This includes crisis management - if needed - as an important part of the path to developing a resilient and trusted Ethereum environment.

What we are unhappy with, though, is the idea of still limiting the number of keys with which we can operate. For the contributions described above, we do need more keys to move forward with our development. We understand the position that some kind of consequence was in order, which took the form of the immediate freezing of our key limit.

Thus, our current key limit was imposed as a direct response to the incident back in April - an understandable precautionary measure at the time, until things were fixed again. But now that they are, this key limit is starting to affect our work.

Generally, a key limit carries major disadvantages, as keys become less decentralized. It slows down the long-term development of Stereum and worsens security assumptions - and in our estimation all this affects not only us, but also Lido’s future growth and decentralization efforts, and thus the whole Ethereum community.

All this is to say that we would now like to take a major step forward, bringing the incident of April 13th, 2023 to a close, and concentrate on what we do best: putting our hearts into the development of Stereum as a trusted contribution to Lido’s operation and to a strong Ethereum network and community.

We are fully aware of our responsibility, and we kindly request your support in the upcoming DAO vote to remove our key limit. We believe that the initial key limit has served its purpose; we have already faced its consequences, and a significant period of time has passed since the incident occurred.

Having continuously demonstrated our commitment to ensuring security and reliability for Lido stakers & the Ethereum network, we are willing to keep doing so in the future - and we now need your support to proceed!

Thank you all for your consideration.

Stefan Kobrc & the RockLogic Team

@Izzy, thanks a lot for putting everything together. Really appreciated!

I would suggest moving the decision on the consequence options (proposing the three below) to a DAO vote, as I think it may be hard to find consensus in the forum discussion:

  • either release the Node Operator from the current limit (5 800 keys) as soon as possible after the Snapshot vote ends (I won’t say immediately, as Easy Track limit increases require at least 72 hours);
  • or release the Node Operator from the current limit when the new Node Operators from Wave 5 reach the level of 5 800 keys;
  • or keep the current limit (5 800 keys) the Node Operator has.

I added the last option because, without it, this would not really be a democratic decision, as only “yes, now” and “yes, but later” options would be offered. We need to take into account that some voters would like to keep the limit in place. If they do not have an option to say “no”, they will simply ignore the vote, which I think is not good for governance as a whole.

I do understand that deciding everything via DAO vote at scale is not the best option (and, fingers crossed, we won’t need to resolve such incidents in the future), but this time it gives us an opportunity to end this story.

Also, I believe it may be nice to include some kind of flow for how we wrap up such incidents (what we should do with keys: limit them, limit them for some time, etc.), because an ad-hoc approach seems a bit controversial to me. If it fits the “guardrails” for the treatment of potentially harmful incidents that you mentioned here, that would be great.

I would suggest moving the decision on the consequence options (proposing the three below) to a DAO vote, as I think it may be hard to find consensus in the forum discussion:

I agree. Perhaps the vote itself doesn’t need to necessarily “pass”, but it would at least be useful to get signaling about general community sentiment, since it seems there are no other “directions / options” being brought up at the moment and we should progress the discussion.

I am thinking that the vote would have the options:

  • Proceed with increasing key limit later (with Wave 5 cohort)
  • Proceed with increasing key limit at NO’s earliest convenience
  • None of the above (in effect this means keep limit where it is for now, implication being that either more discussion / analysis is needed or perhaps a more conservative consequence).

What do you think about something like that?

Also agree with this. I am working on a guardrails approach, but it’s slow going trying to find the right balance between being too prescriptive and being too vague / abstract; I hope to have something to share in the next week or so!

I really like these, especially the last one “none of the above”, as it is broader.
Will be looking forward to the snapshot and vote!

Snapshot vote started

The April Slashing Incident: Key Limit Follow-up Snapshot has started! Please cast your votes before Thu, 10 Aug 2023 14:00:00 GMT.

Dear Lido Community,

welcome to our monthly update on the recent actions and improvements we have made to our Ethereum staking infrastructure. We believe in maintaining transparency and keeping you informed about the changes that directly impact your staking experience.

IMPROVED VALIDATOR CLIENT CONFIGURATION
We have successfully completed the migration of all keys to our validator clients. This transition ensures smoother operations and better resilience in case of unexpected events. Potential failures will now be handled automatically through our failover beacon node configuration, aiming to minimize downtime and ensure consistent staking performance.

HARDWARE EXTENSION AND PERFORMANCE BOOST
To further enhance our infrastructure’s capabilities, we have added additional cluster members to our setup, resulting in improved performance and increased reliability. These upgrades enable us to handle larger staking volumes and provide you with a more robust and stable staking environment.
Further hardware extensions and the replacement of faulty hardware are planned for this month.

NO FURTHER KEY MIGRATION
Since the last post in July we have not migrated any more keys, as this was no longer necessary.

REIMBURSEMENT ON THE WAY
Following the April slashing incident, the reimbursement for the validators concerned is still in progress; please see the details here:

DAO SNAPSHOT VOTE ENDING TOMORROW
Concerning the lifting of our key limit that resulted from the April slashing incident, there is a DAO Snapshot vote which we kindly ask you all to participate in. When voting, please consider our ongoing contribution to the Ethereum blockchain and bear in mind that the current key limit might impede our future work, both laid out in detail here:

You can vote here:
https://snapshot.org/#/lido-snapshot.eth/proposal/0xf08cf873c2887e16484dc24873baef4b461ab2bb1b6371b3fcb7b848dd0b2302

We thank you all for your confidence in our work so far and are looking forward to your decision on Thursday, as we need clarity to proceed with our development for a safe Ethereum environment. Thank you for being a part of the Lido community and for your continued trust in RockLogic!

Stefan Kobrc & the RockLogic Node Operator Team

Snapshot vote ended

The April Slashing Incident: Key Limit Follow-up Snapshot vote concluded!
The results are:
  • Lift Key Limit (KL) immediately: 1.3M LDO
  • Lift KL after Wave 5 catches up: 51.9M LDO
  • None of the above: 49.0k LDO
