To hopefully add to the discussion beyond the operational:
The node operator in question is not ‘faultless’, as these errors and the slower response time (i.e. contributors informing the operator rather than the other way around) feel like symptoms of a larger issue.
If you provide another entity that ‘specializes’ in Devops an opportunity to make ~decent $$ a year, wouldn’t you expect them to have:
& the ability to proportionally reimburse, when fault is admitted?
There are plenty of candidates eagerly waiting to join the validator set, and the standards for being a part of the curated professional validator set should be set to a professional level.
Lowering the operators key limit seems to be the best middle-ground for the parties involved.
I see Lido’s mission as delivering on decentralization and making staking simple – not providing a profitable business model to operators who are operating shakily.
For every $ we spend in the latter direction, we take away from the DAO’s ability to further its overarching goals.
On April 13, 2023, 11 validators operated by RockLogic GmbH were slashed due to the duplication of validator keys in two different active clusters, causing a double vote. This was caused by an unforeseeable bug, which has since been eliminated. However, at the time of the incident, the full extent of measures to prevent key duplication was not taken, leading to the slashing.
When the slashing event was confirmed within 15 minutes of its first occurrence, we prevented further slashings by bringing the relevant clusters offline. The failover cluster was brought back online and the remaining keys were all incrementally activated within the next three hours. Both our and Prysmatic Labs’ technical investigations confirmed the bug the next day, and we made it reproducible for further analysis. Lido released a post mortem that day.
The bug has been identified and fixed by Prysmatic Labs (GH Issue #12281) as of Prysm v4.0.3.
RockLogic GmbH updated the configuration of nodes (doppelganger protection now used uniformly throughout) and its key handling guides, and published those on GitHub.
We are updating monitoring & alerting and procedures & guides, which will be made publicly available soon (ongoing).
We also plan to create automated tests for such cases (not started yet).
As a consequence and further precaution following the slashing, Lido DAO took an on-chain Aragon vote to limit RockLogic keys until full remediation has taken place, which passed. There is also a Lido DAO Snapshot vote for staker compensation using the cover fund still in progress.
We plan a community discussion about which policies to implement for future reactions upon incidents like this, and related ones (eg large outages) from a governance and NO set management perspective. This will start later this week.
The proposal to use the cover fund for compensation has passed. Once the validators have fully exited (~May 20th) and the actual amount of penalties can be finalized, another vote will be set up so that the compensation can occur.
An update on the slashing incident: as of May 20, 2023 (specifically Epoch 202374 or 202375, depending on the validator), the 11 validators in question are now withdrawable (and have thus stopped accumulating penalties). The remaining balance for these validators will be withdrawn when the withdrawals sweep cycles around to those indices, which should be within the next 1-2 days.
The breakdown of the total penalties and missed rewards can be found in the image attached below. Actual total penalties have been calculated as the change in balance of each relevant validator from its slashing epoch up until the epoch at which it became withdrawable. Missed rewards have been calculated in the same manner as in the post mortem. Note that there was an unexpected network event that caused a brief period of inactivity leak, which introduced a minor amount of penalties that could not have been calculated earlier. However, the actual amount of penalties was still less than the amount initially projected in the post mortem.
Note that penalties for specific duties (attestations, proposals, etc.) are based on calculation and cannot be queried directly from chain data.
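To make the methodology above concrete, here is a minimal sketch (with made-up balances and a hypothetical helper function, not the actual incident figures) of computing a validator's actual penalty as the balance delta between the slashing epoch and the withdrawable epoch:

```python
# Hypothetical sketch of the penalty calculation described above:
# actual penalty = balance at slashing epoch - balance at withdrawable epoch.
# Balances are in Gwei, as a beacon node API would return them; the numbers
# below are made up for illustration and are NOT the real incident figures.

GWEI_PER_ETH = 10**9

def actual_penalty_eth(balance_at_slashing_gwei: int,
                       balance_at_withdrawable_gwei: int) -> float:
    """Penalty = change in balance from the slashing epoch until the
    epoch the validator became withdrawable."""
    return (balance_at_slashing_gwei - balance_at_withdrawable_gwei) / GWEI_PER_ETH

# Example: a validator that went from 32.1 ETH to 31.05 ETH over that window
penalty = actual_penalty_eth(32_100_000_000, 31_050_000_000)
print(f"{penalty:.2f} ETH")  # 1.05 ETH
```

Summing this per-validator delta over all 11 validators (plus the separately calculated missed rewards) yields the totals shown in the image.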
As voted upon by the DAO earlier, the amount will be covered via the slashing cover fund. This will occur via the burning of stETH shares representing 13.45978634 stETH, thereby increasing the value of other stETH tokens. An on-chain Aragon vote will follow, in which the funds will be transferred to the burn contract, to be burned with the following on-chain accounting oracle report.
As with any on-chain motions, the safety of operations is the first priority. The total amount to be compensated is now known (13.45978634 stETH), so the on-chain vote, along with a detailed test suite, will be prepared soon (ETA is about two weeks from now), but not in the next omnibus vote. Sorry for the delay in execution.
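The mechanics of why burning shares compensates stakers can be sketched in simplified form. stETH balances are derived as shares multiplied by the protocol-wide rate totalPooledEther / totalShares, so removing shares while the pooled ether stays constant raises that rate for everyone else. The totals below are hypothetical; only the 13.45978634 stETH figure comes from the post above.

```python
# Simplified sketch of stETH share-burn mechanics. The protocol totals here
# are HYPOTHETICAL illustration values, not real Lido state.

def steth_per_share(total_pooled_ether: float, total_shares: float) -> float:
    """stETH value of one share: pooled ether divided by total shares."""
    return total_pooled_ether / total_shares

total_pooled_ether = 7_000_000.0   # hypothetical protocol-wide ETH backing stETH
total_shares = 6_500_000.0         # hypothetical total shares outstanding

rate_before = steth_per_share(total_pooled_ether, total_shares)

# Burn the shares corresponding to 13.45978634 stETH; pooled ether is unchanged.
shares_burned = 13.45978634 / rate_before
rate_after = steth_per_share(total_pooled_ether, total_shares - shares_burned)

# Every remaining share is now worth slightly more stETH than before the burn.
assert rate_after > rate_before
```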
Welcome to our fifth update, in which we can happily report that things are going according to plan!
NODES FOR VALIDATOR CLIENTS
We successfully integrated two full nodes running Erigon for use by validator clients. Some validator clients have been connected to them since June 5th. We observed some minor issues, which have already been brought to the attention of the affected client teams.
OUTAGE & REIMBURSEMENT
There was a downtime of roughly 8 hours affecting 1 500 active validator keys. We are planning to reimburse the damages and have already asked the Lido NOM team about calculating them.
FALLBACK FULL NODES FOR VALIDATOR CLIENTS
We moved additional keys to validator clients with failovers - we now have a total of 5 300 keys with failover beacon nodes.
All 5 800 keys are now on validator clients with failover beacon nodes configured.
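The failover idea behind this setup can be illustrated with a toy sketch: a validator client configured with several beacon node endpoints uses the first healthy one. Real clients implement this far more robustly (e.g. Lighthouse's `--beacon-nodes` flag accepts a comma-separated endpoint list); the function and endpoints below are purely illustrative, not any client's actual code.

```python
# Toy sketch of beacon node failover for a validator client: given an ordered
# list of endpoints, use the first one that passes a health check. Endpoint
# URLs and the health check below are hypothetical illustration values.

from typing import Callable, Sequence

def pick_beacon_node(endpoints: Sequence[str],
                     is_healthy: Callable[[str], bool]) -> str:
    """Return the first endpoint that passes the health check."""
    for url in endpoints:
        if is_healthy(url):
            return url
    raise RuntimeError("no healthy beacon node available")

# Example: the primary is down, so the failover node is selected.
nodes = ["http://primary:5052", "http://failover:5052"]
selected = pick_beacon_node(nodes, is_healthy=lambda url: "failover" in url)
print(selected)  # http://failover:5052
```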
Our main focus was on optimizing setups to enhance performance and stability.
In the course of our work as a Lido NO, we discovered 3 new issues with a client (relating to doppelgänger protection, the failover mechanism, and the block proposal algorithm). Work is ongoing with the client team to identify the cause and resolve the issues.
OUTAGE & REIMBURSEMENT
Reimbursement for the outage affecting 1 500 validators is currently under review on our side. We expect to finish the internal review by the middle of next week and will communicate here in the forums right after.
In the past 2 weeks we significantly extended the hardware used for operating Lido nodes, resulting in better performance and more resources for additional nodes.
As a final part to this thread, following on from the ongoing discussions about establishing policy around large incidents (Discussion - Treatment of Potentially Harmful Incidents), I propose that we proceed with a discussion and a vote (if deemed necessary) in order to bring this topic to a conclusion. I will present my thoughts using a loose framework for ad hoc consideration of the slashing event (linked below), but any thoughts, regardless of whether this same framework is used, are of course useful.
Peculiarity of node setup + lack of fallback node pair, which led to keys being moved for liveness following a node failure
A technical bug (i.e. the key manager saying it did something when it did not) providing false information / confirmation that keys had been deleted from the original node
Lack of additional preventive measures (e.g. keeping original node offline, or nuking it)
whether any best practices, infrastructure setups or configurations, common safety measures, or reasonable processes / mechanisms could have prevented the slashing from happening
Yes. A different type of setup could have prevented this (e.g. use of fallback node pairs, threshold signing); regardless of setup, additional measures (e.g. nuking the original host) would have prevented this from happening.
how quickly was the issue identified and by whom
Quickly but not immediately (i.e. within ~10-12 minutes), and not by the node operator (although there is evidence to suggest that it would have been caught eventually, i.e. relevant monitoring was in place but not optimal).
how quickly was the issue resolved
Impact was mitigated quickly once the issue was observed. Resolution, namely updating infrastructure and modifying processes to be more resilient, took place relatively quickly and in a transparent manner; see the weekly updates in this thread (an example).
Which “module” did this happen in; or, more broadly, what are the trust assumptions associated with the affected validators (e.g. are the validators unbonded)
The event happened in the curated operator module where Node Operators are entrusted with running bond-less validators.
impact of the event (in the case of slashing, financial impact can be small but damage to trust can be high, etc.; for example: how does the event affect the trust assumptions between stakers, the DAO, and the NO)
While the financial impact was relatively low (e.g. compared to daily rewards), trust assumptions in this module are very high (basically implicit) and stakers do not have any kind of assurances or recourse against these events (e.g. insurance, unless they opt in individually). As such, an event like a slashing, which the network treats as an attack, can be considered something that may need to reset trust assumptions.
extenuating circumstances (what are the pros/cons of this decision, and what other substantial things may need to be taken into account – e.g. does the NO somehow bring key value to the protocol?)
RockLogic GmbH is an NO that has been doing a lot in terms of client diversity (both on EL and CL (including validator client) side), infrastructure diversity/resilience (bare metal only as of most recent quarter), and is a very active participant in the Ethereum staking ecosystem, creating tools for stakers such as Stereum, Synclink, SlashingGuard, etc.
what other options there are for the node operator to participate
Currently none, as no other modules are live.
what is the status of remediation of the issue
As per the updates provided by RockLogic GmbH, the issue has been remediated.
The technical bug in the Prysm key manager has been patched.
They have moved to an infra model with explicit multiple EL+CL node pairs in fallback.
Their processes for key migration have been updated to explicitly nuke original nodes/hosts to prevent possible slashing.
All internal processes were reviewed, and some were made publicly available on the Lido research forums for the benefit of others.
is there a way to gauge likelihood of something like this happening again, and if so what is the assessed likelihood
Based on above remediation, if procedures are followed as per the updates implemented by the Operator, then likelihood is very low (note: qualification here doesn’t mean I have reason to believe they won’t be, just that it’s important to clarify that there is a non-ensurable element to the likelihood assessment).
how can remediation be assessed and, if it can, is the remediation deemed satisfactory
Through review of relevant documentation and code (e.g. in the case of technical bugs). With regard to execution, only assurance from the operator can be provided. In the future, things like tests of internal controls and certifications thereof (e.g. via SOC 2 reports) may add additional comfort; however, even this does not provide assurance of execution.
Warning (do nothing, with the condition that next time the consequence is one/any of the below)
Limit the Node Operator’s key count for a certain period of time
Decrease the Node Operator’s key count (by prioritizing those keys for exit)
Offboard the operator (with the ability to rejoin the permissioned set at a later time)
Offboard the operator (without the ability to rejoin the permissioned set at a later time)
Based on the above evaluation, and the described possible outcomes for slashings in the curated operator set, I believe the appropriate course of governance action in this case lies somewhere between “Do Nothing” and “Offboard the operator (with the ability to rejoin)”. “Do nothing” and “limit the node operator’s key count” have effectively already occurred, as this was part of the response following the incident. So the question at hand is what the next step should be (does the limit stay, is another option chosen, or is the limit lifted). At this juncture, I believe there should be at least rough consensus on how to move forward, and discussion and input from relevant stakeholders should take place before any specific action is taken (if at all). A vote isn’t strictly necessary for some of the outcomes, but may be useful for signaling nonetheless.
Given the lack of clarity around this topic up until now (for example, no already-laid-out consequences for slashings / serious incidents), the length of time since the slashing, the operator’s overall contribution to the ecosystem, and the successful and timely implementation of remedial actions, I personally would lean towards “Limit the Node Operator’s key count for a certain period of time”. Since the key count was already limited shortly after the slashing event (discussion, vote), and no issues have been noted with the NO’s validators apart from performance issues for which the operator has compensated stakers, I would propose that the NO’s key count rise together with the coming (pending DAO votes) Ethereum Wave 5 cohort, i.e. once new node operators are added to the set (which has a timeframe of ~mid-August to mid-September). When the new cohort reaches RockLogic’s level (new node operators are always prioritized for deposits), RockLogic would then also recommence adding validators to the validator pool.
Post has been updated to add links to Stereum, SyncLink, SlashingGuard, which were mistakenly left with the initial “add link” note rather than a fully functioning link.
Following the latest development concerning the slashing incident of April 13th 2023, and particularly the post of Izzy from Lido from July 27th 2023 we would like to take the opportunity to lay out our position on the matter.
As you probably know, eleven validators associated with our node were slashed on that day due to a bug affecting our operation. They have since been reimbursed, and we, with the support of the community, have taken considerable effort to prevent similar incidents from occurring in the future, both for us and for other NOs.
To this end, we have actively engaged in the forums over the past months, providing transparent explanations of our actions and their progress, and setting in motion a discussion about what we and others could generally improve when running a node. We value the feedback and support received during this period, which you can see here:
We also greatly appreciate the work of Lido’s Node Operator Management, and the efforts NOM has put forth to support us before, during, and after the incident. We also want to express our gratitude towards NOM, and Izzy in particular, for their positive appraisal of our remediation actions. We are happy to see that our efforts to repair the damage and to further the security of the Ethereum network have been fruitful and are very much appreciated by the community.
For everyone who does not know yet, we also want to mention that in addition to our role as a Lido NO, our team is actively involved in the Stereum project. Stereum contributes to the decentralization of the Ethereum ecosystem by providing an Ethereum node setup software which facilitates staking. We have successfully integrated various Lido services, including KAPI and Ejector, into Stereum. By doing so, we are actively working towards building a stronger, more decentralized Ethereum network. This goes beyond our responsibilities as a NO and contributes to building towards Lido’s own decentralized future.
What is more, we are one of the operators most intensively driving minority client usage (on both the consensus and execution side).
We are proud of these contributions, and again want to thank Izzy for explicitly mentioning the value they add to Lido’s operation.
Of course, we are not happy that, notwithstanding our best efforts, this slashing incident occurred - but we like to see it as a major investment in a better future: it created a precedent to which we all jointly had to react by setting up a common process for how to deal with incidents like these and the parties concerned.
We support the train of consequences proposed by Izzy in full, because we believe that open communication, collaboration, and a common set of rules are essential in addressing and resolving any kind of incident effectively. This includes crisis management - if needed - as an important part of the path to developing a resilient and trusted Ethereum environment.
What we are unhappy with, though, is the idea of continuing to limit the number of keys we can operate. For the contributions described above, we do need more keys to move forward with our development. We understand the position that some kind of punishment was in order, which was the immediate freezing of our key limit.
Thus, our current key limit was imposed on us as a direct response to the incident back in April - an understandable precautionary measure at the time, until things were fixed again. But now that they are, this key limit is starting to affect our work.
Generally, a key limit carries major disadvantages, as it means less decentralization of keys. It slows down the long-term development of Stereum and worsens security assumptions - and in our estimation all this affects not only us, but also Lido’s future growth and decentralization efforts, and thus the whole Ethereum community.
All this is to say that we would now like to take a major step forward, wrapping up the incident of April 13th 2023, and concentrate on what we do best: putting our hearts into the development of Stereum as a trusted contribution to Lido’s operation and to a strong Ethereum network and community.
We are fully aware of our responsibility, and we kindly request your support in the upcoming DAO vote to remove our key limit. We believe the initial key limit has served its purpose: we have already faced its consequences, and a significant period has passed since the incident occurred.
Having continuously demonstrated our commitment to ensuring security and reliability for Lido stakers and the Ethereum network, we are willing to keep doing so in the future - and we now need your support to proceed!
@Izzy, thanks a lot for putting everything together. Really appreciated!
I would suggest moving the decision with consequence options (proposing these three below) to the DAO vote, as I think it may be hard to find a consensus in forum discussion:
either release the Node Operator from the current limit (5 800 keys) as soon as possible after the Snapshot vote ends (I won’t say immediately, as an Easy Track limit increase requires at least 72 hours);
or release the Node Operator from the current limit when new Node Operators from Wave 5 reach the level of 5 800 keys;
or keep the current limit (5 800 keys) the Node Operator has.
I added the last option because without it the decision wouldn’t really be democratic, since only “yes, now” and “yes, but later” options would be offered. We need to take into account that some voters would like to keep the limit in place. If they don’t have an option to say “no”, they will just ignore the vote, which I think is not good for governance as a whole.
I do understand that handling everything via DAO vote doesn’t scale well (and hopefully we won’t need to resolve such incidents this way in the future), but this time it gives us an opportunity to bring this story to a close.
Also, I believe it would be nice to define some kind of flow for how we wrap up such incidents (what should be done with keys: limit, limit for some time, etc.), because an ad-hoc approach seems a bit controversial to me. If it fits into the “guardrails” for the treatment of potentially harmful incidents you mentioned here, that would be great.
I would suggest moving the decision with consequence options (proposing these three below) to the DAO vote, as I think it may be hard to find a consensus in forum discussion:
I agree. Perhaps the vote itself doesn’t need to necessarily “pass”, but it would at least be useful to get signaling about general community sentiment, since it seems there are no other “directions / options” being brought up at the moment and we should progress the discussion.
I am thinking that the vote would have the options:
Proceed with increasing key limit later (with Wave 5 cohort)
Proceed with increasing key limit at NO’s earliest convenience
None of the above (in effect this means keep limit where it is for now, implication being that either more discussion / analysis is needed or perhaps a more conservative consequence).
What do you think about something like that?
Also agree with this. I’m working on a guardrails approach, but it’s slow going trying to find the right balance between being too prescriptive and being too vague / abstract; I hope to have something to share in the next week or so!