Hello everyone! Last week one of the Lido on Ethereum node operators experienced the slashing of 20 validators. Although the node operator reacted ASAP, bringing the validators offline to mitigate further risk, the initial slashing penalty was 20 ETH (1 ETH per slashed validator). As per the post-mortem, the total sum of projected penalties and missed rewards across all impacted validators is ~29 ETH. The final amount of the loss will become clear once all the validators have exited the network (November 17 according to beaconcha.in).
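For context on where the "1 ETH per validator" figure comes from, here is a rough back-of-the-envelope sketch using the post-Bellatrix consensus-spec constants. The remaining gap up to the projected ~29 ETH would come mainly from missed duties and attestation penalties until the validators become withdrawable (~36 days after slashing), which this sketch deliberately does not model.

```python
# Rough estimate of the immediate slashing penalty for N validators
# (post-Bellatrix mainnet constants; missed rewards until exit are NOT modelled here).

GWEI_PER_ETH = 10**9
MAX_EFFECTIVE_BALANCE_GWEI = 32 * GWEI_PER_ETH
MIN_SLASHING_PENALTY_QUOTIENT_BELLATRIX = 32  # initial penalty = effective balance / 32

def initial_slashing_penalty_eth(num_validators: int,
                                 effective_balance_gwei: int = MAX_EFFECTIVE_BALANCE_GWEI) -> float:
    """Initial penalty applied at the slashing epoch, summed over all slashed validators."""
    per_validator = effective_balance_gwei // MIN_SLASHING_PENALTY_QUOTIENT_BELLATRIX
    return num_validators * per_validator / GWEI_PER_ETH

if __name__ == "__main__":
    # 20 validators slashed -> 20 * (32 ETH / 32) = 20 ETH of immediate penalties
    print(initial_slashing_penalty_eth(20))  # 20.0
```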
Taking into account that Launchnodes joined Lido during Wave 5 Stage 1, and that we have 7 more new operators who joined during Stage 2, I would suggest that the DAO consider:
1. Extend the probation period for Wave 5 node operators and all subsequent onboarding waves, both on the testnet (from 2 weeks to at least 1 month) and on mainnet (from 2 weeks to an optimal 2-3 months), as slashing appears to be costly for the protocol.
2. Create a checklist and set of best practices for working with nodes, Web3signer, etc. (the things that usually cause slashing on Ethereum), or work more closely with both existing and new operators to ensure everything goes smoothly. Just as a reminder, this is the 2nd time this year that Lido node operators have experienced slashing. The 1st incident happened earlier this year, on April 13, when 11 RockLogic validators were slashed due to the duplication of validator keys in two different active clusters.
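As a rough, hypothetical illustration of the duplicate-key failure mode mentioned in point 2: a sketch that scans the keystore directories of two clusters and flags any validator pubkey present in more than one of them. Paths and directory layout are assumptions, not a reference to any operator's actual setup.

```python
# Hypothetical sketch: detect validator keys that appear in more than one cluster's
# keystore directory, the classic setup that leads to double-signing slashings.
import json
from pathlib import Path
from collections import defaultdict

def collect_pubkeys(keystore_dir: str) -> set[str]:
    """Read EIP-2335 keystore JSON files and return the set of validator pubkeys."""
    pubkeys = set()
    for path in Path(keystore_dir).glob("*.json"):
        with open(path) as f:
            keystore = json.load(f)
        # EIP-2335 keystores carry the BLS pubkey in the top-level "pubkey" field
        if "pubkey" in keystore:
            pubkeys.add(keystore["pubkey"].lower())
    return pubkeys

def find_duplicates(cluster_dirs: dict[str, str]) -> dict[str, list[str]]:
    """Map each pubkey to the clusters it appears in; return only the duplicated ones."""
    seen = defaultdict(list)
    for cluster_name, directory in cluster_dirs.items():
        for pubkey in collect_pubkeys(directory):
            seen[pubkey].append(cluster_name)
    return {pk: clusters for pk, clusters in seen.items() if len(clusters) > 1}

if __name__ == "__main__":
    # Example paths are placeholders
    duplicates = find_duplicates({
        "cluster-a": "/etc/validators/cluster-a/keystores",
        "cluster-b": "/etc/validators/cluster-b/keystores",
    })
    for pubkey, clusters in duplicates.items():
        print(f"DUPLICATE KEY {pubkey} active in: {', '.join(clusters)}")
```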
I believe these measures (and those that will be suggested below in the discussion) may help to minimize slashings, and thus losses, in the future for both existing operators and newcomers. Such incidents have not only a monetary effect on stakers, node operators and the protocol itself, but also an effect on Lido's reputation, which is under the microscope of the broader community these days.
Would be happy to hear the thoughts of the Lido DAO team and everyone who has ideas on how to work through the challenges we have faced with slashing.
Hey!
As a new operator from Wave 5 Stage 2 - RockawayX Infra, I want to express my view on this.
I agree that slashing is bad PR for Lido, and we should go the extra mile to mitigate it. You propose an extended probation period. How much would it actually reduce these risks? Judging from the Launchnodes post-mortem, this would only help a little, since the issue originated in a DC outage, which happens exceptionally rarely, and a three-month period is too short for this scenario to occur. It would mean having at least a year-long period, which is nonsense. Also, with 100 validators you cannot check the infra at the correct scale. The current validator ramp-up process is sufficient in terms of gradual load increase.
The checklist is a good idea. It would be great to have another pair of eyes go through the infrastructure setup, especially the parts covering double-sign protection lines of defense and outage handling scenarios.
So, I propose to create a small working group within the NOs to go through the infra setup with newcomers.
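To make the double-sign-protection and outage-handling point above a bit more concrete, here is a hypothetical sketch (not any operator's actual procedure) of a failover guard: before starting a standby signer, it parses the EIP-3076 slashing-protection interchange file exported from the primary and refuses to proceed until the chain is safely past the last signed epoch. The safety margin and file path are assumptions.

```python
# Hypothetical failover guard: refuse to start a standby validator/signer until the
# current epoch is safely past everything the primary has already signed, based on an
# EIP-3076 slashing-protection interchange export.
import json
import time

SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
MAINNET_GENESIS_TIME = 1606824023  # mainnet beacon chain genesis (Dec 1, 2020)
SAFETY_MARGIN_EPOCHS = 2           # assumption: wait at least 2 full epochs past the last signing

def current_epoch(genesis_time: int = MAINNET_GENESIS_TIME) -> int:
    return int(time.time() - genesis_time) // (SECONDS_PER_SLOT * SLOTS_PER_EPOCH)

def last_signed_epoch(interchange_path: str) -> int:
    """Highest attestation target epoch / block epoch found in the EIP-3076 export."""
    with open(interchange_path) as f:
        interchange = json.load(f)
    highest = 0
    for record in interchange.get("data", []):
        for att in record.get("signed_attestations", []):
            highest = max(highest, int(att["target_epoch"]))
        for block in record.get("signed_blocks", []):
            highest = max(highest, int(block["slot"]) // SLOTS_PER_EPOCH)
    return highest

def safe_to_start_standby(interchange_path: str) -> bool:
    return current_epoch() >= last_signed_epoch(interchange_path) + SAFETY_MARGIN_EPOCHS

if __name__ == "__main__":
    # Path is a placeholder for wherever the primary's export is shipped during failover
    if safe_to_start_standby("/var/backups/slashing-protection-interchange.json"):
        print("OK to start standby signer (import the interchange file first).")
    else:
        print("NOT safe yet: primary may still have duties in flight.")
```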
Hey there, thanks for taking the time to make a post expressing your view.
If the period for ramping up stake had been further delayed, it might have made a difference for the most recent slashing, but there isn’t direct causality; it really only makes sense looking backwards imo.
i.e. this sort of failure could have happened 6-8 weeks after onboarding
Even if we perfected aggregating and displaying all of the ETH nodeops & devops automation knowledge on the internet and proactively circulating it through various channels, I’m not sure it would necessarily have mitigated how the most recent slashing occurred.
I do believe what is extremely notable is that the level of ownership, rapid reimbursement, and professionalism really shows the NO’s willingness to continue growing alongside Lido.
I appreciate this post and I think it raises a lot of good points.
I would agree here with EmiT above that I don’t believe that extending the probation period effectively reduces the likelihood of such events. It would also considerably delay and hamper the stake re-distribution mechanism which arises from the prioritization of exits from the top and deposits to the bottom, which I think is a huge benefit of new onboardings.
I also think that, given that in both cases of slashing stakers were made whole relatively quickly (and in the Launchnodes case basically suffered no reduced rewards at all, since it happened within the rebase cycle), the point around minimization of financial impact isn’t a burning one, because there essentially wasn’t any. I agree, however, that slashings also carry a reputational and brand risk which is unique, and that should probably be taken into account when determining things like possible end-state stake (re-)allocation mechanisms. There’s also certainly room to explore improvements in impact mitigation mechanisms, for example things like (semi-)automated compensation or cover mechanisms. These would rely on robust cross-layer slashing detection via oracles and then on calculations of total rewards lost, for which there aren’t ideal mechanisms currently, but it’s certainly something to consider in the future, especially as new modules can allow for more complex and explicit “impact recovery” mechanisms.
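As a small, hypothetical illustration of the "slashing detection" building block mentioned above (not a description of Lido's actual oracles): a sketch that polls a standard Beacon Node API for the slashed flag of a set of validators. The beacon endpoint URL and validator indices are placeholders.

```python
# Hypothetical sketch: detect slashed validators by polling the standard Beacon Node API
# (GET /eth/v1/beacon/states/{state_id}/validators/{validator_id}), whose response
# includes a boolean `validator.slashed` field.
import json
import urllib.request

BEACON_NODE = "http://localhost:5052"  # placeholder; any consensus client's REST API

def is_slashed(validator_index: int) -> bool:
    url = f"{BEACON_NODE}/eth/v1/beacon/states/head/validators/{validator_index}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return bool(payload["data"]["validator"]["slashed"])

def detect_slashings(validator_indices: list[int]) -> list[int]:
    """Return the subset of monitored validator indices the beacon state marks as slashed."""
    return [idx for idx in validator_indices if is_slashed(idx)]

if __name__ == "__main__":
    monitored = [100000, 100001, 100002]  # placeholder indices
    slashed = detect_slashings(monitored)
    if slashed:
        print(f"ALERT: slashed validators detected: {slashed}")
```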
I do agree that, given the level of scale and responsibility that comes with running this many validators, there can be more work to be done in terms of creating a collaborative and NO-community enforced professional excellence culture as described in this post, that could extend to things like “hypercare” periods following new onboardings.
That said, I do not think it’s the job of the DAO (or of DAO contributors) to create and “enforce” internal best practices on Node Operators, especially those in the set of curated professionals (especially because it’s not really feasible to do so). That doesn’t mean something like this shouldn’t exist, I believe it should, but it should be a bottom-up effort vs top-down, and self-enforced by the community. The difficulty around this is finding (a) a good way to do it, (b) the resources to devote to such an effort and the organization of them, and (c) someone to manage this process. I do think the DAO has an opportunity here to at the very least research and fund a workgroup that can try to tackle this, and some contributors have been in discussions with third-party consultants to explore the possibility of doing this via LEGO.
@r0b.eth Many thanks for raising this topic, and sharing your thoughts on how Lido can be made even more resilient. As you might imagine, we have been having a very similar discussion internally.
The Launchnodes team wanted to reflect for a couple of days before commenting, so we could consider the suggestions and feedback more fully. We recognise that this close to what happened last week, ours may not be the most objective view, but I wanted to share our perspective and thoughts on this topic.
On point 1. and extending the probation period, we would agree with previous comments from MF_DROO and EmiT87 in particular. The slashing incident affecting 20 of our nodes did surface as a result of a Data Center connectivity failure. As has been said, this might have happened after many weeks or months, rather than during a longer probation period. The other aspect here is that changes will always continue to take place to an NO’s staking infrastructure, architecture, software and team over time, which might introduce issues that again would often not be flagged during an extended testing period. Rocklogic, for example, was onboarded as a Lido Operator in June 2022 and experienced slashing almost a year later.
Your suggestion in point 2. of more guidance, checklists and best practices is a great one, and early Lido Network Operators and experienced Lido DAO contributors have often offered their time and expertise to new Waves of Operators. We are very supportive of the discussion and suggestions that Izzy refers to in his reply to this topic. The approach of a community-driven professional excellence culture, that underpins the operational, geographic and client diversity of NOs would be very welcome, and we of course would be happy to support this initiative.
The significant overall impact of slashing is well known and understood. The causes of slashing are equally well known and understood. Fundamentally we sometimes overcomplicate our staking infrastructure and processes, especially when trying to maximise node availability. Being reminded of that on a regular basis - internally or by friendly voices in the community - is never a bad thing.
Just want to point out that item 2 is being addressed in the D.U.C.K initiative, recently approved for funding by LEGO. It will be a knowledge base of risks, mitigating measures and overall best practices collected from the operators in the Curated set. Really a cool initiative, will be interesting to see the initial release.