Discussion - Treatment of Potentially Harmful Incidents

The slashing incident of April 13, 2023 has got us all thinking about its causes and how to remedy them - and perhaps even more importantly, about how to prevent incidents like this from happening to Ethereum and its contributors in the future.

Whether you call this particular incident a mistake, a reluctant reaction, human error, system failure or anything else, there will always be minor and major incidents that are unforeseeable; we need a proper strategy to deal with these.

Since mistakes provide a valuable opportunity to learn, grow and make things better, we want to start an open discussion within the Lido DAO community on whether or not we should establish a jointly accepted, well-proven and commonly known procedure for how to treat potentially harmful incidents. And if so, what such a procedure should look like.

The slashing incident that occurred to RockLogic GmbH in April 2023 and triggered this initiative is one amongst a greater variety of incidents that have happened before across the community. As we understand it, all these incidents have been handled professionally by those involved - but a common procedure has been distinctly lacking.

Which raises these questions:

  • Would it be better to jointly develop a structured, universal procedure for how to treat things in case of emergency?
  • Should we have a common “recipe” for how to deal with dangerous incidents? Should we develop open guides or emergency protocols that can be easily executed BEFORE disaster strikes?
  • Should we adapt procedures, generalize them, and make them mandatory?
  • Is there more we should think of, beyond monitoring & alerting, automated tests, doppelgänger protection, node configuration changes, key handling, release precautions & prior testing, …?

In short, should we have a universal emergency plan?

Think of it as a fire extinguisher: it will not help you with every problem that might occur on the way - but if something happens, it is easily accessible, and you know how to use it. It will kill minor flames in an instant and help mitigate, or in the best case even prevent, large fires.

We are keen to hear your opinions on the subject, and hope this post kicks off a healthy and fruitful discussion on the matter!

10 Likes

So my own two cents on this: I would be reluctant to draw a line in the sand here and hold a vote saying “the DAO thinks you should operate this way and not that way”.

The curated node operator set is expected to be better at this than most DAO members - I mean, some of these are client development teams, and others are also super professional.

The art of blockchain (and Ethereum) ops has stabilized a lot lately but is still decently green. I do not think that a strict, one-size-fits-all operational manual would be a good thing atm.

Having a self-regulation codex of good practices and voluntary, self-imposed standards (ethical, organizational, technical) could be a good start here.

12 Likes

I cannot speak to what node operators should do broadly, but I can speak to the guidelines we use.

  • “Blast radius” contained to 1,000 keys per “environment” - VC, remote signer if in use, CL:ELs
  • Keys don’t move. Resilience comes from having multiple CL:EL, resilient remote signers, and VC in k8s (or any container orchestration)
  • Maintenance of any one node does not remove failover resilience: In practice this means three CL:EL, three k8s nodes, and if in use, three or more Dirk instances, per environment of 1,000 keys
  • Avoid supermajority clients (only Geth, currently), as they carry a low-probability, catastrophic-impact risk of stranding the validators on a non-canonical chain and getting them leaked down to 16 ETH or below.
  • Embrace client diversity: The three CL:EL in an environment use two or three different client combinations. This way, a bug in any one client does not cause prolonged downtime.
  • Embrace geographic diversity: Nodes within an environment are in different regions within the same broader geographic region (APAC, EMEA, AMERS); place environments in different broad geographic regions, which serves to spread keys out around the globe.
  • Alerts in Lido Discord re validator performance are to be treated as real, because they are. Combat alert fatigue by finding root cause for alerts and improving things.
  • IaC is your friend; metrics are your friend

That note about alerts aims at running well, in addition to running reliably. Looking at the alerts, one can get the impression that there’s a temptation to give up - “that thing squawks all the time”. I’d encourage fellow NOs not to give up. That alert mechanism is working well; those alerts are real and actionable.
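
For illustration, here is a minimal sketch of how the environment guardrails above could be checked automatically against an inventory. The Environment/Node structures, field names, and thresholds are assumptions made for this example; they are not part of any Lido or NO tooling.

```python
from dataclasses import dataclass

# Illustrative thresholds, mirroring the guidelines above (assumptions for this sketch).
MAX_KEYS_PER_ENV = 1000   # "blast radius" per environment
MIN_NODES_PER_ENV = 3     # maintenance on one node must not remove failover resilience
MIN_CLIENT_COMBOS = 2     # at least two distinct CL:EL combinations per environment

@dataclass
class Node:
    cl_client: str   # e.g. "lighthouse", "teku"
    el_client: str   # e.g. "nethermind", "besu"
    region: str      # e.g. "eu-central", "eu-north"

@dataclass
class Environment:
    name: str
    key_count: int
    nodes: list[Node]

def check_environment(env: Environment) -> list[str]:
    """Return a list of guardrail violations for one environment."""
    problems = []
    if env.key_count > MAX_KEYS_PER_ENV:
        problems.append(f"{env.name}: {env.key_count} keys exceeds blast radius of {MAX_KEYS_PER_ENV}")
    if len(env.nodes) < MIN_NODES_PER_ENV:
        problems.append(f"{env.name}: only {len(env.nodes)} CL:EL nodes, need {MIN_NODES_PER_ENV} for failover during maintenance")
    combos = {(n.cl_client, n.el_client) for n in env.nodes}
    if len(combos) < MIN_CLIENT_COMBOS:
        problems.append(f"{env.name}: not enough distinct client combos; a single client bug causes prolonged downtime")
    if len({n.region for n in env.nodes}) < len(env.nodes):
        problems.append(f"{env.name}: nodes share a region; guidelines call for different regions within the environment")
    return problems

if __name__ == "__main__":
    env = Environment(
        name="emea-1",
        key_count=1000,
        nodes=[
            Node("lighthouse", "nethermind", "eu-central"),
            Node("teku", "besu", "eu-west"),
            Node("lighthouse", "besu", "eu-north"),
        ],
    )
    for problem in check_environment(env):
        print("WARN:", problem)
```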

7 Likes

I think I can’t edit a post, and realize I should have started with the kinds of harmful events these guidelines seek to minimize the risk of.

  • Slashing. This risk should be so close to zero as to essentially be zero, short of a catastrophic bug in a slashing protection DB, which has never been seen in any client. That’s why keys don’t move. Don’t give humans a chance to make a mistake.
  • Keys getting “stranded” on the wrong, finalizing chain. This risk can be reduced to near-zero by avoiding supermajority clients. See also Upgrading Ethereum
  • Prolonged offline penalties. This risk can be reduced by good and actionable alerting, as well as client diversity.
  • Sub-optimal rewards. This risk can be reduced by acting on on-chain alerts for sync committee participation, missed attestations, missed block proposals, and so on.
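
As one small illustration of “good and actionable alerting” for the last two risks, here is a minimal sketch that watches a validator’s balance via the standard Beacon API (assuming a beacon node reachable at http://localhost:5052 and a hypothetical validator index); a balance that keeps dropping across epochs is a cheap proxy for offline penalties or sustained missed duties. This is not the Lido alerting stack, just an example of the kind of check that could feed one.

```python
import time
import requests

BEACON_API = "http://localhost:5052"  # assumption: a locally reachable beacon node
VALIDATOR_INDEX = "123456"            # hypothetical validator index
SECONDS_PER_EPOCH = 32 * 12           # 32 slots x 12 seconds on mainnet

def get_balance_gwei(index: str) -> int:
    # Standard Beacon API: /eth/v1/beacon/states/{state_id}/validators/{validator_id}
    resp = requests.get(
        f"{BEACON_API}/eth/v1/beacon/states/head/validators/{index}", timeout=10
    )
    resp.raise_for_status()
    return int(resp.json()["data"]["balance"])

def watch(index: str, epochs: int = 3) -> None:
    """Print an alert if the balance decreases for `epochs` consecutive epochs."""
    previous = get_balance_gwei(index)
    drops = 0
    while True:
        time.sleep(SECONDS_PER_EPOCH)
        current = get_balance_gwei(index)
        drops = drops + 1 if current < previous else 0
        if drops >= epochs:
            print(f"ALERT: validator {index} balance fell for {drops} epochs in a row")
        previous = current

if __name__ == "__main__":
    watch(VALIDATOR_INDEX)
```
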
6 Likes

Thanks Stefan for kickstarting the discussion! One of the great things about decentralized protocols and DAOs is the many different ways you can approach a challenge or set of problems. I think the guiding questions you’ve posed are certainly in the right direction, but I perhaps disagree a bit in terms of follow-through.

I would like to try to lay some groundwork in terms of how I think of the protocol:

  • The Lido on Ethereum protocol is essentially software / middleware. It doesn’t have agency and can only do things which it is programmed to do, either as a reaction to things that happen to/from it, or through interactions with it by third parties.
  • The DAO, to the extent that it can action things (via votes), should attempt to be a guardian of (a) the protocol (Lido on Ethereum), (b) its users, and (c) the underlying network. It’s essentially a utility-maximization vs. risk-minimization exercise.
    • Because Ethereum is stronger when it’s robust, this means things like fostering decentralization, diversity, and variety, versus picking just what is “most profitable” and focusing on that.
    • As the protocol matures, the things which the DAO can do and its total power over the protocol should decrease, and its role in the routine functioning of the protocol should basically end up somewhere between “none at all” and “in case of emergency, break glass”. There are technical limitations (e.g. the lack of withdrawals until recently, the inability to consume beacon state directly in the EL, etc.) which substantially affect the potential and practical mechanisms for making the protocol fully autonomous, but DAO involvement should be minimized at every step of the way based on context, protocol design decisions, and technical environment limitations.

What does this mean for NOs and other entities interacting with the protocol and the DAO? I agree that it makes sense to have a more structured approach for how the NOs interacting with the protocol may deal with emergencies / incidents; however, as far as the DAO is concerned, I think the approach should be more of something like a policy rather than a procedure. Basically: it should outline what the goals are, what the expected results are, and what happens if the expected results are not achieved. Node Operators can then be encouraged to do their own thing, in their own way, within the set of objectives that aim for a robust validator set for the protocol and Ethereum. Ideally, NOs collaborate and coordinate on developing these operational practices and they (and the wider community) hold each other accountable, vs needing the DAO to do it.

  • I don’t think the DAO should enforce or be involved in the reviewing/checking of specific practices (e.g. via workstream contributors). I think the community of NOs can certainly do so, and there can be a culture of sharing information, tools / processes, and self-reporting relevant information, and the DAO could support this effort (e.g. via grants, open source tooling, collaborations with 3rd parties to offer things like security and process assurance frameworks and audits). But I don’t think it’s the DAO’s job to be an arbiter or an enforcer of “these are the right procedures” and “do you have the right procedures set up”. There are a few reasons for this:
    • For the protocol to truly be permissionless, NOs should be able to interact with the protocol without requiring this kind of effort by the DAO or DAO contributors
    • The protocol’s design should be built to minimize risk via various mechanisms (e.g. selection of professional operators, bonding, scoring, etc). As the protocol matures, and more options are available, the manner in which NOs can interact with the protocol will widen, the number of NOs who interact with it will increase drastically, and it is not practical for the DAO to enforce anything that isn’t results-based.
    • Establishing and requiring conformity amongst practices can lead to correlated failure and doesn’t allow for NO “opinionatedness” in certain things where an NO may do things differently for a variety of reasons (and doing differently is fine, provided that it’s done correctly).
  • Instead, assessment of these things can be largely results-based, kind of like the approaches followed in the proposed Validator Exits Policy and the ratified Block Proposer Reward Policy. The latter does entail a measure of “this is how you should do it” given the large set of technical and security considerations, but the policy itself stipulates that other mechanisms could be utilized by NOs provided they are discussed, assessed, and tested first.
  • NOs can and should be encouraged to co-develop these detailed operational and security processes / practices / standards and self-regulate.
  • The DAO then only really needs to worry about getting involved in edge cases where, for example, self-regulation has failed, the protocol has been endangered, serious outages / incidents may have caused doubt with regards to the continued performant or sustainable operations of an NO, “real world” (meatspace) events which the protocol cannot address necessitate action, or the trust between participants in the protocol has been damaged.
8 Likes

To revive this thread a little… I will work with the NOM workstream to try to put together a discussion document for interested DAO stakeholders & NOs that roughly outlines what a “guardrails” approach to the treatment of potentially harmful incidents (or things like sustained malperformance) could look like, in order to advance discussion on having a more concrete approach. I will aim to have this draft ready the week of July 3rd.

I would love to see more initiative/collaboration amongst NOs for something like this. I’m not sure what the best way to foster this is (grants? putting everyone in a (virtual) room to workshop it out?), but I’m open to suggestions.

Because the above might take some time and there is a question of the actual recent slashing, I think the DAO should have a more concrete conversation about how to treat these specific types of incidents and what we should do in this specific case. It’s been quite a while since the event, and all stakeholders would benefit from clarity.

I think slashings should be (and have been) rare enough that decisions can be made ad hoc, but would benefit from a general framework for them. With that in mind, my thinking is below (trying to establish a set of static constraints, but with enough allowance for context). IMO it’s incredibly important that we get input from as many parties as possible here (especially NOs), so the below is just my 0.02 to serve as a starting point.

Factors around the incident itself that might be considered:

  • was the act malicious or not
  • the proximate cause of the event
  • whether any best practices, infrastructure setups or configurations, common safety measures, or reasonable processes / mechanisms could have prevented the slashing from happening
  • how quickly was the issue identified and by whom
  • how quickly was the issue resolved

Consequences could depend on:

  • Which “module” did this happen in; or, more broadly, what are the trust assumptions associated with the affected validators (e.g. are the validators unbonded)
  • impact of the event (in the case of slashing, the financial impact can be small but the damage to trust can be high; for example, how does the event affect the trust assumptions between stakers, the DAO, and the NO)
  • extenuating circumstances (what are the pros/cons of this decision, and what other substantial things may need to be taken into account – e.g. does the NO somehow bring key value to the protocol?)
  • what other options there are for the node operator to participate
  • what is the status of remediation of the issue
  • is there a way to gauge likelihood of something like this happening again, and if so what is the assessed likelihood
  • how can remediation be assessed and if it can, is the remediation deemed satisfactory

For example, in the current “curated operator set” on Lido on Ethereum, the options may be:

  • Do nothing
  • Warning (do nothing w/ the condition that the next time the consequence is one/any of the below)
  • Limit the Node Operator’s key count for a certain period of time
  • Decrease the Node Operator’s key count (by prioritizing those keys for exit)
  • Offboard the operator (with the ability to rejoin the permissioned set at a later time)
  • Offboard the operator (without the ability to rejoin the permissioned set at a later time)
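
To make the “static constraints with allowance for context” idea a bit more tangible, here is a minimal sketch of how the factors and consequence options above could be written down as a shared data structure. None of this is ratified policy; the names and the crude decision logic are placeholders for discussion only.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Consequence(Enum):
    # The options listed above for the curated operator set.
    DO_NOTHING = auto()
    WARNING = auto()
    LIMIT_KEY_COUNT = auto()
    DECREASE_KEY_COUNT = auto()
    OFFBOARD_CAN_REJOIN = auto()
    OFFBOARD_PERMANENT = auto()

@dataclass
class IncidentAssessment:
    # Factors around the incident itself.
    malicious: bool
    preventable_by_best_practice: bool
    hours_to_identify: float
    hours_to_resolve: float
    # Factors around consequences.
    validators_bonded: bool
    remediation_satisfactory: bool
    repeat_offense: bool

def suggest_consequence(a: IncidentAssessment) -> Consequence:
    """A deliberately crude starting point; real decisions stay with the DAO and remain contextual."""
    if a.malicious:
        return Consequence.OFFBOARD_PERMANENT
    if a.repeat_offense and not a.remediation_satisfactory:
        return Consequence.OFFBOARD_CAN_REJOIN
    if a.preventable_by_best_practice and not a.remediation_satisfactory:
        return Consequence.LIMIT_KEY_COUNT
    if a.preventable_by_best_practice:
        return Consequence.WARNING
    return Consequence.DO_NOTHING
```
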
6 Likes

This is a very wide topic, so I think a series of workshops would be in order to outline policies and recommendations.

Since all NOPs are different, I don’t think there should be any mandatory way of operating Lido infrastructure, in detail. However, I do believe there are specific high-level requirements that NOPs should adhere to and acknowledge to the DAO that they are “compliant.”

As you say, “guardrails” for how we operate. From client diversity, key management, redundancy (keys/clients/underlying infra), monitoring (slashings/missed att/props), etc.

Although there are some obvious reasons why incidents would happen, there will most likely always be context and new scenarios that are hard to predict. Therefore I think incidents should be considered ad-hoc. That being said, having the processes for handling them is a must.

7 Likes

I agree with the above. General stated guidelines like “blast radius” isolation definitely make sense.

We have quite a big advantage here - working with the curated node operator set, where all the teams have a lot of experience in the validation space.

I would imagine that following the guidelines as we set them will probably not affect the infra/processes much, as most of the node operators are already working within them, or at least very close to them.

3 Likes

I don’t have a lot to add here; I think I agree with most points Izzy outlined or node operators provided. One thing I wanted to add is that it’d be a lot better if the framework for the guardrails came from node operators themselves, the reason being that my expectation is that node operators should be way better at this than Lido contributors.

3 Likes

I think where Node Operators (NOs) can contribute the most is in the area of best practices. As @Thorsten_Behrens has pointed out, keys should never be moved, as moving them carries a high risk. There may be situations where there is no choice but to move keys, and in these cases a standard procedure could be defined for NOs that want to follow it. In general, it could be something like this:

  • Notify the Lido team.
  • Reason for moving the keys.
  • Detailed planning of the low-level migration process.
  • Rollback plan.
  • Risk mitigation plan.

The plan could even be shared with the other NOs in order to collectively analyze and identify any potential risks.
This is just one example of a high-risk action where NOs could reduce their risk to near zero.
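
One concrete piece of the “detailed planning of the low-level migration process” could be a pre-flight check on the exported slashing protection data. The sketch below assumes the keys’ signing history has been exported in the EIP-3076 interchange format (which most validator clients support) and simply verifies that every key scheduled to move actually appears in the export, with some attestation history. It is an illustration only, not a complete migration procedure; doppelgänger protection and a deliberate downtime window are still needed on top.

```python
import json
import sys

def check_interchange(export_path: str, keys_to_move: set[str]) -> bool:
    """Verify an EIP-3076 slashing-protection export covers every key being migrated."""
    with open(export_path) as f:
        interchange = json.load(f)

    exported = {entry["pubkey"].lower(): entry for entry in interchange["data"]}
    ok = True
    for pubkey in keys_to_move:
        entry = exported.get(pubkey.lower())
        if entry is None:
            print(f"MISSING: {pubkey} not in slashing-protection export - do not migrate")
            ok = False
            continue
        attestations = entry.get("signed_attestations", [])
        if not attestations:
            print(f"WARNING: {pubkey} has no attestation history in the export")
            continue
        highest_target = max(int(a["target_epoch"]) for a in attestations)
        print(f"OK: {pubkey} highest signed target epoch {highest_target}")
    return ok

if __name__ == "__main__":
    # Hypothetical usage: python check_interchange.py export.json keys.txt
    export_file, keys_file = sys.argv[1], sys.argv[2]
    with open(keys_file) as f:
        keys = {line.strip() for line in f if line.strip()}
    sys.exit(0 if check_interchange(export_file, keys) else 1)
```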

Other risks that we could face are errors or external problems (network failures, hardware failures, etc.).
As far as bugs on the NO side are concerned, having a test environment where you can test things before releasing them to mainnet, and not rushing into upgrades unless specifically advised to do so, can prevent a lot of problems. Also, maintaining good communication with the development teams will benefit all of us. This could also be standardized.

As for external failures, implementing several redundant solutions that avoid a single point of failure is certainly the most appropriate approach. This aspect may be the most complicated, as the internal configuration of each NO has so far been quite private. I think that revealing what we use, but not how we implement it, is certainly something that will benefit everyone.
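
As a small example of the kind of redundancy check this implies (without revealing anything about a specific setup), the sketch below polls several redundant beacon nodes via the standard /eth/v1/node/syncing endpoint and flags when fewer than two healthy nodes remain. The endpoint URLs and threshold are placeholders, not a description of any NO’s infrastructure.

```python
import requests

# Hypothetical redundant beacon node endpoints; replace with your own.
BEACON_NODES = [
    "http://beacon-a.internal:5052",
    "http://beacon-b.internal:5052",
    "http://beacon-c.internal:5052",
]
MIN_HEALTHY = 2  # below this, one further failure becomes a single point of failure

def is_healthy(url: str) -> bool:
    """A node counts as healthy if it answers the standard syncing endpoint and is not syncing."""
    try:
        resp = requests.get(f"{url}/eth/v1/node/syncing", timeout=5)
        resp.raise_for_status()
        return not resp.json()["data"]["is_syncing"]
    except requests.RequestException:
        return False

def main() -> None:
    healthy = [url for url in BEACON_NODES if is_healthy(url)]
    print(f"{len(healthy)}/{len(BEACON_NODES)} beacon nodes healthy")
    if len(healthy) < MIN_HEALTHY:
        print("ALERT: redundancy below threshold - failover resilience is gone")

if __name__ == "__main__":
    main()
```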

Regarding the factors around the incident, I think you have covered everything that can be considered. The problem is how to measure them. Should each NO provide a rating for each of these points, to present the average opinion of all the NOs and help the DAO evaluate the technical aspects?

4 Likes

I think that revealing what we use, but not how we implement it, is certainly something that will benefit everyone.

I’d love to understand the concern about “opening the kimono”, as it were. We’re pretty far out there with transparency: we’ve FOSS’d all of our deployment tooling under Apache v2, and we’re open about which clients we run and which broad geographic regions (AMERS, EMEA, APAC) we run in. I understand that’s not for everyone. What I don’t understand yet is why it’s not for everyone - what the risk assessment is.

2 Likes

As you know, we also have a similar philosophy, but we don’t give out all the information available to us. E.g. we don’t publish security-related configurations/improvements/changes to firewalls/network configs/OS updates/etc. There are different philosophies coming together within Lido. Diversity makes us strong.

Also, open source doesn’t guarantee a high level of security. A famous example was the OpenSSL Heartbleed bug.

1 Like

I should clarify that I am looking to understand, I am not looking to shoot holes into things. Keeping the details private is legitimate.

I am curious about the risk modeling because that’s something I used to do once.

“We just don’t like to give that information out, unknown unknowns” is a legitimate answer.

“We have this specific scenario in which an attacker can use that information to do this thing” is a more interesting answer.

2 Likes

I think we have a crucial point here - how do we define transparency, and where is the line between an NO’s obligation of secrecy towards its own customers - and its own operation - and what should commonly be shared with the community for the benefit of all of us?

For me (RockLogic), transparency is the way to go, because I think it is the right thing to do. But there are two crucial aspects to it when we want this principle to be a common ground.

First, the same degree of transparency, or the same rules for transparency, should apply to everyone. In any field there are always contributors who go first and discover things, sometimes involuntarily, sometimes through mistakes - but if the key learnings from their actions are made common knowledge, their contribution is crucial to progress. Also, there are always those who sit quietly in the background, waiting to learn. And while they contribute their part as well, they experience less trouble in doing so - and this could result in others being over-cautious when sharing their knowledge.

Second, transparency and the forthrightness to admit to mistakes should not be punished! There is an important message here: tell us what you did wrong and how you amended it, and we will stand by you, because we can all learn from it. Just imagine it the other way round: you did something wrong, you feel an obligation to make it public for the good of the community - and suddenly you find yourself punished in what may well be an excessive way. Would you then be transparent a second time?

Diversity is crucial to a functioning Ethereum environment, but at the same time, there must be a reasonable framework of basic rules for how NOs can successfully operate on Lido. But as we are all still in a phase of defining processes and how things can best work out, transparency without fear is the way to go - for everybody and their future. And to be clear: I am not talking about malicious and premeditated behavior, which of course must be punished - I am talking about incidents that can and will happen notwithstanding the best of intentions, and about how we, as a community, deal with them and with the parties concerned.

3 Likes

First, the same degree of transparency, or rules for transparency, should apply to everyone

Not sure that can or even should be done. The DAO explicitly doesn’t want to say how NOs should run their stuff. That includes how much they want to share, though there’s been a push for more transparency, which I welcome.

Being transparent without fear seems to work well so far. We both have experience with suboptimal configuration, and being transparent about what caused it. I haven’t seen any punishment, quite the opposite. Even to the point of having the protocol take on reimbursements instead of holding the NO fully accountable.

Accountability is an important concept as well. Which ties into transparency.

When we showed up with missing attestations the past few weeks, this was embarrassing as all get-out. So we’re on the hook, I feel, to investigate and give an explanation. Which ended up being “Lodestar didn’t work for us, nor did Nimbus”. We’ve been rock solid with zero performance alerts on Lighthouse and Teku, and sharing that experience with the community at large feels meaningful. Even as seeing ourselves in the alert channel all the time was painful.

There hasn’t been any negative feedback on that, not even a hint. I think it’s understood that values such as client diversity are worth pursuing, even if that leads to suboptimal results while lessons are being learned, and even if the diversity attempt has to be rolled back again.

1 Like