Discussion - Treatment of Potentially Harmful Incidents

The slashing incident of April 13, 2023 has got us all thinking about its causes and how to remedy them - and perhaps even more importantly, about how to prevent incidents like this happening to Ethereum and its contributors in the future.

Whether you call this particular incident a mistake, a reluctant reaction, human error, system failure or anything else, there will always be minor and major incidents that are unforeseeable; we need a proper strategy to deal with these.

Since mistakes provide a valuable opportunity to learn, grow and make things better, we want to start an open discussion within the Lido DAO community on whether or not we should establish a jointly accepted, well-proven and commonly known procedure for how to treat potentially harmful incidents. And if so, what such a procedure should look like.

The slashing incident that occurred to RockLogic GmbH in April 2023 and triggered this initiative is one of a wider variety of incidents that have happened before across the community. As we understand it, all these incidents have been handled professionally by those involved - but a common procedure has been distinctly lacking.

Which raises these questions:

  • Would it be better to jointly develop a structured, universal procedure for how to treat things in case of emergency?
  • Should we have a common “recipe” for how to deal with dangerous incidents? Should we develop open guides or emergency protocols that can be easily executed BEFORE disaster strikes?
  • Should we adapt procedures, generalize them, and make them mandatory?
  • Is there more we should think of, beyond monitoring & alerting, automated tests, doppelgänger protection, node configuration changes, key handling, release precautions & prior testing, …?

In short, should we have a universal emergency plan?

Think of it as a fire extinguisher: it will not help you with every problem that might occur on the way - but if something happens, it is easily accessible, and you know how to use it. It will kill minor flames in an instant and help mitigate, or in the best case even prevent, large fires.

We are keen to hear your opinions on the subject, and hope this post kicks off a healthy and fruitful discussion on the matter!


So my own two cents on this: I would be reluctant to draw a line in the sand here and hold a vote to say “the DAO thinks you should operate this way and not that way”.

The curated node operator set is expected to be better at this than most DAO members, and, I mean, some of these are client development teams, and others are also super professional.

The art of blockchain (and Ethereum) ops has stabilized a lot lately but is still decently green. I do not think that a strict, one-size-fits-all operational manual would be a good thing atm.

Having a self-regulation codex of good practices and voluntary, self-imposed standards (ethical, organizational, technical) could be a good start here.


I cannot speak to what node operators should do broadly, but I can speak to the guidelines we use.

  • “Blast radius” contained to 1,000 keys per “environment” - VC, remote signer if in use, CL:ELs
  • Keys don’t move. Resilience comes from having multiple CL:EL, resilient remote signers, and VC in k8s (or any container orchestration)
  • Maintenance of any one node does not remove failover resilience: In practice this means three CL:EL, three k8s nodes, and if in use, three or more Dirk instances, per environment of 1,000 keys
  • Avoid supermajority clients (only Geth, currently), as they carry a low-probability but catastrophic-impact risk of stranding the validators on a non-canonical chain and getting them leaked down to 16 ETH or below.
  • Embrace client diversity: The three CL:EL in an environment use two or three different client combinations. This way, a bug in any one client does not cause prolonged downtime.
  • Embrace geographic diversity: Nodes within an environment are in different regions within the same broader geographic region (APAC, EMEA, AMERS); place environments in different broad geographic regions, which serves to spread keys out around the globe.
  • Alerts in Lido Discord re validator performance are to be treated as real, because they are. Combat alert fatigue by finding root cause for alerts and improving things.
  • IaC is your friend; metrics are your friend
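
The environment guidelines above lend themselves to an automated check. The following is a minimal sketch, not our actual tooling: the `Environment` and `NodePair` types, the region labels, and the function name are all hypothetical, and the thresholds simply restate the numbers from the list (1,000 keys, three CL:EL pairs, at least two client combinations, no supermajority execution client).

```python
from dataclasses import dataclass, field

# Thresholds taken from the guideline list; everything else is illustrative.
MAX_KEYS_PER_ENV = 1_000      # "blast radius" per environment
MIN_NODE_PAIRS = 3            # three CL:EL pairs so maintenance keeps failover
SUPERMAJORITY_EL = {"geth"}   # per the post, only Geth at the time of writing


@dataclass(frozen=True)
class NodePair:
    """One consensus-layer / execution-layer client pair (hypothetical type)."""
    cl: str
    el: str
    region: str


@dataclass
class Environment:
    """A group of keys served by its own set of node pairs (hypothetical type)."""
    name: str
    key_count: int
    pairs: list[NodePair] = field(default_factory=list)


def check_environment(env: Environment) -> list[str]:
    """Return a list of guideline violations; an empty list means none found."""
    problems = []
    if env.key_count > MAX_KEYS_PER_ENV:
        problems.append(f"{env.key_count} keys exceeds {MAX_KEYS_PER_ENV}")
    if len(env.pairs) < MIN_NODE_PAIRS:
        problems.append(f"only {len(env.pairs)} CL:EL pairs, want {MIN_NODE_PAIRS}")
    # Client diversity: at least two distinct CL:EL combinations.
    combos = {(p.cl.lower(), p.el.lower()) for p in env.pairs}
    if len(combos) < 2:
        problems.append("fewer than two distinct CL:EL client combinations")
    # Supermajority clients are reported once, however many pairs use them.
    risky = sorted({p.el.lower() for p in env.pairs} & SUPERMAJORITY_EL)
    if risky:
        problems.append(f"supermajority execution client in use: {', '.join(risky)}")
    # Geographic diversity: every pair in its own region.
    if len({p.region for p in env.pairs}) < len(env.pairs):
        problems.append("two or more node pairs share a region")
    return problems
```

Running something like this in CI against IaC definitions would flag drift before it reaches production; which checks are worth enforcing, and at what thresholds, remains each operator's call.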

That note about alerts aims at running well, in addition to running reliably. Looking at the alerts one can get the impression that there’s a temptation to give up - “that thing squawks all the time”. I’d encourage fellow NOs to not give up. That alert mechanism is working well, and those alerts are real and actionable.
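
One way to work against alert fatigue is to tally which sources fire most often and chase root causes there first. This is a rough sketch under assumptions: the `Alert` shape, the alert kind names, and the `min_count` threshold are made up for illustration and do not reflect the format of the Lido Discord alerts.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Alert(NamedTuple):
    """Hypothetical alert record: where it fired and what kind it was."""
    environment: str
    kind: str


def recurring_sources(alerts: Iterable[Alert], min_count: int = 3):
    """Return (alert, count) pairs seen at least min_count times, noisiest first."""
    counts = Counter(alerts)
    return [(alert, n) for alert, n in counts.most_common() if n >= min_count]
```

The output is a to-do list for root cause analysis rather than a mute list: the point is to fix whatever keeps firing, not to silence it.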


I think I can’t edit a post, and realize I should have started with the kind of harmful events these guidelines seek to minimize risk for.

  • Slashing. This risk should be so close to zero as to essentially be zero, short of a catastrophic bug in the slashing protection DB, which has never been seen in any client. That’s why keys don’t move. Don’t give humans a chance to make a mistake.
  • Keys getting “stranded” on the wrong, finalizing chain. This risk can be reduced to near-zero by avoiding supermajority clients. See also Upgrading Ethereum
  • Prolonged offline penalties. This risk can be reduced by good and actionable alerting, as well as client diversity.
  • Sub-optimal rewards. This risk can be reduced by acting on on-chain alerts for sync committee participation, missed attestations, missed block proposals, and so on.

Thanks Stefan for kickstarting the discussion! One of the great things about decentralized protocols and DAOs is the many different ways you can approach a challenge or set of problems. I think the guiding questions you’ve posed are certainly in the right direction, but I perhaps disagree a bit in terms of follow-through.

I would like to try to lay some groundwork in terms of how I think of the protocol:

  • The Lido on Ethereum protocol is essentially software / middleware. It doesn’t have agency and can only do things which it is programmed to do, either as a reaction to things that happen to/from it, or through interactions with it by third parties.
  • The DAO, to the extent to which it can action things (via votes), should attempt to be a guardian of (a) the protocol (Lido on Ethereum), (b) its users, and (c) the underlying network. It’s essentially a utility maximization <> risk minimization exercise.
    • Because Ethereum is stronger when it’s robust, this means things like fostering decentralization, diversity, and variety, versus picking just what is “most profitable” and focusing on that.
    • As the protocol matures, the things which the DAO can do and its total power over the protocol should be minimized, and its role in the routine functioning of the protocol should basically end up somewhere between “none at all” and “in case of emergency, break glass”. There are technical limitations (e.g. lack of withdrawals until recently, ability to consume beacon state in the EL directly, etc.) which substantially affect the potential and practical mechanisms of trying to make the protocol fully autonomous, but DAO involvement should be minimized at every step of the way based on context, protocol design decisions, and technical environment limitations.

What does this mean for NOs and other entities interacting with the protocol and the DAO? I agree that it makes sense to have a more structured approach for how the NOs interacting with the protocol may deal with emergencies / incidents. However, as far as the DAO is concerned, I think the approach should be more of something like a policy rather than a procedure. Basically: it should outline what the goals are, what the expected results are, and what happens if the expected results are not achieved. Node Operators can then be encouraged to do their own thing, in their own way, within the set of objectives that aim for a robust validator set for the protocol and Ethereum. Ideally, NOs collaborate and coordinate on developing these operational practices and they (and the wider community) hold each other accountable, vs needing the DAO to do it.

  • I don’t think the DAO should enforce or be involved in the reviewing/checking of specific practices (e.g. via workstream contributors). I think the community of NOs can certainly do so, and there can be a culture of sharing of information, tools / processes, and self-reporting relevant information, and the DAO could support this effort (e.g. via grants, open source tooling, collaborations with 3rd parties to offer things like security and process assurance frameworks and audits). But I don’t think it’s the DAO’s job to be an arbiter or an enforcer of “these are the right procedures” and “do you have the right procedures set up”. There are a few reasons for this:
    • For the protocol to truly be permissionless, NOs should be able to interact with the protocol without requiring this kind of effort by the DAO or DAO contributors
    • The protocol’s design should be built to minimize risk via various mechanisms (e.g. selection of professional operators, bonding, scoring, etc). As the protocol matures, and more options are available, the manner in which NOs can interact with the protocol will widen, the number of NOs who interact with it will increase drastically, and it is not practical for the DAO to enforce anything that isn’t results-based.
    • Establishing and requiring conformity amongst practices can lead to correlated failure and doesn’t allow for NO “opinionatedness” in certain things where an NO may do things differently for a variety of reasons (and doing differently is fine, provided that it’s done correctly).
  • Instead, assessment of these things can be largely results based, kind of like the approaches followed in the proposed Validator Exits Policy and the ratified Block Proposer Reward Policy. (The latter does entail a measure of “this is how you should do it” given the large set of technical and security considerations, but the policy itself stipulates that other mechanisms could be utilized by NOs provided they are discussed, assessed, and tested first.)
  • NOs can and should be encouraged to co-develop these detailed operational and security processes / practices / standards and self-regulate.
  • The DAO then only really needs to worry about getting involved in edge cases where, for example, self-regulation has failed, the protocol has been endangered, serious outages / incidents may have caused doubt with regards to the continued performant or sustainable operations of an NO, “real world” (meatspace) events which the protocol cannot address necessitate action, or the trust between participants in the protocol has been damaged.