Slashing Incident involving RockLogic GmbH Validators - April 13, 2023

As a final part to this thread, and following on from the ongoing discussions about establishing policy around large incidents (Discussion - Treatment of Potentially Harmful Incidents), I propose that we proceed with a discussion and a vote (if deemed necessary) in order to bring this topic to a conclusion. I will present my thoughts using a loose framework for ad hoc consideration of the slashing event (linked below), but any thoughts are useful regardless of whether this same framework is used.

Using the framework outlined in my post in the referenced thread, my view is as follows:


Factors around the incident itself

  • was the act malicious or not

    No

  • the proximate cause of the event

    1. A peculiarity of the node setup plus the lack of a fallback node pair, which led to keys being moved to maintain liveness following a node failure
    2. A technical bug (i.e. the key manager reporting that it had deleted keys when it had not), which provided false confirmation that the keys had been removed from the original node
    3. A lack of additional preventive measures (e.g. keeping the original node offline, or nuking it)
  • whether any best practices, infrastructure setups or configurations, common safety measures, or reasonable processes / mechanisms could have prevented the slashing from happening

    • Yes: a different type of setup could have prevented this (e.g. the use of fallback node pairs or threshold signing), and regardless of setup, additional measures (e.g. nuking the original host, or independently verifying that keys were actually removed; see the sketch after this list) would have prevented this from happening
  • how quickly was the issue identified and by whom

    • Quickly but not immediately (i.e. within ~10-12 minutes), and not by the node operator (although there is reason to believe it would have been caught eventually, i.e. relevant monitoring was in place but not optimal)
  • how quickly was the issue resolved

    • Impact was mitigated quickly once the issue was observed. Resolution, namely the updating of infrastructure and the modification of processes to be more resilient, took place relatively quickly and in a transparent manner; see the weekly updates in this thread (an example)

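As a side note on proximate causes 2 and 3 above: one mitigation is for the migration process to independently verify that keys are actually gone from the original host rather than trusting the key manager's response. Below is a minimal sketch of such a check in Python, assuming the original host exposes the standard Ethereum keymanager API; the endpoint URL, token, and function names are illustrative and not a description of RockLogic's actual tooling.

```python
import requests

# Hypothetical values; in practice these point at the *original* host's keymanager API.
KEYMANAGER_URL = "https://old-host:7500"
AUTH_TOKEN = "<keymanager-api-token>"
HEADERS = {"Authorization": f"Bearer {AUTH_TOKEN}"}


def list_remote_keys() -> set[str]:
    """Ask the original host which validator keys it still holds."""
    resp = requests.get(f"{KEYMANAGER_URL}/eth/v1/keystores", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return {item["validating_pubkey"].lower() for item in resp.json()["data"]}


def delete_keys(pubkeys: list[str]) -> None:
    """Request deletion of the keys on the original host."""
    resp = requests.delete(
        f"{KEYMANAGER_URL}/eth/v1/keystores",
        headers=HEADERS,
        json={"pubkeys": pubkeys},
        timeout=10,
    )
    resp.raise_for_status()


def safe_to_activate_elsewhere(pubkeys: list[str]) -> bool:
    """Do not trust the delete response alone: re-query and confirm the keys are gone."""
    delete_keys(pubkeys)
    remaining = list_remote_keys() & {pk.lower() for pk in pubkeys}
    if remaining:
        print(f"Refusing to migrate: {len(remaining)} key(s) still present on the original host")
        return False
    return True
```

Only once a check like this passes (ideally combined with taking the original host offline entirely) would the keys be imported and activated on the new host.
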
Consequence considerations:

  • Which “module” did this happen in; or, more broadly, what are the trust assumptions associated with the affected validators (e.g. are the validators unbonded)
    • The event happened in the curated operator module where Node Operators are entrusted with running bond-less validators.
  • impact of the event (in the case of slashing, the financial impact can be small but the damage to trust can be high; for example: how does the event affect the trust assumptions between stakers, the DAO, and the NO)
    • While the financial impact was relatively low (e.g. compared to daily rewards), trust assumptions in this module are very high (basically implicit) and stakers do not have any assurances or recourse against these events (e.g. insurance, unless they opt in individually). As such, an event like slashing, which the network treats as an attack, can be considered something that may require trust assumptions to be reset.
  • extenuating circumstances (what are the pros/cons of this decision, and what other substantial things may need to be taken into account – e.g. does the NO somehow bring key value to the protocol?)
    • RockLogic GmbH is an NO that has been doing a lot in terms of client diversity (on both the EL and CL (including validator client) side) and infrastructure diversity/resilience (bare metal only as of the most recent quarter), and is a very active participant in the Ethereum staking ecosystem, creating tools for stakers such as Stereum, SyncLink, SlashingGuard, etc.
  • what other options there are for the node operator to participate
    • Currently none, as no other modules are live.
  • what is the status of remediation of the issue
    • As per the updates provided by RockLogic GmbH, the issue has been remediated.
      • The technical bug in the Prysm key manager has been patched.
      • They have moved to an infrastructure model with multiple explicit EL+CL node pairs operating as fallbacks.
      • Their key migration processes have been updated to explicitly nuke the original nodes/hosts to prevent possible slashing (a rough sketch of this kind of decommissioning step follows this list).
      • All internal processes were reviewed, and some were made publicly available on the Lido research forums for the benefit of others.
  • is there a way to gauge likelihood of something like this happening again, and if so what is the assessed likelihood
    • Based on the above remediation, if procedures are followed as per the updates implemented by the Operator, then the likelihood is very low (note: this qualification doesn’t mean I have reason to believe they won’t be followed, just that it’s important to clarify that there is a non-ensurable element to the likelihood assessment).
  • how can remediation be assessed and, if it can, is the remediation deemed satisfactory
    • Through review of relevant documentation and code (e.g. in the case of technical bugs). With regard to execution, only assurance from the operator can be provided. In the future, things like tests of internal controls and certification thereof (e.g. via SOC 2 reports) may add additional comfort; however, even these do not provide assurance of execution.

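For context on the “nuke original nodes/hosts” step referenced in the remediation list above, here is a rough sketch of the minimum such a decommissioning step needs to achieve, namely making the original host incapable of signing before keys are activated anywhere else. It assumes a validator client managed by systemd; the service name and paths are hypothetical placeholders, not RockLogic’s actual setup, and a true “nuke” would typically go further and wipe or reinstall the host entirely.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical service name and data locations; the real ones depend on the
# client in use and how the host was provisioned.
VALIDATOR_SERVICE = "validator.service"
KEYSTORE_DIR = Path("/var/lib/validator/keystores")
SLASHING_PROTECTION_DB = Path("/var/lib/validator/slashing-protection.db")


def decommission_original_host() -> None:
    """Ensure the original host can no longer sign before keys go live elsewhere."""
    # Stop and disable the validator client so it cannot come back after a reboot.
    subprocess.run(["systemctl", "stop", VALIDATOR_SERVICE], check=True)
    subprocess.run(["systemctl", "disable", VALIDATOR_SERVICE], check=True)

    # Remove the key material and the local slashing-protection database.
    # (The slashing-protection history is assumed to have been exported and
    # imported on the new host *before* this step.)
    if KEYSTORE_DIR.exists():
        shutil.rmtree(KEYSTORE_DIR)
    if SLASHING_PROTECTION_DB.exists():
        SLASHING_PROTECTION_DB.unlink()

    print("Original host decommissioned; keys can now be activated on the new host.")


if __name__ == "__main__":
    decommission_original_host()
```
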
Consequence options

  • Do nothing
  • Warning (do nothing, with the condition that the next time the consequence is one/any of the below)
  • Limit the Node Operator’s key count for a certain period of time
  • Decrease the Node Operator’s key count (by prioritizing those keys for exit)
  • Offboard the operator (with the ability to rejoin the permissioned set at a later time)
  • Offboard the operator (without the ability to rejoin the permissioned set at a later time)

Based on the above evaluation, and the described possible outcomes for slashings in the curated operator set, I believe that the appropriate course of governance action in this case lies somewhere between “Do nothing” and “Offboard the operator (with the ability to rejoin)”. “Do nothing” and “Limit the Node Operator’s key count” have effectively already occurred, as these were part of the response following the incident. So the question at hand is what the next step should be (does the limit stay, is another option chosen, or is the limit lifted?). At this juncture, I believe there should be at least rough consensus on how to move forward, and discussion and input from relevant stakeholders should take place before any specific action is taken (if at all). A vote isn’t strictly necessary for some of the outcomes, but may be useful for signaling nonetheless.

Personal opinion
Given the lack of clarity around this topic up until now (for example, no already-laid-out consequences for slashings / serious incidents), the length of time since the slashing, the operator’s overall contribution to the ecosystem, and the successful and timely implementation of remedial actions, I personally lean towards “Limit the Node Operator’s key count for a certain period of time”. Since the key count was already limited shortly after the slashing event (discussion, vote), and no issues have been noted with the NO’s validators apart from performance issues which the operator has compensated stakers for, I would propose that the NO’s key count rise together with the coming (pending DAO votes) Ethereum Wave 5 cohort, i.e. once new node operators are added to the set (with a timeframe of ~mid-August to mid-September). Once the new cohort reaches the level of RockLogic (new node operators are always prioritized for deposits), RockLogic would then also recommence adding validators to the validator pool.

EDIT:
Post has been updated to add links to Stereum, SyncLink, and SlashingGuard, which were mistakenly left with an initial “add link” note instead of fully functioning links.
