LoE Validator Exits: Delinquency Incident involving Chorus One

On October 3, 2023, the Accounting Oracle report indicated that node operator Chorus One (id #3) had failed to process, in a timely manner, 31 exit requests that were signaled via the Validator Exit Bus Oracle (VEBO) on September 28. As a result, these validators were automatically marked as “stuck”, the Node Operator (NO) was considered delinquent, and the NO would receive half rewards until both a) the overdue exits were processed and b) a cooldown period had expired.
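The penalty mechanics above can be sketched as follows. This is a simplified, illustrative model (the struct fields, function name, and constant are assumptions for this sketch, not the protocol's actual code): half rewards apply while any exits are overdue, and continue until the post-remediation cooldown expires.

```python
from dataclasses import dataclass

REWARD_SHARE_PENALIZED = 0.5  # half rewards while delinquent, per the policy above

@dataclass
class OperatorStatus:
    stuck_validators: int     # exit requests past the processing deadline
    cooldown_expires_at: int  # unix timestamp; 0 if no cooldown is pending

def reward_share(status: OperatorStatus, now: int) -> float:
    """Fraction of rewards the operator receives.

    Half rewards apply while any exits are "stuck" AND until the cooldown
    period that starts after remediation has expired.
    """
    if status.stuck_validators > 0 or now < status.cooldown_expires_at:
        return REWARD_SHARE_PENALIZED
    return 1.0

# During the incident: 31 stuck exits -> half rewards
assert reward_share(OperatorStatus(31, 0), now=100) == 0.5
# Exits processed but cooldown not yet expired -> still half rewards
assert reward_share(OperatorStatus(0, 200), now=150) == 0.5
# After cooldown expiry -> full rewards restored
assert reward_share(OperatorStatus(0, 200), now=300) == 1.0
```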

Such events are handled automatically by the protocol: exits are re-routed (if necessary) to other operators so that withdrawals are not affected.

As per the stipulations in the LoE Validator Exits Policy, this post serves as formal evidence of the issue raised with the Node Operator. The Node Operator was also contacted separately by NOM workstream contributors when the issue was identified.

The Node Operator was contacted and actions to remediate the situation were successfully taken on October 3. Chorus One will provide an analysis of the incident and the remediation actions undertaken. The 31 exits in question were processed on October 3, and the numerous exit requests since then have also been processed in a timely manner. The Node Operator has since resumed a status of in good standing and received full rewards as of today’s Accounting Oracle report. The course of the incident and the penalties can be observed in the LoE NO Rewards and Penalties dashboard.

Further, as per policy:

Due to the re-routing of validator exit requests, the DAO should consider (via an ad-hoc vote) overriding the total limit of active validators for the relevant Node Operator such that if/when they resume a status of in good standing, they are not benefiting at the expense of Node Operators who took over the processing of the re-routed exit requests.

What I see on-chain now:

  • The NO has resumed a status of in good standing as of yesterday (the Accounting Oracle report for Oct 10, 2023 should show the operator receiving full rewards following the expiry of the cooldown period)
  • The NO is operating 9427 active validators, which is fewer than other NOs in the same cohort (e.g. P2P) who were at a similar count before this set of exits was signalled by the protocol. Due to the deterministic exit order (see code), Chorus One has essentially been prioritized for exits (along with the oldest two waves of Node Operators) by the “stake weight” prioritization mechanism, which will generally continue to hold until the operator has less than 1% of Lido stake allocated (they are still above that threshold).

Given the above, I do not see a strong reason to set a target limit at this time, as a) the overdue exits were processed ASAP once the issue was noted and the NO began processing exits again the following day, and b) the NO is already prioritized for exits (unless there is an operator with a TargetLimit ahead of them).
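The exit-order reasoning above can be made concrete with a simplified sketch. This is an illustrative model only (field names, the sort key, and the threshold constant are assumptions, not the protocol's actual implementation): operators with a TargetLimit set are served first, then operators above the stake-weight threshold, with ties broken by validator count.

```python
from dataclasses import dataclass

STAKE_WEIGHT_THRESHOLD = 0.01  # operators above ~1% of Lido stake stay prioritized

@dataclass
class Operator:
    name: str
    stake_share: float      # fraction of total Lido stake
    has_target_limit: bool  # operators with a TargetLimit set jump the queue
    active_validators: int

def exit_priority_key(op: Operator):
    """Sort key for the exit queue: lower tuples exit first.

    1. Operators with a TargetLimit set are served first.
    2. Then operators above the stake-weight threshold.
    3. Ties broken by number of active validators (largest first).
    """
    return (
        not op.has_target_limit,
        op.stake_share < STAKE_WEIGHT_THRESHOLD,
        -op.active_validators,
    )

ops = [
    Operator("Chorus One", 0.012, False, 9427),  # still above 1% -> prioritized
    Operator("SmallOp",    0.004, False, 3000),
    Operator("LimitedOp",  0.002, True,  2500),  # TargetLimit set -> goes first
]
queue = sorted(ops, key=exit_priority_key)
assert [o.name for o in queue] == ["LimitedOp", "Chorus One", "SmallOp"]
```

Under this model, Chorus One keeps absorbing exit requests ahead of smaller operators as long as its stake share stays above the threshold, which is the basis for not setting a target limit here.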


Hello everyone, we want to share our incident report.

What happened
On September 28, 2023, a total of 31 validator exits were requested by the Lido Oracle. These were not processed in time by Chorus One due to an unexpected oversight in our automation software, which led to the delinquency event described above. As of October 10, 2023, all automatic exits are working as expected and no further issues have been observed.

Root cause

  • Beacon Node and Validator Ejector were both working without a problem.
  • As part of a recent update of Chorus One’s internal key management API, we added exposure of deposit data for every validator.
  • In our key management software, Lido validators have a special status because of how their deposit data is generated: we only import the public key and withdrawal credentials, not the full deposit data.
  • Because the deposit data was missing, the Lido validators were skipped and not exposed via the key management API to our exit software. This software works by listing all validators by their internal ID and then using that internal ID to exit them. At the time of the incident, this list showed no validators for Lido.
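The failure mode above can be reconstructed as a short sketch. This is a hypothetical illustration (the function and field names are ours, not Chorus One's actual code): validators without full deposit data were silently dropped from the API listing, so the exit software never saw them.

```python
def list_exposed_validators(validators):
    """Return the internal IDs of validators visible to the exit software."""
    exposed = []
    for v in validators:
        if v.get("deposit_data") is None:
            # BUG: Lido validators store only pubkey + withdrawal credentials,
            # so they were skipped here and never surfaced for exiting.
            continue
        exposed.append(v["internal_id"])
    return exposed

validators = [
    {"internal_id": 1, "deposit_data": {"pubkey": "0xabc..."}},  # regular validator
    {"internal_id": 2, "deposit_data": None},                    # Lido validator
]
# The Lido validator (id 2) is invisible to the exit software:
assert list_exposed_validators(validators) == [1]
```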

We first observed this issue on September 27, 2023, and it was remediated by introducing deposit data into our database on September 29, 2023; Lido validator exposure has been working correctly since that date. Currently, the exits are processed in a fully automated manner.

Unfortunately, this fix was applied only after the 31 exits were requested on September 28, and we did not double-check whether pending exits remained from between these dates, so those exits were not caught in time, leading to delinquency status.

Currently, our exit automation does expose metrics indicating whether a validator was successfully found and exited by the automation software. Unfortunately, those metrics were NOT covered by automated alerting rules. Our main remediation has been adding automated alerts on those metrics. An additional step is adding a separate monitoring path that will scrape Lido Oracle events and create an internal incident if any “stuck” keys are detected. If software that allows this already exists, we are open to using it internally as well.
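The alerting check described above amounts to comparing requested exits against completed ones. A minimal sketch, with assumed names and a deliberately simplified condition (real alerting would run against scraped metrics in a monitoring system):

```python
def check_exit_metrics(requested_exits: int, completed_exits: int) -> list:
    """Return alert messages when the automation falls behind exit requests."""
    alerts = []
    if completed_exits < requested_exits:
        alerts.append(
            f"exit automation behind: {requested_exits - completed_exits} "
            "request(s) not yet processed"
        )
    return alerts

# Healthy: every requested exit was found and processed
assert check_exit_metrics(31, 31) == []
# Incident condition: 31 requests, none processed -> should page the on-call queue
assert len(check_exit_metrics(31, 0)) == 1
```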

As of now, automatic exits are working as expected, with no further issues observed. Automated alerting is running as well and has been tested to route into our 24/7 support notification queue.