Research for an Operator Scoring System [Completed]

The following deliverable is the completion of this research into the data available for a node operator scoring system. This research examined available datasets, both on-chain and off-chain, for feasibility and usefulness. We critiqued existing scoring systems and suggested a structure for a future Lido system.

Please find the full deliverable HERE.

Original grant proposal HERE.

Our focus was on identifying what data exists for the development of a NO scoring system that effectively balances stakeholder rewards with risk mitigation. While NOs can influence the rewards their validators earn, they have far greater control over, and responsibility for, minimizing penalties and slashing. Therefore, optimizing for these risk mitigation factors should be a better proxy for long-term performance.

Viewing a scoring system through the lens of risk allows us to craft a healthy and resilient validator set, which should in turn lead to higher performance over an indefinite number of epochs.

Our investigation centered on identifying potential data sources and metrics crucial for an effective NO scoring system. We explored both on-chain and off-chain factors, recognizing their distinct impacts on overall NO performance and risk management. On-chain, we analyzed millions of data points to identify whether the datasets have the characteristics necessary to be used in scoring.

Key Findings and Recommendations

Limitations of On-Chain Data-Only Systems

While it is feasible to create a NO scoring system based solely on on-chain performance data, such a system would be substantially deficient. It would fail to account comprehensively for the many risk factors that are pivotal to a robust and reliable scoring system. We found that risk mitigation data was overwhelmingly necessary for the creation of any NO scoring system. Without it, a system would optimize for some other outcome, with no insight into the unseen accumulation of these risk factors. Therefore, an on-chain data-only approach would be considerably limited in its effectiveness.

Priority of Off-Chain Risk Data

Our research underscores the paramount importance of off-chain risk data in any effective scoring system. While this necessitates a departure from a fully trustless scoring system, it’s a necessary compromise to achieve a realistic assessment of NO performance.

Risk-Based Scoring Framework

The scoring system should prioritize minimizing penalties and slashing, key factors in long-term performance and stability. Key risk factors include internal processes, hardware, client and server locations, jurisdiction, and operator concentration.

Incorporating On-Chain and Off-Chain Data

The system should use a blend of on-chain data and critical off-chain risk data. This approach acknowledges the necessity of human involvement and increased transparency from NOs for a comprehensive risk assessment. Gathering this data is likely to conflict with a transition to permissionless, anonymous NOs. We find that it will be critical to create an incentive structure for NOs to truthfully disclose information and a remediation system to investigate discrepancies. Without this information, the DAO has little transparency into the accumulation of risk in these factors and hence cannot properly maintain the health of the set.

MEV Data Exploration

Our study has identified MEV (Maximal Extractable Value) data as an intriguing area for future exploration. This includes potential optimizations for capturing MEV and tracking/preventing MEV theft by operators. However, currently, MEV data is not a viable metric for the scoring system due to implementation challenges and its relative unreliability as a dataset.

Data Source Reliability and Selection

Rated.network is identified as a suitable source for on-chain data, given its relatively high accuracy and robust API.

Community Engagement

Engaging with the Lido DAO community is crucial, especially around areas like client diversity, MEV strategies, and key management, to ensure the scoring system aligns with community values and risk tolerance. We find that the DAO may benefit from creating stricter mandates for NOs regarding systems, internal processes, and information disclosure. The economic value to NOs from participation in Lido is immense, and hence the DAO has significant power to enforce standards that will allow for the creation of a stronger scoring system and a healthier validator set.


I’m really glad this research has been published. I’ve had a bit of a personal accident so will be out of commission for a few days, but I hope to engage in detail at the earliest possible opportunity.

PS
I took the liberty of relabeling this in the “community grants / initiatives” category.


First off, want to thank both the OP (@ccitizen) and Lido for funding this work. As the publisher of RAVER (with Rated) and rating system aficionado, it’s super valuable to have an outside view do this thorough a dive.

Also happy to see our DB & API powered a big chunk of the analysis.

Just read through the whole piece, coming back with a few comments:

1. The RAVER DOES incorporate inclusion delay. We in fact purposefully decided to keep it as a factor even post-Altair when its “tail” was chopped off, because we think there’s useful information embedded in it.

Quote:

Rated prioritizes generalizability and legibility in their design goals, hence not scoring
this lateness. For our scoring system, this is unlikely to be the best approach because it
would allow NO’s to accumulate penalties and missed rewards without being negatively
scored.

This can be moderated by monitoring of inclusion delay, which can be a helpful signal
for predicting an NO’s propensity for lateness. Rated does not use inclusion delay,
instead looking at lateness as binary based on correctness. [THIS IS INCORRECT]

However, it’s not obvious that the inclusion delay contains zero useful information. If a
validator is attesting extremely slowly, but never slow enough to be penalized, this
matters to us. This slowness suggests that for a given volatility in their network or
hardware performance, they will be more likely to be so delayed as to result in a penalty
when compared to other NO’s.

More specifically, inclusion delay is the denominator of the “attestation effectiveness” component of the RAVER. See more here: RAVER v3.0 [current] - Rated Docs
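As a rough illustration of what sitting in the denominator means, here is a simplified sketch. It is not the actual RAVER formula: the function name and the 0-to-1 `correctness` numerator are assumptions made purely for illustration, and the linked docs remain the reference for the real definition.

```python
def attestation_effectiveness(correctness: float, inclusion_delay: int) -> float:
    """Hypothetical, simplified score: correctness divided by inclusion delay.

    correctness     -- assumed value in [0, 1] for how correct the vote was
    inclusion_delay -- slots between the attested slot and inclusion (>= 1)
    """
    if inclusion_delay < 1:
        raise ValueError("inclusion delay must be at least 1 slot")
    return correctness / inclusion_delay


# Two fully correct attestations: one included at the first opportunity,
# one included a slot later.
fast = attestation_effectiveness(1.0, 1)
slow = attestation_effectiveness(1.0, 2)
print(fast, slow)
```

Under these assumptions, the slower attestation scores lower even though both votes are fully correct, which is the kind of signal the quoted passage asks for.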

2. re: Aggregation & scoring validator indices, on daily increments and aggregating upwards from that. That’s indeed how we compute the RAVER, and glad to see that @ccitizen points out the benefits of batching in daily increments (for storage efficiency). We’re currently exploring moving to epoch boundary and the sum of duties across validator keys (i.e. an operator) in Q1 2024; at this point we need to gather more data as to what the marginal benefit is in terms of accuracy (I suspect small), but it might unlock efficiencies in terms of replicability (very important to us).
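To make the "score validator indices daily, then aggregate upwards" idea concrete, here is a minimal sketch. The row layout, the `operator_of` mapping, and the use of a plain mean are all assumptions for illustration; this is not a description of how the RAVER is actually aggregated.

```python
from collections import defaultdict
from statistics import mean


def operator_daily_scores(rows, operator_of):
    """Roll per-validator daily scores up to per-operator daily scores.

    rows        -- iterable of (day, validator_index, score) tuples (assumed layout)
    operator_of -- dict mapping validator_index -> operator name (assumed mapping)
    Returns {(operator, day): mean score of that operator's validators that day}.
    """
    buckets = defaultdict(list)
    for day, index, score in rows:
        buckets[(operator_of[index], day)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


rows = [
    ("2023-12-01", 1, 1.0),
    ("2023-12-01", 2, 0.5),
    ("2023-12-01", 3, 1.0),
]
operator_of = {1: "operator-a", 2: "operator-a", 3: "operator-b"}
print(operator_daily_scores(rows, operator_of))
```

Storing one row per validator per day keeps the dataset small relative to per-epoch records, at the cost of losing intra-day detail.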

3. re: Sync committees; I’m very glad we seem to come to the same conclusion there, and felt like the following observation was a really good one.

Scoring sync committees differently could also add perverse incentives where NO’s
want to avoid missing sync committees and therefore defer maintenance or necessary
downtime.

This could encourage NO’s to take offline a subset of a cluster of validators on a single
server to perform some maintenance while trying to keep the validators in the sync
committee online. This adds the additional risk of slashing if transferring of keys
between servers is required.

4. re: EL missed rewards; at the peril of appearing like the man with the hammer, my view is that indexing on the specific value of a missed block in a given slot comes with many disadvantages that are not adequately covered. Namely:

  • it is unfair to the operator, given that the nominal value of the block is completely outside their control (relative value is in their control, but again, you are looking at an “at which second did the operator pick the block” type problem). maybe it’s ok in Lido’s active set today, but is it ok in a Lido set where you have a more diverse ensemble of operators, including, for example, the CSM?
  • the primary source of data is relay data APIs, and that’s just not reliable enough imo. relay data APIs often break, have spotty archives, and come with no strong guarantees of integrity. this is compounded by the fact that relays still have no business model attached to them, and thus no real incentive to provide a better service there.
  • the min bid publishing idea seems good in theory, but implementing it properly introduces the kind of complexity that I’m not sure is desired. I think it’s ok to want to use primary data from Relay APIs to ascertain whether the protocol received the MEV/EL rewards it should have at the correct fee recipient address, but would caution against further embedding EL related data into “performance”. doing that would require requesting several orders of magnitude more data on a per slot basis from the Relay APIs, for 33% of the Beacon Chain no less.

Endnote
Overall agree with most of the findings. On our part we’re very confident that the RAVER is the best metric of onchain performance, as it provides a very high-density interface that translates gracefully across operation sizes. It’s certainly not the whole picture, but it captures a big part of it very well.


Hey @ccitizen thank you for this research!
The remainder of the grant agreed previously has been transferred.
