Withdrawals: on validator exit ordering

The risk of operators having complete control over exits is seemingly the largest single risk that Lido faces with the upcoming introduction of withdrawals. It’s easy to see how a single bad actor, perhaps maliciously or because they feel backed into a corner financially, can cause contagion. I won’t go into detail here, purposefully.

The best solution here is one that minimizes Lido’s reliance on operators. There’s been some talk about an EIP that would allow exits to be triggered using only withdrawal credentials, but that was pushed back on. It seems quite unlikely that this ends up being the solution, at least within the year.

Instead, we should embrace (temporary) alternatives that address the core problem: the Lido design does not align with the Ethereum spec on withdrawals.

There’s very little time left until withdrawals are enabled, and a solution is needed ASAP to ensure that Lido isn’t frozen if operators manage exits poorly. All it takes is some operator incompetence for this to get ugly.


Thank you for raising concerns about the potential risks associated with the upcoming introduction of withdrawals in Lido. I personally estimate the risk of node operators going rogue and sabotaging withdrawals as very low. Alternative solutions that do not involve the EIP you mentioned would require custody (even distributed custody) of signed exit messages, which poses a greater risk than node operators sabotaging exits, since operators are risking their own reputation.

However, it’s crucial to design the validator ejection algorithm so that a single node operator failure does not freeze withdrawals in the protocol. The dev team is actively working on this and looks forward to hearing proposals or thoughts from the community on how best to handle this challenge. We are open to all constructive suggestions!


Primer: I don’t work for Lido. I have zero financial alignment with any node operator. I work in crypto, but the outcome of this thread has almost zero impact on me. I’m just a curious person who wants to try and help, and who has a skillset that allows me to. I’m quite busy, so there are likely some mistakes in this post (I had little time to do this research) or a misunderstanding of some Lido-specific concept. Apologies in advance.

There are a lot of things to consider when looking at how to optimally eject validators.

Primarily, the distribution of exits will play a huge role. If exits are normally distributed, the exit model will have one effect. If exits are closer to a Laplace distribution (fat tails), as I anticipate they will be, the model will have an entirely different effect.
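To make the tail difference concrete, here’s a small sketch (illustrative parameters, not Lido data) comparing daily exit demand drawn from a normal distribution versus a Laplace distribution with the same mean and variance:

```python
import math
import random

# Illustration only: daily exit demand under a normal vs. a Laplace
# distribution with matched mean and variance. All numbers are made up.
rng = random.Random(1)
n_days, mean, std = 10_000, 100.0, 30.0
b = std / math.sqrt(2)  # Laplace scale parameter giving the same variance

normal = [max(0.0, rng.gauss(mean, std)) for _ in range(n_days)]

def laplace_sample():
    # Inverse-CDF sampling for the Laplace distribution.
    u = rng.random() - 0.5
    return mean - b * math.copysign(1, u) * math.log(1 - 2 * abs(u))

lap = [max(0.0, laplace_sample()) for _ in range(n_days)]

def worst_day_ratio(xs):
    """How much bigger the worst day is than an average day."""
    return max(xs) / (sum(xs) / len(xs))

# Same average load, but the Laplace series has far worse peak days.
print(worst_day_ratio(normal), worst_day_ratio(lap))
```

The point of the sketch: under the fat-tailed distribution the average day looks similar, but the rare worst days are substantially more extreme, which is exactly what matters for operator revenue shocks.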

What happens if we exit by size?

Consider “Strategy A. Choose the validator for exit among Node Operators with a larger than 1% stake. Choose the Node Operator with the highest total age of the active validators.”

If the exits are normally distributed the largest operators get a fairly consistent amount of exit requests that brings them down towards the average, while the smaller operators are able to continue growing up towards this average.

Note: I’m aware that there are key limits on operators. But they seem to always get approved to go upwards, and I assume that will be the case unless an operator does something wrong. So, just keep in mind that this model assumes new capital flows uniformly to the smallest operators as if they have no limits (fair assumption IMO).

Fat tails

If the exits are very fat-tailed, driven by things like the sudden addition of withdrawals, market changes, or regulatory changes, then what you see is that day-to-day there are very few exits, but on a few days of the year an enormous number of validators need to exit.

Rather than the largest operators slowly exiting towards the average, they will have huge outflows all at once. This cuts a higher % of their revenue stream at once. This is not a desirable effect if the goal is to allow operators to run sustainable, profitable businesses where most participants have a roughly equal chance of profit.

Yes, you need to exit the largest players more, and because a single validator makes up a smaller % of their business they can survive it. But the rate at which this happens will matter.

With normally distributed exits the revenue change is manageable and smooth-ish, while with fat-tailed exits it is less manageable, because total costs cannot change overnight.

This might seem like overanalyzing when you look at the average day, but that’s precisely the point. With either distribution, the effect daily is similar, but once in a blue moon it will be much more painful.

Simulating by size

I created a simulation of the model, simply exiting the operators with the most validators first.

If we presume that 10% of validators need to exit (the period here is irrelevant because the largest operators get no new ETH anyway, so it could be a day, a week, or a month: any period over which they cannot reasonably scale down costs):
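A minimal sketch of this largest-first exit model (operator names and validator counts below are made up for illustration; they are not the real Lido operator set):

```python
# Sketch of "exit by size": every exit request goes to whichever
# operator currently runs the most validators. Hypothetical numbers.
def exit_by_size(operators, exit_count):
    """Exit validators one at a time, always from the currently largest operator."""
    counts = dict(operators)
    for _ in range(exit_count):
        biggest = max(counts, key=counts.get)
        counts[biggest] -= 1
    return counts

ops = {"op_a": 7000, "op_b": 6500, "op_c": 4000, "op_d": 2000}
total = sum(ops.values())                      # 19,500 validators
after = exit_by_size(ops, int(total * 0.10))   # 10% exit rate
for name in ops:
    cut = 1 - after[name] / ops[name]
    print(f"{name}: {cut:.1%} reduction")
```

Note how the greedy rule levels the top operators down towards each other while the smaller ones absorb none of the cut, which mirrors the pattern in the tables below.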

This is the reduction in validators:

Blockscape - Lido 16.8%
P2P - P2P Validator - Lido 16.8%
Staking Facilities - Lido 16.8%
Stakefish - Lido 16.8%
DSRV - Lido 16.8%
Allnodes - Lido 16.8%
ChainLayer - Lido 16.8%
Kiln - Lido 16.8%
RockX - Lido 16.8%
BridgeTower - Lido 16.8%
Chorus One - Lido 12.2%
Figment - Lido 12.2%
Simply Staking - Lido 5.4%

This is with only a 10% exit rate over the period. The common argument against this is that while there will be a large amount of capital wanting to exit once withdrawals go live, there will also be people wanting to enter because they had liquidity requirements holding them back.

This is true. But, all of that new capital flows to the smallest operators first and because the gap between the largest and smallest is so big, it makes almost no difference to this analysis, unless you think Lido will 2-3x very quickly.

A more extreme example

I’d argue that a 20% exit rate is quite possible (though not a high likelihood), over a month or two perhaps, so let’s look at that:

Blockscape - Lido 29.8%
P2P - P2P Validator - Lido 29.8%
Staking Facilities - Lido 29.8%
Stakefish - Lido 29.8%
DSRV - Lido 29.8%
Allnodes - Lido 29.8%
ChainLayer - Lido 29.8%
Kiln - Lido 29.8%
RockX - Lido 29.8%
BridgeTower - Lido 29.8%
Chorus One - Lido 25.9%
Figment - Lido 25.9%
Simply Staking - Lido 20.2%
Everstake - Lido 13.5%
InfStones - Lido 13.5%
Stakin - Lido 8.4%

The impact on the largest players is a little less than 2x, because you start to see more operators getting exited now. But the result is roughly the same, which is that the biggest operators get massively exited.

If we presume that validators are proportional to revenue (they almost certainly are not; you should expect to earn slightly more per validator as you scale), then these large players would lose 16.8–29.8% of their revenue in whatever period this exit happens, presuming a 10% or 20% exit rate.

This might seem like a bit of a harsh, worst case scenario analysis. Fair point, perhaps you could argue that. But I’ve also ignored other factors. The 10% and 20% exit rate here are purely based on the starting number of validators i.e. today.

But what if capital flows in as well?

If Lido sees an influx of capital, the total number of validators grows, meaning that more can be exited. So, if Lido sees a 30% increase in capital (all of which flows to the smallest operators), then reproducing the 10% exit scenario measured against today’s numbers would only require 7.7% of the validators at that future point to exit.

Let’s look at what would happen if 30% additional capital flows into Lido while 10% of validators (based on the current, pre-inflow numbers) exit. Overall you see a 17% increase in Lido capital and validators, but at the operator level the reduction in validators is exactly the same as if no capital had entered the protocol. Why? Because it all flows to the smallest operators, and 30% isn’t enough to equalize the number of validators among operators. In reality, as new operators get added, this problem becomes more extreme, because the largest players get dragged down towards the newest operators.

Randomize it

If it’s decided that the main goal is just to reduce the % of the network that operators run, a fairer system that reaches the same goal is to exit randomly. Larger operators have a higher chance of being exited, and so will be pushed towards the average. This comes at the expense of potentially exiting very new validators, but keep in mind that new validators are assigned to the smallest operators, so the chance of this is also very small. A randomized approach is far fairer, and if we assume fat-tailed distributions, this randomization smooths the revenue impact on the largest players, preventing any catastrophe.

It also has the characteristic of a stronger effect when the gap between the largest and smallest operators is widest; as operator sizes converge, the effect becomes less pronounced. This prevents you from continuing to heavily exit the largest operators when the difference between them and the smallest is not significant.

The downside to a randomized approach is that in some scenario where the biggest operator is much larger than the second, you don’t exit them exponentially more. In this scenario, it would seem reasonable to exit a player more if they are getting disproportionate to the rest of the group. However, given our fixed starting point, this shouldn’t happen unless another parameter is changed.
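The randomized approach can be sketched as follows (hypothetical operator sizes): picking exit validators uniformly at random makes an operator’s expected share of exits proportional to its size.

```python
import random

# Sketch of randomized exits: each exit picks an operator with
# probability proportional to its current validator count, which is
# equivalent to picking a validator uniformly at random.
# Operator counts are hypothetical.
def exit_random(operators, exit_count, seed=0):
    counts = dict(operators)
    rng = random.Random(seed)
    for _ in range(exit_count):
        names = list(counts)
        weights = [counts[k] for k in names]
        picked = rng.choices(names, weights=weights)[0]
        counts[picked] -= 1
    return counts

ops = {"big": 7000, "mid": 4000, "small": 1000}
after = exit_random(ops, int(sum(ops.values()) * 0.10))
# Each operator loses roughly 10% in expectation, instead of the
# largest absorbing the entire cut.
print(after)
```

A nice property of the weighting is that it self-dampens: as an operator shrinks, its weight shrinks too, so the pressure on the largest operators eases as sizes converge.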

Simulating random

Running a simulation using the randomized approach (run only once; I didn’t have time to code up a version that runs thousands of times), with the same 10% exit rate and 30% capital inflow:

Blockscape - Lido 9.3%
P2P - P2P Validator - Lido 9.9%
Staking Facilities - Lido 9.3%
Stakefish - Lido 10.0%
DSRV - Lido 9.5%
Allnodes - Lido 9.4%
ChainLayer - Lido 9.4%
Kiln - Lido 9.9%
RockX - Lido 9.1%
BridgeTower - Lido 10.1%
Chorus One - Lido 9.2%
Figment - Lido 9.0%
Simply Staking - Lido 9.3%
Everstake - Lido 9.4%
InfStones - Lido 9.1%
Stakin - Lido 7.9%

As you can see, the impact on the largest operators is massively reduced. Instead of most large operators taking a 16.8% hit, we see between 9 and 10%.

Meanwhile, the smallest operator in this example ends with 5212 validators, while the largest has 6697. Close enough.

You trade off a lot of the severe pain that would be inflicted on the bigger operators for a slower ramp-up period for the smaller ones. The important thing here is that the extra capital inflow it takes to get the smallest operators to within 20% of the size of the largest is very small, whereas the reduction in impact on the largest operators is very big.

VERY bad financial example

To give an example of how much these revenue cuts could impact an operator, let’s look at a fake example:

Revenue 10m
Variable Costs 6m
Fixed Costs 2m

Profit 2m

If we assume the original 10% exit, then we get a 16.8% revenue cut (with variable costs scaling down proportionally). That leaves us here:

Revenue 8.32m
Variable Costs 4.99m
Fixed Costs 2m

Profit 1.33m

I’ve purposefully put this right at the end of this section because I have little idea of what margins large operators are running at. This is just purely a rough example to show the impact of fixed costs and how a small cut to the top line passes down to a much larger cut to the bottom line.
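The operating-leverage arithmetic behind the fake example above works out like this (all figures hypothetical, taken from the toy numbers):

```python
# Toy operating-leverage calculation: variable costs scale with
# revenue, fixed costs do not, so a small top-line cut becomes a much
# larger bottom-line cut. All figures are hypothetical ($m).
revenue, variable, fixed = 10.0, 6.0, 2.0
cut = 0.168  # 16.8% revenue reduction

new_revenue = revenue * (1 - cut)                   # 8.32
new_variable = variable * (1 - cut)                 # 4.992
profit_before = revenue - variable - fixed          # 2.0
profit_after = new_revenue - new_variable - fixed   # 1.328
profit_cut = 1 - profit_after / profit_before       # ~33.6%
print(f"revenue cut {cut:.1%} -> profit cut {profit_cut:.1%}")
```

So a 16.8% revenue cut roughly doubles into a ~34% profit cut in this toy setup; the higher the fixed-cost share, the worse the amplification.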

With my very thin knowledge of this area, I’d still be very, very surprised if large operators are running at a 20% net profit. With financing costs where they are, the infrastructure necessary to get up and running, bloated staff counts, etc., it seems very unlikely they could surpass 20%. Which is why these exit models are so important.

Should we really be prioritizing network penetration / size?

It’s worth looking at where risk concentrates. There’s a lot of talk about the percent of the network that any given node operator controls, but in reality, this is a smaller concentration of risk than many other factors.

For example, Lido often talks about how its client diversity is superior to the network broadly by quoting the % of validators that use a minority client, versus the most popular client. But this doesn’t consider at all the concentration within those minority clients. Here’s what the data looks like, based on what I pulled last night from all Lido validators.

This data relies on Rated.Network and what consensus clients they believe the validators for each node operator under Lido are using. This is for the last 30 days:

Lighthouse: 46.8%
Prysm: 33.4%
Teku: 17.6%
Nimbus: 1.5%
Lodestar: 0.7%

As you can see, a huge percentage of all validators run on Lighthouse, with plenty also using Prysm. A client bug in either of these would be significant for Lido.

According to the Q3 2022 Lido data, 61.4% of validators are on the cloud, with massive server concentration in Europe and North America. Also, 67% of validators are run by entities in Europe, with roughly 30% in the EU and 37% (eyeballing the graph) in non-EU Europe.

The point I’m trying to make is that concentration of validators among node operators is less of an issue currently, purely from a risk standpoint, than other factors. So, if the goal is to exit validators in a way that reduces the risk for Lido users, doing it by size might not be best.

I also looked at which factors correlate most closely with APR, which is flawed but perhaps useful: validator count has a 0.32 Pearson correlation with APR, with a p-value of 0.084. Probably shouldn’t read much into this, but I spent a bit of time doing it, so you’re going to have to read it. No other factor correlates closely with APR at all, and the data on slashing is too thin to read much into, so APR is likely the best metric right now, unfortunately.

While I think keeping concentration under 1% is a valid and noble goal, it doesn’t help mitigate risk as much as focusing on other factors.

One counter point to using these alternative factors was included in the original post:

This is a fair point, but I wouldn’t describe these as “soft” factors as @arwer13 did, because they represent the bulk of the concentration of risk.

[graph: consensus client usage per node operator]

(Completely off topic, but this graph is not healthy; you would much prefer to see diversification happen at the operator level too, rather than just at the macro level.)

The number of validators an operator runs correlates loosely with the percentage of its validators using Lighthouse. The same is not true for the other clients, where the correlation with operator size is weakly negative.

The end result is that exiting by size (or randomly) will bring down Lighthouse’s dominance among Lido validators, though Prysm will increase a little as well. So that is likely one positive effect of the proposed approaches.

The geographic location of servers and the jurisdiction of each operator is unclear. I’m sure I could find this online, but it would take a long time, and Lido must have the data because it produces quarterly reports (but doesn’t release the raw data; please do in the future).

What are you even saying dude?

Broadly, the point I’m trying to show here is that optimizing for a single goal of reducing the % of the network that any operator runs is highly likely to materially impact all of the other goals. By doing this, you’re making the decision that this goal comes ahead (and likely at the cost of) all the others.

It doesn’t seem that any analysis has been done on how this approach would impact the entire set of goals. I really hope that what I’m writing here is not the deepest analysis that’s been done, and that more has been done internally but not shared widely.

This seems like an important decision and one that’s worth trying to get right.

If there’s no time or willingness to find a more precise solution, random seems better than exiting by size.


Oh wow. I just have to say, that’s one of the most well-formed forum comments I’ve seen. Ever. Thank you.
