Yes I agree, there is some confusion there with regards to what we mean by “validator location”, which we’ve tried to address in the FAQs in the docs. Essentially we’re looking specifically for the country (NOT the actual location) of the Validator Client itself (e.g. Vouch, as opposed to a beacon node like Teku or Lighthouse). While there are other ways to try to gather this data, we think this is the most non-intrusive and safest way to do it, and it will reasonably test the assumption that the distribution of validator clients largely tracks the distribution of beacon chain nodes (which it may or may not).
I think starting on a voluntary basis is definitely the best way forward, and it would be great to see some support from some of the NOs and see some geo-tags appearing!
Glad to see this discussion revived from the original ethresearch thread!
I think that’s the right approach to start off with, especially focusing on collecting this data in a responsible, non-intrusive way. One thing that popped into my head that we might want to keep in mind is sample bias: are the folks who are most likely to respond to a call for self-disclosure (for some reason) more likely to be located in specific regions? Maybe it’s a non-concern for the initial sampling.
Might be more manageable from this POV to start off with the Lido NO set, take learnings from that exercise on methodology (with disclaimers on potential risks of drawing broad conclusions from the dataset), and then gradually expand to other staking communities outside of Lido?
Much appreciated, thank you! It would be great to get your perspective on how this approach might complement DVT, or whether you think it’s simply not applicable in the DVT context.
One thing I’m starting to realise from the feedback I’m getting is that node operators are hesitant to reveal any more data than is necessary about their location, even at the country level. One idea I had that might assuage any fears operators might have is to introduce some sort of simple differential privacy using randomised responses. If we ask operators to flip a coin, report truthfully on heads, and pick a random value on tails, we would still be able to get a reasonably accurate idea of which countries validators are located in without doxxing anyone. The actual probability may need to be different from 50/50 but you get the idea.
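For illustration, here’s a minimal sketch of how the tallying could work (Python; the country list and the 50/50 coin are placeholder assumptions):

```python
import random

COUNTRIES = ["US", "DE", "SG", "FI"]  # placeholder list of possible answers

def randomized_response(true_country: str, p_truth: float = 0.5) -> str:
    """Answer truthfully with probability p_truth, otherwise
    report a country drawn uniformly at random."""
    if random.random() < p_truth:
        return true_country
    return random.choice(COUNTRIES)

def estimate_counts(reports: list[str], p_truth: float = 0.5) -> dict[str, float]:
    """Invert the noise: observed ~= p_truth * true + (1 - p_truth) * n / k,
    so true ~= (observed - (1 - p_truth) * n / k) / p_truth."""
    n, k = len(reports), len(COUNTRIES)
    noise_per_country = (1 - p_truth) * n / k
    return {c: (reports.count(c) - noise_per_country) / p_truth
            for c in COUNTRIES}
```

With enough respondents the per-country estimates converge on the true counts, while no individual response proves anything about the operator who gave it.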
Nice idea @orbmis, geographical diversity is important indeed.
With the crawler we have at MigaLabs, we can get the country of most beacon nodes we connect with, but as @Izzy mentioned that does not tell us whether that node has validators connected to it or not. However, we also have other extra information. For instance, beacon nodes also share which attestation channels they are interested in. If you are a beacon node without validators, most likely you will avoid the extra work of listening and forwarding messages on more channels than the ones you really need. Beacon nodes attached to validators do need to listen to the attestation networks required to carry out their respective duties. So if we cross these two datasets, we can already have an idea of the geographical distribution of validators.
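For anyone wondering what that looks like concretely: beacon nodes advertise an `attnets` bitfield in their ENR (a 64-bit vector, one bit per attestation subnet), so a crawler can do something like the following rough Python sketch, assuming you already have the raw 8-byte field:

```python
def subscribed_subnets(attnets: bytes) -> list[int]:
    """Decode the 8-byte 'attnets' ENR field (an SSZ Bitvector[64]) into
    the attestation subnet IDs the node is subscribed to."""
    return [i for i in range(64) if (attnets[i // 8] >> (i % 8)) & 1]

def likely_has_validators(attnets: bytes) -> bool:
    """A node subscribed to at least one subnet plausibly hosts validators;
    one subscribed to none almost certainly does not."""
    return len(subscribed_subnets(attnets)) > 0
```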
In addition to this, we can also look at things like whether the node is a cloud node or a residential node, which we can also derive with a relatively high degree of confidence.
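To make the hosting classification concrete, the usual trick is an ASN lookup on the node’s IP. A toy sketch (the ASN table here is a tiny illustrative sample; a real implementation would query a full GeoIP/ASN database):

```python
# Well-known hosting/cloud ASNs (illustrative sample, not exhaustive).
CLOUD_ASNS = {
    16509: "AWS", 14618: "Amazon", 15169: "Google",
    8075: "Microsoft/Azure", 24940: "Hetzner", 16276: "OVH",
}

def classify_hosting(asn: int) -> str:
    """Rough cloud/DC vs residential classification by origin ASN."""
    provider = CLOUD_ASNS.get(asn)
    return f"cloud/DC ({provider})" if provider else "residential or other"
```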
What I would suggest is to implement both strategies in parallel and keep them independent, and then when we get some first results compare both distributions. I would be happy to have my team working on this.
I really like the idea of trying to track the beacon nodes that share the attestation channels they’re interested in. Do you think that would be data you would be interested in publishing on your dashboard? (the Miga Labs dashboard is very cool by the way!).
I also think that trying to measure the ratio of nodes in data centres as opposed to other locations is important, because it gives an idea of the jurisdictional distribution as well as the purely geographical distribution. This would be similar to what Ethernodes do for execution clients.
I think implementing both strategies is a terrific idea. I have a beacon chain graffiti scraper running, so if even some validators participate, we can treat the responses as a sample with a standard confidence interval and compare it against the other datasets.
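Something like the standard normal-approximation interval for a sample proportion should do, e.g. (Python sketch; the counts in the example are made up):

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation confidence interval for a sample proportion,
    e.g. the share of graffiti-tagged validators reporting a given country."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# e.g. 120 of 800 tagged validators reporting "DE" (made-up numbers):
low, high = proportion_ci(120, 800)  # roughly (0.125, 0.175)
```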
And within DC deployments, I wonder if it’s possible to delineate between managed public cloud and bare metal? I do think the distinction is meaningful here.
A great discussion, and I just want to point to another discussion that I started that you might want to look at: Execution & Consensus Client Bootnodes - Node Operators - Lido Governance. One thing I would like to point out is that the graffiti field, used on a voluntary basis, is open to psyops strategies: e.g. a cartel of validators wants to provoke the regulators of a certain country X and intentionally always puts country X in the geo-tag. Furthermore, any IP addresses or location information may have been obfuscated beforehand. So the reliability of any such data is ultimately bounded by the willingness of operators to be honest.
Just a note on bare metal: the problem with cloud solutions is that most of them are running under US law, so even if we diversify across cloud providers for the bootnodes, a single point of failure remains: the possibility of US enforcement. That’s why I very much support bare-metal solutions.
Those are very good points. I hadn’t thought about a cartel of validators wanting to provoke the regulators in a certain jurisdiction. Do you think this is something that’s likely? My idea is predicated on the presumption that node operators don’t have any plausible incentive to be dishonest. However, what this discussion has surfaced is that there seems to be more appetite for making it difficult to ascertain where validator clients are (maybe to the point that it doesn’t matter where they are located or concentrated, if nobody can effectively conclude or prove where they are).
I think the conversation you’ve started on bootnodes is hugely important, and I think it’s great that you’re making it visible. I’ll do my best to help out in that regard.
With regards to bare metal servers, correct me if I’m wrong, but I think it’s probably worth pointing out that we’re talking about bare metal servers outside of large cloud data centres (i.e. as opposed to provisioning a bare metal server on AWS, etc.).
I don’t know the probability of such a scenario, but what I do know is that having such a psyops/manipulation option doesn’t make that approach future-proof, since there may be unknown unknowns we’re not aware of right now. I’d also like to recall Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.” So if validator locations become a target, the measure ceases to be a good measure, since it can be manipulated.
Example: One (possible) scenario is that Gary Gensler declares Ethereum a security, Bitcoin maxis acquire a large amount of ETH (they might sacrifice their sats for this attack ;-)), and become a large pool of validators. In order to increase the regulatory scrutiny of the SEC, they spam the graffiti with the US country tag. SEC gets nervous and declares staking and becoming a validator as illegal except if you have a specific security dealer’s license (or whatever specific license is required).
Thanks a lot for your support. Highly appreciated.
Well, this can mean both: making use of bare metal servers from local DCs/cloud providers, or actually maintaining your own bare metal servers (a.k.a. home staking and/or home-hosted EL/CL clients). The latter requires a certain degree of hardware and software competence and thus is not suited to everyone.
I agree that the self-reporting approach may not be the most future-proof, and leobago’s suggestion may be a better approach long term. However, I do think that the distribution of validators is something that should be measured. According to beaconcha.in there are 561,655 active validators, and according to Miga Labs there are 11,499 nodes. While I see the argument that if it’s impossible to ascertain where those validators are, it shouldn’t matter where they are, I think it’s dangerous to rely too heavily on that assumption. I still maintain that we would benefit from understanding whether there is a significant concentration of validators in specific jurisdictions, and taking steps to remediate if need be.
My team and I have been working on this idea and we just released our new dashboards today: https://monitoreth.io/
You can see two options there: one showing data about the number of beacon nodes in the network, and another showing the number of validator nodes. For the validator nodes, we filter for beacon nodes that are registered to at least one attestation network, which should imply that they are running at least one validator. Nodes that are not subscribed to any attestation network cannot have validators.
Hey @leobago - this is a really interesting approach, excellent work! From the data I see there are about 2,080 nodes registered to at least one attestation network, and a total of 5,554 attestation network subscriptions (with 645,924 active validators, that works out at about 116 validators per attestation subnet subscription - I don’t know if that sounds right, I’ll try to double check). The distribution is very interesting, with the vast majority of validator nodes subscribing to a single subnet, and the second largest cohort subscribing to all 64 subnets. I would assume those are the two biggest groups of validators, i.e. solo stakers and Lido respectively.
Out of curiosity: anyone could potentially use this method to track the IP addresses of nodes that have validator clients attached, correct?
According to our data, there are about ~11.9K beacon nodes in the network, of which ~5.5K are subscribed to at least one attestation network (i.e., validators). Of these 5.5K, about ~1.9K are subscribed to only one attestation network, and about ~1.1K to all 64 attestation networks (64 being the total number that exist). So I would say there are about ~3K nodes running as solo stakers (fewer than 5 validators) and about ~1K nodes that belong to large institutional staking operators (e.g., Lido, etc.).
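In pseudocode terms, the cohort split we’re applying is roughly the following (the thresholds are our working assumptions, not anything definitive):

```python
def classify_node(n_subnets: int) -> str:
    """Rough cohort classification from persistent attestation-subnet
    subscriptions, per the breakdown above."""
    if n_subnets == 0:
        return "no validators attached"
    if n_subnets < 64:
        return "likely solo / small staker"
    return "likely large operator (subscribed to all 64 subnets)"
```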
To answer your other question: yes, anyone can use this method to track the IP addresses of all the nodes registered to an attestation network (which does not necessarily mean all of them have validators), but you cannot use this method to track the IP address of any specific validator. In other words, we just have the list of nodes operating in the network and some partial information about them, but we don’t know who is who. Does this make sense?
Yes, that makes sense, and in fact it’s very reassuring to hear. The results are really informative. It’s great to get that broad breakdown of the number of nodes attached to validators and the ratio of solo staker nodes to institutional node operators.
Do you think we can then get a general idea of where those nodes are distributed, and whether it follows the larger geographical distribution of nodes? I.e. of the ~3K solo staker nodes, are they widely dispersed? Are institutional node operators largely located in the EU/US, or have Lido’s node operator policies had the desired effect of encouraging a more diverse geographical distribution?
Thanks for the feedback. We have not looked at the geographical breakdown for solo stakers vs institutional stakers, but we can definitely do it.