StereumLabs: AI-Powered Monitoring & Alerting for Ethereum Node Infrastructure

Hey :waving_hand:

We’re RockLogic - a Lido Curated Set Node Operator, ISO 27001 certified, based in Austria. We’ve been building StereumLabs for a while now, and some of you might know it already. Recently we added an AI layer on top that I think is worth sharing here, because it directly affects how node operators debug, monitor, and secure their infrastructure.

Not a grant proposal. Just want to show what we have, explain why we think it matters, and hear what you think.

What is StereumLabs?

Short version: a neutral observability platform for Ethereum clients. We run all 37 client combinations (6 EL × 6 CL + Erigon/Caplin standalone) on dedicated bare-metal hardware. Isolated environments, own servers, own network, own IPs, multiple datacenters - plus instances on Google Cloud. That means we can compare bare-metal vs cloud performance for every client combination. 20+ dashboards, full historical metrics and logs going back to September 2025, growing every day.

Funded in part by an Ethereum Foundation grant. Research partnership with MigaLabs, who have published findings based on our data. Client teams already have free access.

Full methodology and infrastructure details: docs.stereumlabs.com

What’s new: AI Chatbot

We built an AI chatbot that sits on top of our entire monitoring stack. Instead of clicking through dashboards and writing queries, you just ask:

  • “Compare disk growth between geth and erigon over the last 30 days”
  • “Which consensus client uses the most bandwidth as a supernode?”
  • “How did the Prysm update from v7.1.1 to v7.1.2 affect resource usage?”

And you get a proper analysis back. With numbers, across all EL pairings, in seconds.
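To make the comparison concrete, here is a rough sketch of the manual work one such question replaces: building a 30-day disk-growth comparison between two execution clients from Prometheus range queries. The metric name, label scheme, and sample values are illustrative assumptions, not StereumLabs' actual schema.

```python
# Sketch of the manual PromQL work behind "compare disk growth between
# geth and erigon over the last 30 days". Names and numbers are hypothetical.

def growth_query(client: str, days: int = 30) -> str:
    """Build a PromQL expression for disk-usage growth over a window."""
    return (
        f'increase(node_filesystem_used_bytes{{client="{client}",'
        f'mountpoint="/data"}}[{days}d])'
    )

def growth_gib(first_bytes: float, last_bytes: float) -> float:
    """Growth between the first and last sample of a range query, in GiB."""
    return (last_bytes - first_bytes) / 2**30

# One query per client, then eyeball the results - per pairing, per window:
queries = {c: growth_query(c) for c in ("geth", "erigon")}
samples = {"geth": (1.20e12, 1.35e12), "erigon": (0.90e12, 0.98e12)}
for client, (first, last) in samples.items():
    print(f"{client}: {growth_gib(first, last):.1f} GiB over 30d")
```

Multiply this by every EL pairing and every time window you care about, and the appeal of asking in plain language becomes obvious.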

This is not just a concept - a working proof of concept is running against our live production data. Three examples of what comes out of the platform:

An experienced engineer could put this together manually. The Nimbus analysis alone required correlating Prometheus metrics, consensus-client container logs, and execution-client container logs across 5 nodes over 48 hours. That’s days of work. The AI does it from a few questions.

What’s new: AI Alerting

The chatbot handles on-demand analysis. But when something goes wrong at 3am, you need alerts - and you need them to be useful.

We’re building a two-stage alerting system:

Stage 1: Classic threshold alerts. Attestation rate, disk usage, peer count, missed blocks. These fire in milliseconds. No AI in the loop, no extra latency. If the AI is unavailable for whatever reason, Stage 1 still works.

Stage 2: When an alert fires, the AI analyzes what happened. It pulls the relevant metrics and logs, compares your node data against our neutral baseline from all 37 client combinations, and tells you whether it’s a network-wide event, a client-specific issue, or something local to your environment.

That turns a 3am “something is broken” into “your Geth peer count dropped to 3, likely a network partition on your side, Prysm is healthy, check your firewall rules.”
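The two stages described above can be sketched as follows. This is a minimal illustration of the design, not our implementation: the metric names, thresholds, and the AI call are placeholders, and the key property shown is that Stage 2 failure never suppresses a Stage 1 alert.

```python
# Two-stage alerting sketch: Stage 1 is a plain threshold check with no AI
# in the loop; Stage 2 optionally enriches the alert with AI analysis.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Alert:
    metric: str
    value: float
    threshold: float
    analysis: Optional[str] = None  # filled in by Stage 2, may stay None

def stage1_check(metric: str, value: float, threshold: float) -> Optional[Alert]:
    """Stage 1: fires in milliseconds, no AI, no extra latency."""
    if value < threshold:
        return Alert(metric, value, threshold)
    return None

def stage2_enrich(alert: Alert, analyze: Callable[[Alert], str]) -> Alert:
    """Stage 2: ask the AI for context. If it fails, Stage 1 still stands."""
    try:
        alert.analysis = analyze(alert)
    except Exception:
        alert.analysis = None  # AI unavailable: the raw alert is untouched
    return alert

# Example: peer count drops below 5; Stage 2 adds baseline context.
alert = stage1_check("geth_peer_count", value=3, threshold=5)
if alert:
    alert = stage2_enrich(alert, lambda a: (
        f"{a.metric} at {a.value:g} (threshold {a.threshold:g}); "
        "baseline healthy - likely a local network issue"))
    print(alert.analysis)
```

The ordering matters: the enrichment is strictly additive, so losing the AI degrades alert quality, never alert delivery.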

Security monitoring

The same two-stage architecture applies to security events, and this is something we’ve been developing in depth - partly through work with security auditing partners.

We’ve built a catalog of 94 security-relevant triggers across 24 categories - from OS-level access control and key custody to slashing prevention, MEV/builder integrity, supply chain risks, and financial controls. Every single trigger is mapped to the corresponding ValOS risk IDs, covering 48 ValOS risks across all risk groups (Slashing, Downtime, Hacking, Key Custody, Infrastructure, Financial, Service Partner).

Some concrete examples of what these triggers detect:

  • Fee recipient manipulation. When an execution client restarts and reloads its configuration, reward addresses can be changed. A compromised operator or a malicious insider could redirect staking rewards. Traditional monitoring tells you “Geth restarted.” Our AI tells you “Geth restarted, fee recipient changed from 0xABC to 0xDEF, this change was not triggered through your deployment pipeline, severity: critical.” For operators managing significant ETH, detecting this in minutes instead of days is the difference between a security incident and a financial loss.

  • Slashing prevention. Duplicate validator key usage, anti-slashing DB integrity, failover conflicts (primary comes back online during failover), outdated slashing protection after incident response.

  • SSH access anomalies. Every login is checked: authorized key? Expected source IP? Expected time window? Unusual patterns get flagged with context and severity.

  • Configuration drift and change management. Unauthorized process starts, unexpected port openings, configuration file changes, validator key access patterns, unreviewed deployments. All correlated by the AI with full context from the monitoring stack.

  • Key custody and access control. Validator key access outside of expected processes, privilege escalation attempts, ex-employee access patterns, credential rotation failures.

The goal is continuous security monitoring that covers the gaps between point-in-time audits. We designed the trigger catalog specifically with ValOS alignment in mind - if your operation is assessed against ValOS, our monitoring provides ongoing detection for the risks it describes.

Connect your own infrastructure

Both the AI Chatbot and AI Alerting can be connected to your own nodes. Here’s how the architecture works:

Your raw data never enters RockLogic systems. The AI connects to your metrics endpoint with read-only access. Your data and our neutral StereumLabs baseline only meet inside the AI context of a single query. Nothing is stored, nothing is persisted on our side.
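As a sketch of what "read-only access" means in practice, the connecting side can be restricted to GETs against Prometheus' query API - no remote-write, no admin endpoints. The endpoint URL and bearer token below are placeholders.

```python
# Read-only connection sketch: only GET requests against the Prometheus
# query API. URL and token are placeholders, not real credentials.
from urllib.parse import urlencode
from urllib.request import Request

def readonly_query(base_url: str, promql: str, token: str) -> Request:
    """Build a GET against Prometheus' read-only /api/v1/query endpoint."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    return Request(url, headers={"Authorization": f"Bearer {token}"},
                   method="GET")  # GET only: nothing is written or stored

req = readonly_query("https://metrics.example.internal", "up", "TOKEN")
print(req.get_method(), req.full_url)
```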

On-premise deployment is possible. For operators who need full data sovereignty, the entire stack can run in your environment. The AI accesses the StereumLabs baseline remotely (read-only) for comparison, but your data never leaves your perimeter.

AI reasoning is transparent. Every analysis shows which metrics and data sources were used. The AI doesn’t make infrastructure decisions - it delivers the analysis and its reasoning. The operator validates and decides what to do.

No vendor lock-in. The underlying monitoring is open-source. Your data, dashboards, and alert configurations are yours. Fully exportable.

One thing we want to be upfront about: the AI component processes queries through an external LLM API. The provider does not use API data for training. Enterprise DPA terms are available for operators who require them. We’re happy to walk through the data architecture in detail with anyone who wants to understand exactly how data flows.

Why this could matter for Lido

The Lido validator set keeps growing and diversifying. Curated Module v2 is evolving, CSM has hundreds of operators, the overall number of node operators keeps expanding. That’s great for decentralization but it also means more diversity in how operators run their monitoring - and more variance in how quickly problems get caught.

Curated operators could use this for standardized alerting across the professional set. Same alert rules, same AI analysis, same quality baseline. Faster incident response without everyone reinventing their own monitoring stack. And findings like the cross-EC block building analysis directly affect proposal revenue - knowing that your EC pairing returns empty payloads while others fill blocks to 23M gas is something you want to catch before it costs you money.

CSM and Simple DVT operators would benefit the most from the AI layer. The experience in Simple DVT has shown exactly this: some operators in DVT clusters are less experienced and struggle with troubleshooting when something goes wrong. AI root-cause analysis could help by pointing them to the most common issues immediately, without requiring deep PromQL knowledge or hours of manual dashboard work.

Protocol-level visibility. Uniform monitoring standards across the operator set could give Lido better insights into overall validator health and help catch systemic issues earlier.

We’re curious whether something like this could find a place in the broader Lido tooling landscape, and what shape that would take. Open to ideas.

Current status

| Component | Status | Details |
|---|---|---|
| Dashboards (20+) | :white_check_mark: Live | All 37 client combinations covered |
| AI Chatbot | :white_check_mark: Working PoC | Running against live production data |
| AI Alerting | :white_check_mark: PoC works | Architecture validated, Stage 1 alerts active |
| Security Monitoring | :wrench: Trigger catalog ready | 94 triggers across 24 categories, all mapped to ValOS |
| Connect Your Own Infra | :clipboard: Ready for pilots | Architecture validated, looking for first operators |

About us

  • RockLogic GmbH - Austrian IT infrastructure & blockchain company, Vienna
  • Lido Curated Set Node Operator
  • ISO 27001 certified
  • Multiple Ethereum Foundation Grants
  • Vertically integrated - own servers, own network, own IPs, multiple datacenters
  • General IT infrastructure experience beyond crypto - we also run infra for traditional businesses, which gives us a broader security perspective
  • Stereum - our open-source Ethereum node management tool

What we’re looking for

Honest feedback. Specifically:

  1. Is this useful? Would you actually use AI-powered monitoring and alerting on your nodes?
  2. What’s missing? What monitoring or alerting problems do you face today that nothing solves well?
  3. Interested in trying it? We’re looking for operators who want to connect their infra to the AI Chatbot and/or Alerting for a pilot. Happy to set that up.
  4. How should this fit in? If this is relevant for the Lido ecosystem, what’s the right way to integrate it?

If you want to see it in action, I’m happy to do a live demo. Drop a comment here or reach out on Telegram @stefa2k.
