StereumLabs: AI-Powered Monitoring & Alerting for Ethereum Node Infrastructure

Hey :waving_hand:

We’re RockLogic - a Lido Curated Set Node Operator, ISO 27001 certified, based in Austria. We’ve been building StereumLabs for a while now, and some of you might know it already. Recently we added an AI layer on top that I think is worth sharing here, because it directly affects how node operators debug, monitor, and secure their infrastructure.

Not a grant proposal. Just want to show what we have, explain why we think it matters, and hear what you think.

What is StereumLabs?

Short version: a neutral observability platform for Ethereum clients. We run all 37 client combinations (6 EL × 6 CL + Erigon/Caplin standalone) on dedicated bare-metal hardware. Isolated environments, own servers, own network, own IPs, multiple datacenters - plus instances on Google Cloud. That means we can compare bare-metal vs cloud performance for every client combination. 20+ dashboards, full historical metrics and logs going back to September 2025, growing every day.

Funded in part by an Ethereum Foundation grant. Research partnership with MigaLabs, who have published findings based on our data. Client teams already have free access.

Full methodology and infrastructure details: docs.stereumlabs.com

What’s new: AI Chatbot

We built an AI chatbot that sits on top of our entire monitoring stack. Instead of clicking through dashboards and writing queries, you just ask:

  • “Compare disk growth between geth and erigon over the last 30 days”
  • “Which consensus client uses the most bandwidth as a supernode?”
  • “How did the Prysm update from v7.1.1 to v7.1.2 affect resource usage?”

And you get a proper analysis back. With numbers, across all EL pairings, in seconds.
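To make the comparison concrete, here is a rough sketch of the manual work one such question replaces: building a 30-day disk-growth comparison between two execution clients from Prometheus range queries. The metric name, label scheme, and sample values are illustrative assumptions, not StereumLabs' actual schema.

```python
# Sketch of the manual PromQL work behind "compare disk growth between
# geth and erigon over the last 30 days". Names and numbers are hypothetical.

def growth_query(client: str, days: int = 30) -> str:
    """Build a PromQL expression for disk-usage growth over a window."""
    return (
        f'increase(node_filesystem_used_bytes{{client="{client}",'
        f'mountpoint="/data"}}[{days}d])'
    )

def growth_gib(first_bytes: float, last_bytes: float) -> float:
    """Growth between the first and last sample of a range query, in GiB."""
    return (last_bytes - first_bytes) / 2**30

# One query per client, then eyeball the results - per pairing, per window:
queries = {c: growth_query(c) for c in ("geth", "erigon")}
samples = {"geth": (1.20e12, 1.35e12), "erigon": (0.90e12, 0.98e12)}
for client, (first, last) in samples.items():
    print(f"{client}: {growth_gib(first, last):.1f} GiB over 30d")
```

Multiply this by every EL pairing and every time window you care about, and the appeal of asking in plain language becomes obvious.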

This is not just a concept - a working proof of concept is running against our live production data. Three examples of what comes out of the platform:

An experienced engineer could put this together manually. The Nimbus analysis alone required correlating Prometheus metrics, consensus-client container logs, and execution-client container logs across 5 nodes over 48 hours. That’s days of work. The AI does it from a few questions.

What’s new: AI Alerting

The chatbot handles on-demand analysis. But when something goes wrong at 3am, you need alerts - and you need them to be useful.

We’re building a two-stage alerting system:

Stage 1: Classic threshold alerts. Attestation rate, disk usage, peer count, missed blocks. These fire in milliseconds. No AI in the loop, no extra latency. If the AI is unavailable for whatever reason, Stage 1 still works.

Stage 2: When an alert fires, the AI analyzes what happened. It pulls the relevant metrics and logs, compares your node data against our neutral baseline from all 37 client combinations, and tells you whether it’s a network-wide event, a client-specific issue, or something local to your environment.

That turns a 3am “something is broken” into “your Geth peer count dropped to 3, likely a network partition on your side, Prysm is healthy, check your firewall rules.”
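The two stages described above can be sketched as follows. This is a minimal illustration of the design, not our implementation: the metric names, thresholds, and the AI call are placeholders, and the key property shown is that Stage 2 failure never suppresses a Stage 1 alert.

```python
# Two-stage alerting sketch: Stage 1 is a plain threshold check with no AI
# in the loop; Stage 2 optionally enriches the alert with AI analysis.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Alert:
    metric: str
    value: float
    threshold: float
    analysis: Optional[str] = None  # filled in by Stage 2, may stay None

def stage1_check(metric: str, value: float, threshold: float) -> Optional[Alert]:
    """Stage 1: fires in milliseconds, no AI, no extra latency."""
    if value < threshold:
        return Alert(metric, value, threshold)
    return None

def stage2_enrich(alert: Alert, analyze: Callable[[Alert], str]) -> Alert:
    """Stage 2: ask the AI for context. If it fails, Stage 1 still stands."""
    try:
        alert.analysis = analyze(alert)
    except Exception:
        alert.analysis = None  # AI unavailable: the raw alert is untouched
    return alert

# Example: peer count drops below 5; Stage 2 adds baseline context.
alert = stage1_check("geth_peer_count", value=3, threshold=5)
if alert:
    alert = stage2_enrich(alert, lambda a: (
        f"{a.metric} at {a.value:g} (threshold {a.threshold:g}); "
        "baseline healthy - likely a local network issue"))
    print(alert.analysis)
```

The ordering matters: the enrichment is strictly additive, so losing the AI degrades alert quality, never alert delivery.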

Security monitoring

The same two-stage architecture applies to security events, and this is something we’ve been developing in depth - partly through work with security auditing partners.

We’ve built a catalog of 94 security-relevant triggers across 24 categories - from OS-level access control and key custody to slashing prevention, MEV/builder integrity, supply chain risks, and financial controls. Every single trigger is mapped to the corresponding ValOS risk IDs, covering 48 ValOS risks across all risk groups (Slashing, Downtime, Hacking, Key Custody, Infrastructure, Financial, Service Partner).

Some concrete examples of what these triggers detect:

  • Fee recipient manipulation. When an execution client restarts and reloads its configuration, reward addresses can be changed. A compromised operator or a malicious insider could redirect staking rewards. Traditional monitoring tells you “Geth restarted.” Our AI tells you “Geth restarted, fee recipient changed from 0xABC to 0xDEF, this change was not triggered through your deployment pipeline, severity: critical.” For operators managing significant ETH, detecting this in minutes instead of days is the difference between a security incident and a financial loss.

  • Slashing prevention. Duplicate validator key usage, anti-slashing DB integrity, failover conflicts (primary comes back online during failover), outdated slashing protection after incident response.

  • SSH access anomalies. Every login is checked: authorized key? Expected source IP? Expected time window? Unusual patterns get flagged with context and severity.

  • Configuration drift and change management. Unauthorized process starts, unexpected port openings, configuration file changes, validator key access patterns, unreviewed deployments. All correlated by the AI with full context from the monitoring stack.

  • Key custody and access control. Validator key access outside of expected processes, privilege escalation attempts, ex-employee access patterns, credential rotation failures.

The goal is continuous security monitoring that covers the gaps between point-in-time audits. We designed the trigger catalog specifically with ValOS alignment in mind - if your operation is assessed against ValOS, our monitoring provides ongoing detection for the risks it describes.

Connect your own infrastructure

Both the AI Chatbot and AI Alerting can be connected to your own nodes. Here’s how the architecture works:

Your raw data never enters RockLogic systems. The AI connects to your metrics endpoint with read-only access. Your data and our neutral StereumLabs baseline only meet inside the AI context of a single query. Nothing is stored, nothing is persisted on our side.
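As a sketch of what "read-only access" means in practice, the connecting side can be restricted to GETs against Prometheus' query API - no remote-write, no admin endpoints. The endpoint URL and bearer token below are placeholders.

```python
# Read-only connection sketch: only GET requests against the Prometheus
# query API. URL and token are placeholders, not real credentials.
from urllib.parse import urlencode
from urllib.request import Request

def readonly_query(base_url: str, promql: str, token: str) -> Request:
    """Build a GET against Prometheus' read-only /api/v1/query endpoint."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    return Request(url, headers={"Authorization": f"Bearer {token}"},
                   method="GET")  # GET only: nothing is written or stored

req = readonly_query("https://metrics.example.internal", "up", "TOKEN")
print(req.get_method(), req.full_url)
```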

On-premise deployment is possible. For operators who need full data sovereignty, the entire stack can run in your environment. The AI accesses the StereumLabs baseline remotely (read-only) for comparison, but your data never leaves your perimeter.

AI reasoning is transparent. Every analysis shows which metrics and data sources were used. The AI doesn’t make infrastructure decisions - it delivers the analysis and its reasoning. The operator validates and decides what to do.

No vendor lock-in. The underlying monitoring is open-source. Your data, dashboards, and alert configurations are yours. Fully exportable.

One thing we want to be upfront about: the AI component processes queries through an external LLM API. The provider does not use API data for training. Enterprise DPA terms are available for operators who require them. We’re happy to walk through the data architecture in detail with anyone who wants to understand exactly how data flows.

Why this could matter for Lido

The Lido validator set keeps growing and diversifying. Curated Module v2 is evolving, CSM has hundreds of operators, the overall number of node operators keeps expanding. That’s great for decentralization but it also means more diversity in how operators run their monitoring - and more variance in how quickly problems get caught.

Curated operators could use this for standardized alerting across the professional set. Same alert rules, same AI analysis, same quality baseline. Faster incident response without everyone reinventing their own monitoring stack. And findings like the cross-EC block building analysis directly affect proposal revenue - knowing that your EC pairing returns empty payloads while others fill blocks to 23M gas is something you want to catch before it costs you money.

CSM and Simple DVT operators would benefit the most from the AI layer. The experience in Simple DVT has shown exactly this: some operators in DVT clusters are less experienced and struggle with troubleshooting when something goes wrong. AI root-cause analysis could help by pointing them to the most common issues immediately, without requiring deep PromQL knowledge or hours of manual dashboard work.

Protocol-level visibility. Uniform monitoring standards across the operator set could give Lido better insights into overall validator health and help catch systemic issues earlier.

We’re curious whether something like this could find a place in the broader Lido tooling landscape, and what shape that would take. Open to ideas.

Current status

| Component | Status | Details |
|---|---|---|
| Dashboards (20+) | :white_check_mark: Live | All 37 client combinations covered |
| AI Chatbot | :white_check_mark: Working PoC | Running against live production data |
| AI Alerting | :white_check_mark: PoC works | Architecture validated, Stage 1 alerts active |
| Security Monitoring | :wrench: Trigger catalog ready | 94 triggers across 24 categories, all mapped to ValOS |
| Connect Your Own Infra | :clipboard: Ready for pilots | Architecture validated, looking for first operators |

About us

  • RockLogic GmbH - Austrian IT infrastructure & blockchain company, Vienna
  • Lido Curated Set Node Operator
  • ISO 27001 certified
  • Multiple Ethereum Foundation Grants
  • Vertically integrated - own servers, own network, own IPs, multiple datacenters
  • General IT infrastructure experience beyond crypto - we also run infra for traditional businesses, which gives us a broader security perspective
  • Stereum - our open-source Ethereum node management tool

What we’re looking for

Honest feedback. Specifically:

  1. Is this useful? Would you actually use AI-powered monitoring and alerting on your nodes?
  2. What’s missing? What monitoring or alerting problems do you face today that nothing solves well?
  3. Interested in trying it? We’re looking for operators who want to connect their infra to the AI Chatbot and/or Alerting for a pilot. Happy to set that up.
  4. How should this fit in? If this is relevant for the Lido ecosystem, what’s the right way to integrate it?

If you want to see it in action, I’m happy to do a live demo. Drop a comment here or reach out on Telegram @stefa2k.
