Prysm gRPC update

Events

On Tuesday, June 27th at about 01:00 UTC, Blocknative internal monitoring alerts signaled an increase in error rate with our Prysm beacon nodes and a reduction in block delivery. At that time our Lighthouse client took over.

Our team then restarted the Prysm nodes twice (at 08:30 UTC and again at 09:38 UTC) to no effect. At 10:45 UTC, the team deployed a fix for Prysm crash recovery, which then revealed a large number of error messages (code: 429 Too Many Requests). Code inspection of the Flashbots and mainline Prysm branches did not reveal a reason for this rate limit error.

At around 13:00 UTC, we reproduced the rate limit error manually and noted a {"message": "grpc: received message larger than max (100004474 vs. 100000000)","code": 429} response. From this we determined that Prysm could not communicate properly with our relay: growth in the global validator set and in registered validators had pushed a gRPC message past the size limit. Prysm uses gRPC internally for communication and enforces a hard limit on message size, here 100,000,000 bytes.

At 13:11 UTC the Prysm gRPC message size limit was increased by 10x, which resolved the errors, and the beacon nodes returned to normal behavior.
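The numbers in the error response make the failure easy to check by hand. Below is a minimal Python sketch of that comparison. The constant and function names are illustrative and are not Prysm code; the two byte counts are taken from the error response quoted above, and the new limit is simply the old one multiplied by 10.

```python
# Values taken from the quoted error response; all names are illustrative.
OLD_LIMIT_BYTES = 100_000_000            # "max" reported by gRPC
MESSAGE_BYTES = 100_004_474              # size of the rejected message
NEW_LIMIT_BYTES = OLD_LIMIT_BYTES * 10   # limit after the 10x increase


def exceeds_limit(message_bytes: int, limit_bytes: int) -> bool:
    """Return True when a message is larger than the receive limit."""
    return message_bytes > limit_bytes


# How far over the old limit the rejected message was, in bytes.
overage = MESSAGE_BYTES - OLD_LIMIT_BYTES

print(exceeds_limit(MESSAGE_BYTES, OLD_LIMIT_BYTES))  # rejected under the old limit
print(exceeds_limit(MESSAGE_BYTES, NEW_LIMIT_BYTES))  # accepted under the new limit
print(overage)
```

The message was only a few kilobytes over the old limit, which is consistent with the limit being crossed gradually as the validator set grew rather than by a sudden jump in message size.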

Several factors contributed to this event and its resolution time:

  1. Prysm gRPC hard limit (root cause)
  2. Difficulty in reproducing the error in a way that allowed deep inspection of error messages
  3. Prysm erroring with a 429 error, implying a rate limit issue rather than an internal server failure (e.g. a message size issue)
  4. Recent change to mev-boost that requests blocks from all connected relays whether or not the relay knows about the block, making the cause of getPayload request failures more difficult to determine. These “not found” payload requests are now a regular part of the network and no longer clearly indicate a relay issue.

Item 4 is particularly challenging because we can no longer use getPayload error rate as an indicator to pause the relay.

We are continuing to investigate why the relay was not 100% functional with a working Lighthouse beacon.

Impact

The Prysm gRPC hard limit left Prysm unable to communicate with the relay, leading to 28 missed slots during the incident. Once the gRPC changes were implemented, error rates returned to normal.

Improvements

  • Increased gRPC message size limit for Prysm client by 10x to account for larger validator set
  • Investigating better Lighthouse-only beacon support when Prysm is in a bad state
  • Developing a more nuanced error evaluation to trigger automated relay pause upon complex beacon failures
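
As a rough illustration of what a more nuanced error evaluation could look like, here is a hypothetical Python sketch, not the relay's actual logic. It excludes the now-routine "not found" getPayload responses before computing an error rate. The error kind labels, the function name, and the 5% threshold are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical error labels; the real relay's categories may differ.
# "payload_not_found" is ignored because, since the mev-boost change,
# these responses no longer clearly indicate a relay issue.
IGNORED_KINDS = {"payload_not_found"}


def should_pause_relay(error_kinds, total_requests, threshold=0.05):
    """Return True when the rate of unexpected errors exceeds the threshold.

    error_kinds: iterable of error labels seen in the evaluation window.
    total_requests: number of requests handled in the same window.
    threshold: assumed cutoff (5% here), chosen only for illustration.
    """
    if total_requests == 0:
        return False
    counts = Counter(error_kinds)
    unexpected = sum(n for kind, n in counts.items() if kind not in IGNORED_KINDS)
    return unexpected / total_requests > threshold


# A window dominated by "not found" responses does not trigger a pause.
print(should_pause_relay(["payload_not_found"] * 50, 100))
# Beacon errors above the threshold do.
print(should_pause_relay(["beacon_429"] * 10 + ["payload_not_found"] * 50, 100))
```

A production version would likely also weigh beacon node health and missed slots, which this sketch leaves out.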

Possibly related: issue #11983 in prysmaticlabs/prysm on GitHub, "Increase default grpc max message size", and the associated pull request #12072 by prestonvanloon, "Raise the max grpc message size to a very large value by default".


Maybe I am missing something, but as Jim mentioned, we bumped up the max message size limit a while back, so using Prysm with the default settings shouldn’t be an issue. Was Blocknative using the grpc max message flag and setting a custom value?
