Prysm GRPC update

ldelisle_blocknative · July 5, 2023, 7:04pm

Events

On Tuesday, June 27th at about 01:00 UTC, Blocknative internal monitoring alerts signaled an increase in error rate with our Prysm beacon nodes and a reduction in block delivery. At that time our Lighthouse client took over.

Our team then restarted the Prysm nodes twice (at 08:30 UTC and again at 09:38 UTC) to no effect. At 10:45 UTC, the team deployed a fix for Prysm crash recovery, which then revealed a large number of error messages (code: 429 Too Many Requests). Code inspection of the Flashbots and mainline Prysm branches did not reveal a reason for this rate limit error.

At around 13:00 UTC, we reproduced the rate limit error manually and noted a {"message": "grpc: received message larger than max (100004474 vs. 100000000)","code": 429} response. From this we determined that Prysm was unable to communicate properly with our relay due to the increase in the global validator set and registered validators. Prysm uses GRPC internally for communication and includes a hard limit on the size of messages.
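
As a rough illustration of the numbers in that response, the sketch below parses the error body and computes how far the message overshot the limit. The helper name `parse_grpc_size_error` is hypothetical and is not part of Prysm or the relay; only the JSON body is taken from the incident.

```python
import json
import re

# Error body as observed during the incident.
BODY = '{"message": "grpc: received message larger than max (100004474 vs. 100000000)","code": 429}'


def parse_grpc_size_error(body: str):
    """Hypothetical helper: return (received_bytes, limit_bytes), or None for other errors."""
    payload = json.loads(body)
    match = re.search(r"larger than max \((\d+) vs\. (\d+)\)", payload.get("message", ""))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


received, limit = parse_grpc_size_error(BODY)
overshoot = received - limit
print(f"received={received} limit={limit} overshoot={overshoot} bytes")
```

The message exceeded the limit by only 4,474 bytes, which is consistent with a payload that grows gradually with the validator set until it crosses a fixed threshold.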

At 13:11 UTC the Prysm GRPC message size limit was increased by 10x, which resolved the errors and the beacons behaved normally.

Several factors contributed to this event and its resolution time:

1. Prysm GRPC hard limit (root cause)
2. Difficulty in reproducing the error in a way that allowed deep inspection of error messages
3. Prysm erroring with a 429 error, implying a rate limit issue rather than an internal server failure (e.g. a message size issue)
4. Recent change to mev-boost that requests blocks from all connected relays whether or not the relay knows about the block, making the cause of getPayload request failures more difficult to determine. These “not found” payload requests are now a regular part of the network and no longer clearly indicate a relay issue.

Item 4 is particularly challenging because we can no longer use getPayload error rate as an indicator to pause the relay.
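
One way to think about this difficulty is to separate the expected "not found" failures from the rest before computing an error rate. The sketch below is a hypothetical illustration, not Blocknative's monitoring code: the `PayloadResult` type, its `not_found` flag, and the 5% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PayloadResult:
    """Hypothetical record of one getPayload request outcome."""
    ok: bool
    not_found: bool = False  # the relay did not know the requested block


def unexpected_error_rate(results):
    """Share of requests that failed for a reason other than 'not found'."""
    # Ignore 'not found' responses entirely: they are routine and say
    # little about relay health.
    relevant = [r for r in results if not r.not_found]
    if not relevant:
        return 0.0
    failures = sum(1 for r in relevant if not r.ok)
    return failures / len(relevant)


PAUSE_THRESHOLD = 0.05  # assumed value, chosen only for illustration


def should_pause(results):
    """Suggest pausing the relay when unexpected failures exceed the threshold."""
    return unexpected_error_rate(results) > PAUSE_THRESHOLD


# Many 'not found' failures alone do not trigger a pause.
routine = [PayloadResult(ok=False, not_found=True)] * 90 + [PayloadResult(ok=True)] * 10
# A smaller number of other failures does.
degraded = [PayloadResult(ok=True)] * 8 + [PayloadResult(ok=False)] * 2
print(should_pause(routine), should_pause(degraded))
```

This only works if the relay can reliably tell a "not found" response apart from other failures, which is itself an assumption here.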

We are continuing to investigate why the relay was not 100% functional with a working Lighthouse beacon.

Impact

The Prysm GRPC hard limit left Prysm unable to communicate with the relay, leading to 28 missed slots during the incident. Once the GRPC limit was raised, rates returned to normal.

Improvements

Increased the GRPC message size limit for the Prysm client by 10x to account for the larger validator set
Investigating better Lighthouse-only beacon support when Prysm is in a bad state
Developing a more nuanced error evaluation to trigger an automated relay pause upon complex beacon failures

jgm · July 5, 2023, 8:17pm

Possibly related: "Increase default grpc max message size" (Issue #11983, prysmaticlabs/prysm) and the associated PR "Raise the max grpc message size to a very large value by default" by prestonvanloon (Pull Request #12072, prysmaticlabs/prysm).

Nishant_Das · July 10, 2023, 3:15am

Maybe I am missing something, but as Jim mentioned, we bumped up the max message size limit a while back, so using Prysm with the default settings shouldn’t be an issue. Was Blocknative using the grpc max message flag and setting a custom value?
