feat(evmrpc): migrate RPC telemetry to OpenTelemetry Meter API#3265

Open
amir-deris wants to merge 9 commits into main from amir/plt-264-migrate-evmrpc-to-standardized-otel

Conversation

@amir-deris amir-deris commented Apr 16, 2026

Summary

Follow-up to #3253, rebased to use the standardized OpenTelemetry Meter API per reviewer feedback (@masih: "standardise on OTEL and use runtime bindings to report to prometheus instead of direct metrics reporting via prometheus client").

  • Replaces utils/metrics calls in evmrpc (filter.go, utils.go, websockets.go) with a new rpc_telemetry.go layer using the OTel metric.Meter API (otel.GetMeterProvider())
  • Removes IncrementRpcRequestCounter, MeasureRpcRequestLatency, and IncWebsocketConnects from utils/metrics — evmrpc was the sole caller
  • Removes num_blocks_fetched batch counters from internal processBatch call sites — these were unused and not wired to any dashboard or alert
  • Adds success label to latency histogram so failed requests can be tracked separately (requested in the #3253 review, "Migrate evmrpc telemetry from utils/metrics to direct Prometheus client")
  • Adds a test for the new telemetry layer

New metric names (OTel naming convention, exported via the process-wide MeterProvider):

  • evmrpc_rpc_requests_total — request throughput (labels: endpoint, connection, success)
  • evmrpc_rpc_request_latency_ms — request latency histogram (labels: endpoint, connection, success)
  • evmrpc_websocket_connects_total — websocket connection count

Note

Metric names have changed from the legacy sei_rpc_request_counter / sei_rpc_request_latency_ms / sei_websocket_connect series used in #3253. Existing dashboards in sei-infra (sei_node_monitoring, launch_warroom) will need updating.


github-actions bot commented Apr 16, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Apr 20, 2026, 5:55 PM |

amir-deris changed the title from "used otel for evmrpc, added test" to "feat(evmrpc): migrate RPC telemetry to OpenTelemetry Meter API" on Apr 16, 2026

codecov bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 89.34426% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.30%. Comparing base (9225538) to head (cc4e18d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| evmrpc/send.go | 25.00% | 3 Missing ⚠️ |
| evmrpc/utils.go | 70.00% | 2 Missing and 1 partial ⚠️ |
| evmrpc/metrics.go | 94.28% | 1 Missing and 1 partial ⚠️ |
| evmrpc/net.go | 0.00% | 2 Missing ⚠️ |
| evmrpc/state.go | 66.66% | 2 Missing ⚠️ |
| evmrpc/association.go | 85.71% | 1 Missing ⚠️ |
Additional details and impacted files

Impacted file tree graph

```
@@            Coverage Diff             @@
##             main    #3265      +/-   ##
==========================================
- Coverage   59.30%   59.30%   -0.01%
==========================================
  Files        2071     2072       +1
  Lines      169814   169817       +3
==========================================
+ Hits       100707   100708       +1
- Misses      60333    60334       +1
- Partials     8774     8775       +1
```
| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 64.27% <89.34%> (?) |
| sei-db | 70.41% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| evmrpc/block.go | 82.15% <100.00%> (ø) |
| evmrpc/filter.go | 69.39% <100.00%> (-0.28%) ⬇️ |
| evmrpc/info.go | 77.68% <100.00%> (ø) |
| evmrpc/sei_legacy.go | 73.13% <100.00%> (ø) |
| evmrpc/sei_legacy_http.go | 80.23% <ø> (ø) |
| evmrpc/simulate.go | 73.94% <100.00%> (ø) |
| evmrpc/subscribe.go | 62.36% <100.00%> (ø) |
| evmrpc/tracers.go | 65.30% <100.00%> (ø) |
| evmrpc/tx.go | 85.01% <100.00%> (ø) |
| evmrpc/txpool.go | 75.00% <100.00%> (ø) |

... and 9 more

Comment thread evmrpc/utils.go
```diff
 // Automatically detect success/failure based on panic state
 panicValue := recover()
-success := panicValue == nil || err != nil
+success := panicValue == nil && err == nil
```
Contributor Author:

The previous success value would be true as long as there was no panic (because of the short-circuit OR), even when err was non-nil. Updated here to require both no panic and no error.
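The corrected check can be exercised in isolation. A small stand-alone sketch, where recordOutcome is a hypothetical stand-in for the telemetry call:

```go
package main

import "fmt"

// lastSuccess captures what the hypothetical telemetry call recorded.
var lastSuccess bool

func recordOutcome(success bool) { lastSuccess = success }

// doWork demonstrates the corrected check: the request counts as successful
// only when there was no panic AND no returned error.
func doWork(fail bool) (err error) {
	defer func() {
		panicValue := recover()
		success := panicValue == nil && err == nil
		recordOutcome(success)
		if panicValue != nil {
			panic(panicValue) // re-raise so callers still see the panic
		}
	}()
	if fail {
		return fmt.Errorf("boom")
	}
	return nil
}

func main() {
	_ = doWork(true)
	fmt.Println(lastSuccess) // error path is recorded as failure
	_ = doWork(false)
	fmt.Println(lastSuccess) // clean path is recorded as success
}
```

The deferred closure reads the named return value err after it has been set, which is what makes the combined panic-and-error check possible.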

@amir-deris amir-deris requested review from bdchatham and masih April 16, 2026 19:25
Comment thread evmrpc/rpc_telemetry.go Outdated
```go
)

func recordRPCRequest(endpoint, connection string, success bool) {
	ctx := context.Background()
```
Contributor:

nit: we should either 1) take a context when the struct is built or 2) take a context through the api.

  1. is not suitable because we are creating a singleton object effectively statically.
  2. is the right way to go, although will expand the scope of the refactor.

Once you pass this in, you can provide it to the delegated methods directly instead of generating anew.

Contributor Author:

Sounds good. I will work on receiving the context as an argument for option 2.

Comment thread evmrpc/rpc_telemetry.go Outdated
```go
func recordFilterLogFetchBatchComplete(pipeline string) {
	ctx := context.Background()
	rpcTelemetryMetrics.filterLogFetchBatches.Add(ctx, 1,
		metric.WithAttributes(attribute.String("pipeline", pipeline)),
```
Contributor:

Pipeline is a bit of an ambiguous attribute name. Can you fill me in on your thinking?

Contributor Author:

I agree, pipeline doesn't seem like a sensible name here. The label distinguishes block-fetch worker batches from log-extraction worker batches in the filter path. How about one of the following options?

  • stage
  • phase
  • batch_kind or kind
  • operation

bdchatham (Contributor) commented Apr 16, 2026:

Let's take a step back to reframe this.

What are we trying to measure with this metric:

```go
filterLogFetchBatches: must(rpcTelemetryMeter.Int64Counter(
	"evmrpc_filter_log_fetch_batches_total",
	metric.WithDescription("Internal filter/getLogs block batches completed (per pipeline path, not per RPC)"),
	metric.WithUnit("{batch}")
```

For instance, as a service owner, when I am looking into performance or operational behavior, what does the count tell me about the operation of the service? And for the questions I am trying to use it to answer, does it give me the whole answer? Are other metrics needed to supplement the reasoning?

This is a meta question so don't feel like this should all be deduced from the line of code we're looking at 😄

Contributor Author:

That is a great point. The initial intention was to keep the existing metrics and only migrate them to OTEL. But now that we are doing this work, it makes sense to step back, think about the big picture and the value this initiative will bring, and decide which signals are worth keeping.
In this particular instance, the metric doesn't seem to bring much value and is not part of any dashboards I can find, so I can remove it if that is ok.

Comment thread evmrpc/info.go Outdated
Comment on lines +51 to +74
```diff
@@ -59,19 +59,19 @@ func (i *InfoAPI) BlockNumber() hexutil.Uint64 {
 //nolint:revive
 func (i *InfoAPI) ChainId() *hexutil.Big {
 	startTime := time.Now()
-	defer recordMetrics("eth_ChainId", i.connectionType, startTime)
+	defer recordMetrics(context.Background(), "eth_ChainId", i.connectionType, startTime)
 	return (*hexutil.Big)(i.keeper.ChainID(i.ctxProvider(LatestCtxHeight)))
 }

 func (i *InfoAPI) Coinbase() (addr common.Address, err error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("eth_Coinbase", i.connectionType, startTime, err)
+	defer recordMetricsWithError(context.Background(), "eth_Coinbase", i.connectionType, startTime, err)
 	return i.keeper.GetFeeCollectorAddress(i.ctxProvider(LatestCtxHeight))
 }

 func (i *InfoAPI) Accounts() (result []common.Address, returnErr error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("eth_Accounts", i.connectionType, startTime, returnErr)
+	defer recordMetricsWithError(context.Background(), "eth_Accounts", i.connectionType, startTime, returnErr)
```
bdchatham (Contributor) commented Apr 16, 2026:

We should sort these out to use real request-scoped contexts. Any code path that creates a background context (e.g. context.Background() or context.TODO()) deep in the stack is a smell. It means we've lost the request context that should have been passed from the caller.

The way this should work is every incoming request gets its own context at the API layer, and that context gets threaded down through the stack so each layer can use it. OTel wires this up automatically. Once the SDK and your API framework are wired up, spans and trace data get attached to that request context as it flows through, without you having to do anything extra per call site. This is how OTel makes our existing setup more powerful out of the box.

So the two things we need to close the loop on our telemetry strategy are: 1) make sure the OTel SDK and API framework integration are configured correctly so trace propagation is set up at the boundary, and 2) make sure every handler is passing its request context down the call stack rather than creating new detached contexts along the way.

Contributor Author:

Thanks for the feedback. Sounds good, I will work on passing the context through correctly rather than creating context.Background on the fly.


Comment thread evmrpc/filter.go
```go
defer func() {
	metrics.IncrementRpcRequestCounter("num_blocks_fetched", "blocks", true)
}()
```

Contributor:

I'm aligned with removing this. The naming is not representative of what it's actually measuring.

Only argument I could see for keeping this is if any of our own telemetry or external infra providers would use this.

@masih - technically this could break an external party's observability. The value of this metric seems limited at best. I'm thinking we remove it and replace it with a better one in a follow-up.

wdyt?

Collaborator:

I am OK with removing this. I recommend looking into git history to see why this was added. I suspect it was added to debug a point of potential contention.

Contributor Author:

It seems it was added in this PR as part of regular implementation (not debugging or hotfixing):
#2195

masih (Collaborator) left a comment:

Yay glad to see OTEL integration picking up 🚀

Blockers:

  • standardise on seconds to measure time/duration/latency
  • Remove redundant counters

Left a bunch of naming suggestions and a few questions

Thanks @amir-deris 🙌

Comment thread evmrpc/rpc_telemetry.go Outdated
```go
// configured by the application.

var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")
```
Collaborator:

  • There would be a single meter for this entire package. I would simply name this meter.
  • I recommend removing the repetitive _rpc suffix.

Contributor Author:

Actually I just noticed there is another meter definition in evmrpc/worker_pool_metrics.go, and it clashes with this one if we rename this meter:
`meter = otel.Meter("evmrpc_workerpool")`
Should we combine all meters into one file (metrics.go)?

Comment thread evmrpc/rpc_telemetry.go Outdated
```go
var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

	rpcTelemetryMetrics = struct {
```
Collaborator:

Similarly, I would call this metrics.

Comment thread evmrpc/rpc_telemetry.go Outdated
```go
rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

rpcTelemetryMetrics = struct {
	requests metric.Int64Counter
```
Collaborator:

  • General note on naming: this reads too abstract. I would add a Count at the end of it.
  • The counter itself is redundant, if we always measure latency in a histogram. See my other comment.

Contributor Author:

I will remove this counter, as discussed in histogram comment.

Comment thread evmrpc/rpc_telemetry.go Outdated

```go
rpcTelemetryMetrics = struct {
	requests         metric.Int64Counter
	requestLatencyMs metric.Float64Histogram
```
Collaborator:

Always standardise on seconds for latency/duration measurements. Duration.Seconds() conveniently returns a float64 without the loss of precision.

Comment thread evmrpc/rpc_telemetry.go Outdated
```go
rpcTelemetryMetrics = struct {
	requests          metric.Int64Counter
	requestLatencyMs  metric.Float64Histogram
	websocketConnects metric.Int64Counter
```
Collaborator:

Similarly I would add a Count at the end, and perhaps pick a simpler name: wsConnectionCount or similar.

Comment thread evmrpc/rpc_telemetry.go Outdated
```go
func recordRPCRequest(ctx context.Context, endpoint, connection string, success bool) {
	rpcTelemetryMetrics.requests.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("endpoint", endpoint),
```
Collaborator:

Define attrKeys for repeated tags as vars?

Comment thread evmrpc/rpc_telemetry.go Outdated
```go
metric.WithAttributes(
	attribute.String("endpoint", endpoint),
	attribute.String("connection", connection),
	attribute.Bool("success", success),
```
Collaborator:

A better approach is to take error and tag by error type for a fixed set of well known errors.

Also, do we measure the JSON-RPC response code anywhere? That would be super useful.

Contributor Author:

I added new method classifyRPCMetricError to record error type and response code with latency.

Comment thread evmrpc/association.go
```diff
 func (t *AssociationAPI) Associate(ctx context.Context, req *AssociateRequest) (returnErr error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("sei_associate", t.connectionType, startTime, returnErr)
+	defer recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr)
```
Collaborator:

Why are we keeping recordMetricsWithError calls at all?

Is the idea to not break the existing metrics, roll out both then remove the old stuff?

Contributor Author:

I thought we needed to keep these metrics to stay backward compatible with any dashboards we might have. Should I remove them?

Comment thread evmrpc/rpc_telemetry.go Outdated
```diff
@@ -0,0 +1,65 @@
+package evmrpc
```
Collaborator:

Conventionally I would simply call this file metrics.go. Each package would have one as needed. That Go file would initialise once on startup and that's it; all package-level vars.
