feat(evmrpc): migrate RPC telemetry to OpenTelemetry Meter API #3265
amir-deris wants to merge 9 commits into main from
Conversation
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```
@@ Coverage Diff @@
##            main    #3265   +/-  ##
==========================================
- Coverage   59.30%   59.30%   -0.01%
==========================================
  Files        2071     2072       +1
  Lines      169814   169817       +3
==========================================
+ Hits       100707   100708       +1
- Misses      60333    60334       +1
- Partials     8774     8775       +1
```
```diff
 // Automatically detect success/failure based on panic state
 panicValue := recover()
-success := panicValue == nil || err != nil
+success := panicValue == nil && err == nil
```
The previous success value would become true as long as there was no panic (due to short-circuiting of OR), even in cases where err was not nil. Updated here to account for both panic and err.
```go
func recordRPCRequest(endpoint, connection string, success bool) {
	ctx := context.Background()
```
nit: we should either 1) take a context when the struct is built or 2) take a context through the API.
Option 1 is not suitable because we are creating a singleton object, effectively statically.
Option 2 is the right way to go, although it will expand the scope of the refactor.
Once you pass this in, you can provide it to the delegated methods directly instead of generating a new one.
Sounds good. I will work on receiving the context as an argument for option 2.
```go
func recordFilterLogFetchBatchComplete(pipeline string) {
	ctx := context.Background()
	rpcTelemetryMetrics.filterLogFetchBatches.Add(ctx, 1,
		metric.WithAttributes(attribute.String("pipeline", pipeline)),
```
Pipeline is a bit of an ambiguous attribute name. Can you fill me in on your thinking?
I agree with you, pipeline doesn't seem like a sensible name here. It seems the label distinguishes block-fetch worker batches vs log-extraction worker batches in the filter path. How about the following options, do you prefer any of them?
stage
phase
batch_kind or kind
operation
Let's take a step back to reframe this.
What are we trying to measure with this metric:
```go
filterLogFetchBatches: must(rpcTelemetryMeter.Int64Counter(
	"evmrpc_filter_log_fetch_batches_total",
	metric.WithDescription("Internal filter/getLogs block batches completed (per pipeline path, not per RPC)"),
	metric.WithUnit("{batch}"),
```
For instance, as a service owner, when I am looking into performance or operational behavior, what does the count tell me about the operation of the service? And for the questions I am trying to use it to answer, does it give me the whole answer? Are other metrics needed to supplement the reasoning?
This is a meta question so don't feel like this should all be deduced from the line of code we're looking at 😄
That is a great point. I believe the intention initially was to keep the existing metrics and only migrate them to OTEL. But now that we are doing this work, it makes sense to step back and think about the big picture and the value this initiative is going to bring. Along the way, we can decide which signals bring value and which we should keep.
In this particular instance, this metric doesn't seem to bring much value and is not part of any dashboards I can find. So I can remove it if that is OK.
```diff
@@ -59,19 +59,19 @@ func (i *InfoAPI) BlockNumber() hexutil.Uint64 {
 //nolint:revive
 func (i *InfoAPI) ChainId() *hexutil.Big {
 	startTime := time.Now()
-	defer recordMetrics("eth_ChainId", i.connectionType, startTime)
+	defer recordMetrics(context.Background(), "eth_ChainId", i.connectionType, startTime)
 	return (*hexutil.Big)(i.keeper.ChainID(i.ctxProvider(LatestCtxHeight)))
 }

 func (i *InfoAPI) Coinbase() (addr common.Address, err error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("eth_Coinbase", i.connectionType, startTime, err)
+	defer recordMetricsWithError(context.Background(), "eth_Coinbase", i.connectionType, startTime, err)
 	return i.keeper.GetFeeCollectorAddress(i.ctxProvider(LatestCtxHeight))
 }

 func (i *InfoAPI) Accounts() (result []common.Address, returnErr error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("eth_Accounts", i.connectionType, startTime, returnErr)
+	defer recordMetricsWithError(context.Background(), "eth_Accounts", i.connectionType, startTime, returnErr)
```
We should sort these out to use real request-scoped contexts. Any code path that creates a background context (e.g. context.Background() or context.TODO()) deep in the stack is a smell. It means we've lost the request context that should have been passed from the caller.
The way this should work is every incoming request gets its own context at the API layer, and that context gets threaded down through the stack so each layer can use it. OTel wires this up automatically. Once the SDK and your API framework are wired up, spans and trace data get attached to that request context as it flows through, without you having to do anything extra per call site. This is how OTel makes our existing setup more powerful out of the box.
So the two things we need to close the loop on our telemetry strategy are: 1) make sure the OTel SDK and API framework integration are configured correctly so trace propagation is set up at the boundary, and 2) make sure every handler is passing its request context down the call stack rather than creating new detached contexts along the way.
Thanks for feedback. Sounds good, I will work on adding the context correctly and not creating context.Background on the fly.
```go
defer func() {
	metrics.IncrementRpcRequestCounter("num_blocks_fetched", "blocks", true)
}()
```
I'm aligned with removing this. The naming is not representative of what it's actually measuring.
The only argument I could see for keeping it is if any of our own telemetry or external infra providers would use it.
@masih - technically this could break an external party's observability. The value of this metric seems limited at best. I'm thinking we remove it and replace it with a better one in a follow-up, wdyt?
I am OK with removing this. I recommend looking into git history to see why this was added. I suspect it was added to debug a point of potential contention.
It seems it was added in this PR as part of regular implementation (not debugging or hotfixing):
#2195
masih left a comment
Yay glad to see OTEL integration picking up 🚀
Blockers:
- standardise on `seconds` to measure time/duration/latency
- remove redundant counters

Left a bunch of naming suggestions and a few questions.
Thanks @amir-deris 🙌
```go
// configured by the application.

var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")
```
- There would be a single meter for this entire package. I would simply name it `meter`.
- I recommend removing the repetitive `_rpc` suffix.
Actually I just noticed there is another meter definition in evmrpc/worker_pool_metrics.go, and it clashes with this one if we rename this meter:

```go
meter = otel.Meter("evmrpc_workerpool")
```

Should we combine all meters into one file (metrics.go)?
```go
var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

	rpcTelemetryMetrics = struct {
```
Similarly, I would call this `metrics`.
```go
rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

rpcTelemetryMetrics = struct {
	requests metric.Int64Counter
```
- General note on naming: this reads too abstract. I would add a `Count` at the end of it.
- The counter itself is redundant if we always measure latency in a histogram. See my other comment.
I will remove this counter, as discussed in histogram comment.
```go
rpcTelemetryMetrics = struct {
	requests         metric.Int64Counter
	requestLatencyMs metric.Float64Histogram
```
Always standardise on seconds for latency/duration measurements. `Duration.Seconds()` conveniently returns a float64 without loss of precision.
```go
rpcTelemetryMetrics = struct {
	requests          metric.Int64Counter
	requestLatencyMs  metric.Float64Histogram
	websocketConnects metric.Int64Counter
```
Similarly I would add a `Count` at the end, and perhaps pick a simpler name: `wsConnectionCount` or similar.
```go
func recordRPCRequest(ctx context.Context, endpoint, connection string, success bool) {
	rpcTelemetryMetrics.requests.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("endpoint", endpoint),
```
Define attrKeys for repeated tags as vars?
```go
metric.WithAttributes(
	attribute.String("endpoint", endpoint),
	attribute.String("connection", connection),
	attribute.Bool("success", success),
```
A better approach is to take error and tag by error type for a fixed set of well known errors.
Also do we measure JSON RPC response code somewhere? that would be super useful.
I added a new method `classifyRPCMetricError` to record error type and response code with latency.
```diff
 func (t *AssociationAPI) Associate(ctx context.Context, req *AssociateRequest) (returnErr error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("sei_associate", t.connectionType, startTime, returnErr)
+	defer recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr)
```
Why are we keeping recordMetricsWithError calls at all?
Is the idea to not break the existing metrics, roll out both then remove the old stuff?
I thought we need to keep these metrics to be backward compatible with any dashboards we might have. Should I remove these metrics?
```diff
@@ -0,0 +1,65 @@
+package evmrpc
```

Conventionally I would simply call this file `metrics.go`. Each package would have one as needed. That Go file would initialise once on startup and that's it; all package-level vars.
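A sketch of that convention, using stdlib stand-ins for the OTel instrument types (a real metrics.go would construct instruments via `otel.Meter`; `must`, `counter`, and `newCounter` here mirror the PR's `must(...Int64Counter(...))` pattern but are illustrative):

```go
package main

import "fmt"

// must panics if instrument construction fails, mirroring the
// must(meter.Int64Counter(...)) pattern quoted in this review.
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

// counter is a stand-in for metric.Int64Counter in this sketch.
type counter struct{ name string }

func newCounter(name string) (counter, error) { return counter{name: name}, nil }

// All instruments live as package-level vars, initialised exactly once at
// startup; call sites across the package reference these directly.
var requestCount = must(newCounter("evmrpc_requests_total"))

func main() {
	fmt.Println(requestCount.name) // evmrpc_requests_total
}
```

Package-level var initialisation runs before main, so a misconfigured instrument fails fast at process start rather than at first use.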
Summary
Follow-up to #3253, rebased to use the standardized OpenTelemetry Meter API per reviewer feedback (@masih: "standardise on OTEL and use runtime bindings to report to prometheus instead of direct metrics reporting via prometheus client").
- Replaced `utils/metrics` calls in `evmrpc` (`filter.go`, `utils.go`, `websockets.go`) with a new `rpc_telemetry.go` layer using the OTel `metric.Meter` API (`otel.GetMeterProvider()`)
- Removed `IncrementRpcRequestCounter`, `MeasureRpcRequestLatency`, and `IncWebsocketConnects` from `utils/metrics` (evmrpc was the sole caller)
- Removed `num_blocks_fetched` batch counters from internal `processBatch` call sites (these were unused and not wired to any dashboard or alert)
- Added a `success` label to the latency histogram so failed requests can be tracked separately (requested in the "Migrate evmrpc telemetry from utils/metrics to direct Prometheus client" #3253 review)

New metric names (OTel naming convention, exported via the process-wide MeterProvider):
- `evmrpc_rpc_requests_total`: request throughput (labels: `endpoint`, `connection`, `success`)
- `evmrpc_rpc_request_latency_ms`: request latency histogram (labels: `endpoint`, `connection`, `success`)
- `evmrpc_websocket_connects_total`: websocket connection count

Note
Metric names have changed from the legacy `sei_rpc_request_counter` / `sei_rpc_request_latency_ms` / `sei_websocket_connects` series used in #3253. Existing dashboards in sei-infra (sei_node_monitoring, launch_warroom) will need updating.