feat(evmrpc): migrate RPC telemetry to OpenTelemetry Meter API #3265
amir-deris wants to merge 9 commits into main from
Conversation
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```
@@ Coverage Diff @@
##            main    #3265   +/-  ##
==========================================
- Coverage   59.30%   59.30%   -0.01%
==========================================
  Files        2071     2072       +1
  Lines      169814   169817       +3
==========================================
+ Hits       100707   100708       +1
- Misses      60333    60334       +1
- Partials     8774     8775       +1
```
```diff
 // Automatically detect success/failure based on panic state
 panicValue := recover()
-success := panicValue == nil || err != nil
+success := panicValue == nil && err == nil
```
The previous success value would become true as long as there was no panic (due to short-circuiting of OR), even in cases where err was not nil. Updated here to account for both panic and err.
```go
func recordRPCRequest(endpoint, connection string, success bool) {
	ctx := context.Background()
```
nit: we should either 1) take a context when the struct is built or 2) take a context through the API.
Option 1 is not suitable because we are creating a singleton object, effectively statically.
Option 2 is the right way to go, although it will expand the scope of the refactor.
Once you pass this in, you can provide it to the delegated methods directly instead of generating a new one.
Sounds good. I will work on receiving the context as an argument for option 2.
```go
func recordFilterLogFetchBatchComplete(pipeline string) {
	ctx := context.Background()
	rpcTelemetryMetrics.filterLogFetchBatches.Add(ctx, 1,
		metric.WithAttributes(attribute.String("pipeline", pipeline)),
```
Pipeline is a bit of an ambiguous attribute name. Can you fill me in on your thinking?
I agree with you, pipeline doesn't seem like a sensible name here. It seems the label distinguishes block-fetch worker batches vs log-extraction worker batches in the filter path. How about the following options, do you prefer any of them?
stage
phase
batch_kind or kind
operation
Let's take a step back to reframe this.
What are we trying to measure with this metric:
```go
filterLogFetchBatches: must(rpcTelemetryMeter.Int64Counter(
	"evmrpc_filter_log_fetch_batches_total",
	metric.WithDescription("Internal filter/getLogs block batches completed (per pipeline path, not per RPC)"),
	metric.WithUnit("{batch}"),
```
For instance, as a service owner, when I am looking into performance or operational behavior, what does the count tell me about the operation of the service? And for the questions I am trying to use it to answer, does it give me the whole answer? Are other metrics needed to supplement the reasoning?
This is a meta question so don't feel like this should all be deduced from the line of code we're looking at 😄
That is a great point. I believe the intention initially was to keep the existing metrics and only migrate them to OTEL. But now that we are doing this work, it makes sense to step back and think about the big picture and the value this initiative is going to bring. Along the way, we can decide which signals bring value and which we should keep.
In this particular instance, this metric doesn't seem to bring much value and is not part of any dashboards I can find. So I can remove it if that is OK.
```diff
@@ -59,19 +59,19 @@ func (i *InfoAPI) BlockNumber() hexutil.Uint64 {
 //nolint:revive
 func (i *InfoAPI) ChainId() *hexutil.Big {
 	startTime := time.Now()
-	defer recordMetrics("eth_ChainId", i.connectionType, startTime)
+	defer recordMetrics(context.Background(), "eth_ChainId", i.connectionType, startTime)
 	return (*hexutil.Big)(i.keeper.ChainID(i.ctxProvider(LatestCtxHeight)))
 }

 func (i *InfoAPI) Coinbase() (addr common.Address, err error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("eth_Coinbase", i.connectionType, startTime, err)
+	defer recordMetricsWithError(context.Background(), "eth_Coinbase", i.connectionType, startTime, err)
 	return i.keeper.GetFeeCollectorAddress(i.ctxProvider(LatestCtxHeight))
 }

 func (i *InfoAPI) Accounts() (result []common.Address, returnErr error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("eth_Accounts", i.connectionType, startTime, returnErr)
+	defer recordMetricsWithError(context.Background(), "eth_Accounts", i.connectionType, startTime, returnErr)
```
We should sort these out to use real request-scoped contexts. Any code path that creates a background context (e.g. context.Background() or context.TODO()) deep in the stack is a smell. It means we've lost the request context that should have been passed from the caller.
The way this should work is every incoming request gets its own context at the API layer, and that context gets threaded down through the stack so each layer can use it. OTel wires this up automatically. Once the SDK and your API framework are wired up, spans and trace data get attached to that request context as it flows through, without you having to do anything extra per call site. This is how OTel makes our existing setup more powerful out of the box.
So the two things we need to close the loop on our telemetry strategy are: 1) make sure the OTel SDK and API framework integration are configured correctly so trace propagation is set up at the boundary, and 2) make sure every handler is passing its request context down the call stack rather than creating new detached contexts along the way.
Thanks for feedback. Sounds good, I will work on adding the context correctly and not creating context.Background on the fly.
```go
defer func() {
	metrics.IncrementRpcRequestCounter("num_blocks_fetched", "blocks", true)
}()
```
I'm aligned with removing this. The naming is not representative of what it's actually measuring.
The only argument I could see for keeping it is if any of our own telemetry or external infra providers would use it.
@masih - technically this could break an external party's observability. The value of this metric seems limited at best. I'm thinking we remove it and replace it with a better one in a follow-up, wdyt?
I am OK with removing this. I recommend looking into git history to see why this was added. I suspect it was added to debug a point of potential contention.
It seems it was added in this PR as part of regular implementation (not debugging or hotfixing):
#2195
masih left a comment
Yay glad to see OTEL integration picking up 🚀
Blockers:
- standardise on `seconds` to measure time/duration/latency
- remove redundant counters

Left a bunch of naming suggestions and a few questions.
Thanks @amir-deris 🙌
```go
// configured by the application.

var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")
```
- There would be a single meter for this entire package. I would simply name it `meter`.
- I recommend removing the repetitive `_rpc` suffix.
Actually I just noticed there is another meter definition in evmrpc/worker_pool_metrics.go, and it clashes with this one if we rename this meter:

```go
meter = otel.Meter("evmrpc_workerpool")
```

Should we combine all meters into one file (metrics.go)?
```go
var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

	rpcTelemetryMetrics = struct {
```
Similarly, I would call this `metrics`.
```go
rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

rpcTelemetryMetrics = struct {
	requests metric.Int64Counter
```
- General note on naming: this reads too abstract. I would add a `Count` at the end of it.
- The counter itself is redundant if we always measure latency in a histogram. See my other comment.
I will remove this counter, as discussed in histogram comment.
```go
rpcTelemetryMetrics = struct {
	requests         metric.Int64Counter
	requestLatencyMs metric.Float64Histogram
```
Always standardise on seconds for latency/duration measurements. `Duration.Seconds()` conveniently returns a float64 without loss of precision.
```go
rpcTelemetryMetrics = struct {
	requests          metric.Int64Counter
	requestLatencyMs  metric.Float64Histogram
	websocketConnects metric.Int64Counter
```
Similarly I would add a `Count` at the end, and perhaps pick a simpler name: `wsConnectionCount` or similar.
```go
func recordRPCRequest(ctx context.Context, endpoint, connection string, success bool) {
	rpcTelemetryMetrics.requests.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("endpoint", endpoint),
```
Define attrKeys for repeated tags as vars?
```go
metric.WithAttributes(
	attribute.String("endpoint", endpoint),
	attribute.String("connection", connection),
	attribute.Bool("success", success),
```
A better approach is to take error and tag by error type for a fixed set of well known errors.
Also do we measure JSON RPC response code somewhere? that would be super useful.
I added a new method `classifyRPCMetricError` to record error type and response code with latency.
```diff
 func (t *AssociationAPI) Associate(ctx context.Context, req *AssociateRequest) (returnErr error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("sei_associate", t.connectionType, startTime, returnErr)
+	defer recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr)
```
Why are we keeping recordMetricsWithError calls at all?
Is the idea to not break the existing metrics, roll out both then remove the old stuff?
I thought we need to keep these metrics to be backward compatible with any dashboards we might have. Should I remove these metrics?
```diff
@@ -0,0 +1,65 @@
+package evmrpc
```

Conventionally I would simply call this file `metrics.go`. Each package would have one as needed. That Go file would initialise once on startup and that's it; all package-level vars.
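A sketch of that convention, using stdlib stand-ins for the OTel instrument types (a real metrics.go would construct instruments via `otel.Meter`; `must`, `counter`, and `newCounter` here mirror the PR's `must(...Int64Counter(...))` pattern but are illustrative):

```go
package main

import "fmt"

// must panics if instrument construction fails, mirroring the
// must(meter.Int64Counter(...)) pattern quoted in this review.
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

// counter is a stand-in for metric.Int64Counter in this sketch.
type counter struct{ name string }

func newCounter(name string) (counter, error) { return counter{name: name}, nil }

// All instruments live as package-level vars, initialised exactly once at
// startup; call sites across the package reference these directly.
var requestCount = must(newCounter("evmrpc_requests_total"))

func main() {
	fmt.Println(requestCount.name) // evmrpc_requests_total
}
```

Package-level var initialisation runs before main, so a misconfigured instrument fails fast at process start rather than at first use.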
Summary
Follow-up to #3253, rebased to use the standardized OpenTelemetry Meter API per reviewer feedback (@masih: "standardise on OTEL and use runtime bindings to report to prometheus instead of direct metrics reporting via prometheus client").
- Replaced `utils/metrics` calls in `evmrpc` (`filter.go`, `utils.go`, `websockets.go`) with a new `rpc_telemetry.go` layer using the OTel `metric.Meter` API (`otel.GetMeterProvider()`)
- Removed `IncrementRpcRequestCounter`, `MeasureRpcRequestLatency`, and `IncWebsocketConnects` from `utils/metrics` (evmrpc was the sole caller)
- Removed `num_blocks_fetched` batch counters from internal `processBatch` call sites (these were unused and not wired to any dashboard or alert)
- Added a `success` label to the latency histogram so failed requests can be tracked separately (requested in the "Migrate evmrpc telemetry from utils/metrics to direct Prometheus client" #3253 review)

New metric names (OTel naming convention, exported via the process-wide MeterProvider):
- `evmrpc_rpc_requests_total`: request throughput (labels: `endpoint`, `connection`, `success`)
- `evmrpc_rpc_request_latency_ms`: request latency histogram (labels: `endpoint`, `connection`, `success`)
- `evmrpc_websocket_connects_total`: websocket connection count

Note
Metric names have changed from the legacy `sei_rpc_request_counter` / `sei_rpc_request_latency_ms` / `sei_websocket_connects` series used in #3253. Existing dashboards in sei-infra (sei_node_monitoring, launch_warroom) will need updating.