feat(bigtable): add client side metric instrumentation to basic rpcs #16712
daniel-sanche wants to merge 8 commits into main
Conversation
Code Review
This pull request integrates client-side metrics tracking into the Bigtable data client for both asynchronous and synchronous implementations. It wraps key operations—including sample_row_keys, mutate_row, check_and_mutate_row, and read_modify_write_row—with metrics collection logic and introduces a tracked_retry wrapper to monitor retry attempts. Additionally, the PR refactors system tests by consolidating fixtures into a shared SystemTestRunner class and adds new system tests specifically for metrics. Feedback focuses on regressions in retry logic where critical arguments like sleep_generator and exception_factory were omitted in the transition to tracked_retry. There are also suggestions to improve resource cleanup in test fixtures and to relax restrictive timing assertions in tests to prevent flakiness.
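For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what a tracked_retry-style wrapper does. This is an illustration only; `AttemptRecorder`, the status strings, and the retry loop shape are assumptions, not the PR's actual implementation:

```python
import time

class AttemptRecorder:
    """Stand-in for the PR's metrics handler (assumed interface)."""
    def __init__(self):
        self.completed_attempts = []

def tracked_retry(target, recorder, retryable=(ConnectionError,), max_attempts=3):
    """Wrap `target` so each attempt's status and duration are recorded."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            start = time.monotonic_ns()
            try:
                result = target(*args, **kwargs)
                recorder.completed_attempts.append(("OK", time.monotonic_ns() - start))
                return result
            except retryable as exc:
                recorder.completed_attempts.append(
                    (type(exc).__name__, time.monotonic_ns() - start)
                )
                if attempt == max_attempts - 1:
                    raise
    return wrapper

recorder = AttemptRecorder()
calls = {"n": 0}

def flaky():
    # fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "done"

assert tracked_retry(flaky, recorder)() == "done"
assert [s for s, _ in recorder.completed_attempts] == ["ConnectionError", "ConnectionError", "OK"]
```

The reviewer feedback below concerns exactly this wrapping step: arguments the original retry helper received (such as `sleep_generator` and `exception_factory`) must still be forwarded by the tracked version.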
),
clusters=cluster_config,
)
operation.result(timeout=240)
If operation.result(timeout=240) raises an exception (e.g., a TimeoutError), the fixture will stop execution and the delete_instance call in the teardown phase will never be reached. This can lead to leaked Bigtable instances in the test project. Consider wrapping the creation and yield in a try...finally block or ensuring cleanup happens even on creation failure.
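A self-contained sketch of the suggested fix. `FakeOperation` and `FakeAdminClient` are stand-ins for the real Bigtable admin API, used only to show that a generator fixture's `finally` clause runs even when setup raises:

```python
class FakeOperation:
    def result(self, timeout=None):
        # simulate a creation timeout
        raise TimeoutError("operation timed out")

class FakeAdminClient:
    def __init__(self):
        self.deleted = []
    def create_instance(self, **kwargs):
        return FakeOperation()
    def delete_instance(self, name):
        self.deleted.append(name)

def instance_fixture(client, instance_id):
    # generator-style fixture body, as pytest would drive it
    operation = client.create_instance(instance_id=instance_id)
    try:
        operation.result(timeout=240)  # raises here
        yield instance_id
    finally:
        # still runs, so the instance is not leaked
        client.delete_instance(name=instance_id)

client = FakeAdminClient()
gen = instance_fixture(client, "test-instance")
try:
    next(gen)  # setup raises TimeoutError, but the finally block fires first
except TimeoutError:
    pass
assert client.deleted == ["test-instance"]
```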
operation.zone
== cluster_config[operation.cluster_id].location.split("/")[-1]
)
assert operation.duration_ns > 0 and operation.duration_ns < 1e9
The assertion operation.duration_ns < 1e9 (1 second) might be too restrictive for system tests running against a live backend. Network latency or backend load could easily cause an RPC to exceed 1 second, leading to flaky tests. It is recommended to remove this upper bound or increase it significantly.
Suggested change:
- assert operation.duration_ns > 0 and operation.duration_ns < 1e9
+ assert operation.duration_ns > 0
mutianf left a comment
some final nit otherwise lgtm
tuple[Exception, Exception|None]:
    tuple of the exception to raise, and a cause exception if applicable
"""
exc_list = exc_list.copy()
why do we need to copy the exception list now?
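One plausible rationale (an assumption on my part; the thread does not answer the question): the function pops from the list while building the cause chain, and copying protects the caller's list from that mutation. A hypothetical illustration:

```python
def build_exception(exc_list):
    """Hypothetical helper mirroring the docstring above (not the PR's actual code)."""
    exc_list = exc_list.copy()  # without this, the pop below would mutate the caller's list
    cause = exc_list.pop() if exc_list else None
    return RuntimeError("operation failed"), cause

errors = [ValueError("a"), ValueError("b")]
exc, cause = build_exception(errors)
assert len(errors) == 2   # caller's list is untouched because of the copy
assert str(cause) == "b"
```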
# validate operation
operation = handler.completed_operations[0]
assert isinstance(operation, CompletedOperationMetric)
assert operation.final_status.value[0] == 0
why is this final_status.value[0] and not just final_status.value? or final_status.name = 'OK'?
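For context, assuming `final_status` is a `grpc.StatusCode`, whose `.value` in Python is a `(code, message)` tuple: `value[0]` is the numeric code, while `.name` is the symbolic name the reviewer suggests. A stdlib stand-in showing the three options:

```python
from enum import Enum

class StatusCode(Enum):
    # mirrors the shape of grpc.StatusCode members (assumption)
    OK = (0, "ok")
    ABORTED = (10, "aborted")
    UNAVAILABLE = (14, "unavailable")

status = StatusCode.OK
assert status.value == (0, "ok")  # the full tuple, not just the integer
assert status.value[0] == 0       # what the test currently checks
assert status.name == "OK"        # arguably the clearest assertion
```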
for i in range(num_retryable):
    attempt = handler.completed_attempts[i]
    assert isinstance(attempt, CompletedAttemptMetric)
    assert attempt.end_status.name == "ABORTED"
Carrying over my comment from before: I don't think we retry ABORTED errors for mutate_row by default, so maybe use UNAVAILABLE instead? I'm also a little surprised that this test passes.
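A self-contained sketch of the reviewer's point. The retryable set here is an assumption about mutate_row's defaults, not taken from the PR:

```python
# Assumed default retryable statuses for mutate_row; per the reviewer,
# ABORTED is not in this set, so retries on it should not happen.
RETRYABLE_FOR_MUTATE_ROW = {"DEADLINE_EXCEEDED", "UNAVAILABLE"}

def is_retryable(status_name: str) -> bool:
    return status_name in RETRYABLE_FOR_MUTATE_ROW

assert is_retryable("UNAVAILABLE")
assert not is_retryable("ABORTED")  # why a passing test asserting ABORTED retries is surprising
```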
Migration of googleapis/python-bigtable#1188 to the monorepo
This PR builds on googleapis/python-bigtable#1187 to add instrumentation to basic data client RPCs (check_and_mutate, read_modify_write, sample_row_keys, mutate_row).
Metrics are not currently exported anywhere; they are only collected and dropped. A future PR will add a GCP exporter to the system.