Conversation
Load Impact:

| Area | Impact |
|---|---|
| DB calls at startup | +1 SELECT per restart; +1 ALTER TABLE on first deploy only |
| DB calls per transition | None — status_updated_at piggybacks on the existing UPDATE |
| DB calls for reading previous status | None — in-memory reads from the live SQLAlchemy instance |
| CPU per transition | Negligible — one datetime.now() + one histogram bucket increment |
| Memory | Negligible — one additional datetime field per loaded ExecutionNode |
| Network (OTel export) | One additional histogram metric stream, exported in background batches |

Ark-kun left a comment:
Hi, Morgan. Thank you for your patience.
Let's make several changes to improve this feature.
- Let's follow the Kubernetes example and record the history and times of all status transitions.
  In some cases Kubernetes records "Last time resource was in state X" to compress the state history when states can be repeated hundreds of times. In our case, we don't expect so much status re-entrance, so let's just record the whole history.
  This will allow much richer and more future-proof metrics.
  Example:

      container_execution_status_history:
      - status: PENDING
        first_observed_at: XXX
      - status: RUNNING
        first_observed_at: YYY

- Instead of creating a new DB column and adding a manually written migration, let's use extra_data. It's designed for the extra data that does not fit into the current static DB schema. This brings SQL tables closer to the usability of the Document DBs.
- Instead of changing the orchestrator, let's set up a SQLAlchemy event listener for the ExecutionNode.container_execution_status change. This allows reliably intercepting all status changes. And it also keeps the orchestrator code simpler.
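
To illustrate why a full history allows richer metrics, here is a minimal sketch that derives the time spent in each status from a history list shaped like the example above. The function names `parse_utc` and `status_durations` and the sample timestamps are hypothetical and not part of this PR. The sketch assumes timestamps are stored as ISO 8601 strings with a trailing "Z".

```python
import datetime


def parse_utc(timestamp: str) -> datetime.datetime:
    """Parse an ISO 8601 timestamp such as "2024-01-01T00:00:05Z"."""
    # datetime.fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    if timestamp.endswith("Z"):
        timestamp = timestamp[:-1] + "+00:00"
    return datetime.datetime.fromisoformat(timestamp)


def status_durations(history: list[dict]) -> list[tuple[str, float]]:
    """Return (status, seconds spent in it) for every completed status."""
    durations = []
    # Pair each entry with its successor; the last status is still ongoing.
    for prev, curr in zip(history, history[1:]):
        elapsed = parse_utc(curr["first_observed_at"]) - parse_utc(prev["first_observed_at"])
        durations.append((prev["status"], elapsed.total_seconds()))
    return durations


# Hypothetical history with made-up timestamps.
example_history = [
    {"status": "PENDING", "first_observed_at": "2024-01-01T00:00:00Z"},
    {"status": "RUNNING", "first_observed_at": "2024-01-01T00:00:05Z"},
    {"status": "SUCCEEDED", "first_observed_at": "2024-01-01T00:01:05Z"},
]
print(status_durations(example_history))
# → [('PENDING', 5.0), ('RUNNING', 60.0)]
```

Because every transition is kept, the same list can later feed other metrics (for example, total time to first RUNNING) without a schema change.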

Thanks for the thoughtful review, Alexey! I've reworked the implementation in this push. What changed:
- Status history in SQLAlchemy event listeners
- All model-specific business logic lives on […]
- Typed event system
- Metrics
- Orchestrator simplified

@Ark-kun Okay, changed this around to be more event-based. Let me know how you like it.

@Ark-kun This has been simplified per our discussion. Verified locally in Prometheus ✅

**Changes:**
* Adds a histogram measurement for execution node status duration without adding database load to the system

    prev_time = datetime.datetime.fromisoformat(prev["first_observed_at"])
    curr_time = datetime.datetime.fromisoformat(curr["first_observed_at"])
    try:
        metrics.execution_status_transition_duration.record(

OTel is built to never throw a Runtime exception, but in case it ever did, I don't want that to result in a rollback of the commit - so I've wrapped this with a try/except.
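
The wrapping described here can be sketched as a small best-effort helper. This is an illustration under stated assumptions, not the code in this PR: `record_safely` and the two stand-in callables are hypothetical, and the `record` argument stands for any callable that takes a value and an attributes dict, such as a histogram's `record` method.

```python
import logging

logger = logging.getLogger(__name__)


def record_safely(record, value: float, attributes: dict) -> bool:
    """Call record(value, attributes); log and swallow any exception.

    Returns True if the measurement was recorded, False otherwise.
    """
    try:
        record(value, attributes)
        return True
    except Exception:
        # Metrics are best-effort: log the failure and never re-raise, so the
        # caller's database commit is unaffected.
        logger.exception("Failed to record status transition duration")
        return False


# Hypothetical stand-ins for a histogram's record() method.
recorded = []


def working_record(value, attributes):
    recorded.append((value, attributes))


def broken_record(value, attributes):
    raise RuntimeError("exporter unavailable")


ok = record_safely(working_record, 5.0, {"status": "PENDING"})
failed = record_safely(broken_record, 5.0, {"status": "PENDING"})
print(ok, failed)
```

Catching `Exception` rather than using a bare `except:` keeps `KeyboardInterrupt` and `SystemExit` propagating as usual.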

    def _handle_container_execution_status_set(
        execution: backend_types_sql.ExecutionNode,
        value: typing.Any,
        _old_value: typing.Any,

Do we need to check whether the new value is different from the old value?
(If SQLAlchemy already checks that and does not fire the event when the value is the same, then no need to change anything.)
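
One way to make the listener safe regardless of how the event behaves is to compare the incoming status against the last recorded history entry before appending. As far as I know, SQLAlchemy's attribute "set" event fires on every assignment, including one that re-assigns the current value, but that is worth confirming against the SQLAlchemy docs for the version in use. The helper below is a hypothetical, SQLAlchemy-free sketch of the check; `append_status` is not a name from this PR.

```python
import datetime


def append_status(history: list[dict], status: str, now: datetime.datetime) -> list[dict]:
    """Return a new history with `status` appended, unless it repeats the last entry."""
    if history and history[-1]["status"] == status:
        # Same status as the latest entry: keep the original first_observed_at.
        return history
    entry = {
        "status": status,
        "first_observed_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Build a new list instead of mutating, mirroring `history + [entry]` in the PR.
    return history + [entry]


now = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
history = append_status([], "PENDING", now)
history = append_status(history, "PENDING", now + datetime.timedelta(seconds=3))
history = append_status(history, "RUNNING", now + datetime.timedelta(seconds=5))
print([entry["status"] for entry in history])
# → ['PENDING', 'RUNNING']
```

Comparing against the stored history rather than the event's old value also sidesteps the case where the old value is not loaded and arrives as a sentinel instead of a real status.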

    @sql_event.listens_for(backend_types_sql.ExecutionNode.container_execution_status, "set")
    def _handle_container_execution_status_set(
        execution: backend_types_sql.ExecutionNode,
        value: typing.Any,

Maybe let's add type for value?

    @sql_event.listens_for(backend_types_sql.ExecutionNode.container_execution_status, "set")
    def _handle_container_execution_status_set(
        execution: backend_types_sql.ExecutionNode,
        value: typing.Any,
        _old_value: typing.Any,
        _initiator: typing.Any,
    ) -> None:
        if value is None:
            return
        if execution.extra_data is None:
            execution.extra_data = {}
        history: list = execution.extra_data.get(
            backend_types_sql.EXECUTION_NODE_EXTRA_DATA_STATUS_HISTORY_KEY, []
        )
        entry = {
            "status": value.value,
            "first_observed_at": datetime.datetime.now(datetime.timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%SZ"
            ),
        }
        execution.extra_data = {
            **execution.extra_data,
            backend_types_sql.EXECUTION_NODE_EXTRA_DATA_STATUS_HISTORY_KEY: history + [entry],

Let's move this whole function to the orchestrator module. I think maintaining the status history becomes part of its job, so it belongs there. I'm kind of on the fence here, but modifications to ExecutionNode are probably the jurisdiction of the orchestrator.
This will also solve the issue of wiring it up automatically without relying on an import.
And the _handle_before_commit function can go to some instrumentation module.

Ark-kun left a comment:
Thank you. Approved.
But let's split the sql_event_listeners module (simple).
And let's check that we do not add history entries for duplicate status.

Changes
- status_updated_at column to ExecutionNode table to track when execution status last changed
- status_updated_at timestamp when container_execution_status changes
- execution_status_transition_duration histogram metric to measure time spent in each execution status
- _transition_execution_status() helper function to centralize status updates and metric recording across all status transitions
- status_updated_at column to existing tables

Show of work
Note: Attribute names have since changed to execution_status_prefix
Local smoke-test and verification completed ✅