
feat: Measure execution status duration#174

Open
morgan-wowk wants to merge 1 commit into master from task-status-duration

Conversation

@morgan-wowk
Collaborator

@morgan-wowk morgan-wowk commented Mar 18, 2026

Changes

  • Adds status_updated_at column to ExecutionNode table to track when execution status last changed
  • Implements SQLAlchemy event listener to automatically update status_updated_at timestamp when container_execution_status changes
  • Creates execution_status_transition_duration histogram metric to measure time spent in each execution status
  • Adds _transition_execution_status() helper function to centralize status updates and metric recording across all status transitions
  • Implements database migration logic to add the new status_updated_at column to existing tables
  • Replaces direct status assignments throughout the orchestrator with calls to the new transition helper function

Show of work

Note: Attribute names have since changed to execution_status_ prefix


Local smoke-test and verification completed ✅

Collaborator Author

morgan-wowk commented Mar 18, 2026

@morgan-wowk morgan-wowk marked this pull request as ready for review March 18, 2026 04:45
@morgan-wowk morgan-wowk requested a review from Ark-kun as a code owner March 18, 2026 04:45
Collaborator Author

Load Impact: feat: Measure execution status duration

The commit adds zero additional database calls during steady-state operation. status_updated_at is written as part of the existing UPDATE statement that already commits container_execution_status — SQLAlchemy batches both columns into one flush. Reading the previous status and timestamp before each transition is always an in-memory access from the already-loaded SQLAlchemy instance. The only one-time database cost is a single SELECT on every process restart to check for the new column, plus a single ALTER TABLE on first deployment. On the metrics side, record_status_transition appends to an in-memory histogram bucket and returns immediately; the OTel SDK exports observations to the collector in background batches, adding one additional metric stream with no synchronous I/O on the hot path.

| Area | Impact |
| --- | --- |
| DB calls at startup | +1 SELECT per restart; +1 ALTER TABLE on first deploy only |
| DB calls per transition | None — status_updated_at piggybacks on the existing UPDATE |
| DB calls for reading previous status | None — in-memory reads from the live SQLAlchemy instance |
| CPU per transition | Negligible — one datetime.now() + one histogram bucket increment |
| Memory | Negligible — one additional datetime field per loaded ExecutionNode |
| Network (OTel export) | One additional histogram metric stream, exported in background batches |

@morgan-wowk morgan-wowk requested a review from Volv-G March 18, 2026 05:03
Comment thread cloud_pipelines_backend/backend_types_sql.py Outdated
Comment thread cloud_pipelines_backend/database_ops.py Outdated
Comment thread cloud_pipelines_backend/database_ops.py Outdated
Comment thread cloud_pipelines_backend/database_ops.py Outdated
Comment thread cloud_pipelines_backend/database_ops.py Outdated
Comment thread cloud_pipelines_backend/orchestrator_sql.py Outdated
Comment thread cloud_pipelines_backend/orchestrator_sql.py Outdated
Comment thread cloud_pipelines_backend/orchestrator_sql.py Outdated
Comment thread cloud_pipelines_backend/instrumentation/metrics.py Outdated
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from c1a2020 to a53502b Compare March 20, 2026 18:50
@morgan-wowk morgan-wowk requested a review from yuechao-qin March 20, 2026 18:57
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from a53502b to 959b2f6 Compare March 20, 2026 19:00
Comment thread cloud_pipelines_backend/database_ops.py Outdated
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 2 times, most recently from 949e41c to c8256fc Compare March 24, 2026 01:16
Contributor

@Ark-kun Ark-kun left a comment


Hi, Morgan. Thank you for your patience.
Let's make several changes to improve this feature.

  1. Let's follow the Kubernetes example and record history and times of all status transitions.
    In some cases Kubernetes records "Last time resource was in state X" to compress the state history when states can be repeated hundreds of times. In our case, we don't expect so much status re-entrance, so let's just record the whole history.
    This will allow much richer and future-proof metrics.

Example:

container_execution_status_history:
  - status: PENDING
    first_observed_at: XXX
  - status: RUNNING
    first_observed_at: YYY
  2. Instead of creating a new DB column and adding a manually written migration, let's use extra_data. It's designed for extra data that does not fit into the current static DB schema. This brings SQL tables closer to the usability of document DBs.

  3. Instead of changing the orchestrator, let's set up a SQLAlchemy event listener for the ExecutionNode.container_execution_status change. This allows reliably intercepting all status changes, and it also keeps the orchestrator code simpler.

@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 2 times, most recently from ba1fae1 to 2fd4062 Compare April 8, 2026 19:53
@morgan-wowk
Collaborator Author

morgan-wowk commented Apr 8, 2026

Thanks for the thoughtful review, Alexey! I've reworked the implementation in this push.

What changed:

Status history in extra_data — the status_updated_at column and its migration are gone. Each status transition is now recorded as { "status": ..., "first_observed_at": ... } appended to extra_data["container_execution_status_history"].

SQLAlchemy event listeners (sql_event_listeners.py) — two thin hooks do the wiring:

  • A set listener on ExecutionNode.container_execution_status calls execution.stage_status_history_entry(value=value) to timestamp and stage the entry onto the instance.
  • A before_flush session listener calls obj.handle_before_flush() for any ExecutionNode in the dirty/new set.

All model-specific business logic lives on ExecutionNode itself:

  • stage_status_history_entry() — builds the timestamped entry and queues it in _pending_extra_data.
  • apply_status_history_entries() — merges pending entries into extra_data and emits transition events; only clears its own key from _pending_extra_data so other staged keys are left intact.
  • handle_before_flush() — entry point called by the session listener; delegates to apply_status_history_entries().
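The three methods above can be sketched as a plain-Python simulation of the staging flow — this is an illustration of the described design, not the real model, and the history key name is an assumption:

```python
import datetime

# Assumed key under extra_data (the real constant lives in backend_types_sql).
STATUS_HISTORY_KEY = "container_execution_status_history"


class ExecutionNode:
    """Simulation of the staging flow described above (not the real SQLAlchemy model)."""

    def __init__(self):
        self.extra_data = None
        self._pending_extra_data = {}

    def stage_status_history_entry(self, value):
        # Build a timestamped entry and queue it on the instance.
        entry = {
            "status": value,
            "first_observed_at": datetime.datetime.now(
                datetime.timezone.utc
            ).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
        self._pending_extra_data.setdefault(STATUS_HISTORY_KEY, []).append(entry)

    def apply_status_history_entries(self):
        # Merge pending entries into extra_data; clear only our own key so
        # other staged keys are left intact.
        pending = self._pending_extra_data.pop(STATUS_HISTORY_KEY, [])
        if not pending:
            return
        extra = dict(self.extra_data or {})
        extra[STATUS_HISTORY_KEY] = extra.get(STATUS_HISTORY_KEY, []) + pending
        self.extra_data = extra

    def handle_before_flush(self):
        # Entry point called by the before_flush session listener.
        self.apply_status_history_entries()
```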

Typed event system (event_listeners.py) — a small generic pub/sub module with subscribe(event_type, callback, asynchronous) and emit(event). StatusTransitionEvent is a frozen dataclass carrying from_status, to_status, and duration_seconds. Listeners can opt into async dispatch (daemon thread) or synchronous dispatch per subscription.
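A minimal sketch of such a pub/sub module — the names (`subscribe`, `emit`, `StatusTransitionEvent`) come from the PR text, but the internals are assumptions:

```python
import threading
from dataclasses import dataclass

# Registry mapping event type -> list of (callback, asynchronous) subscriptions.
_subscribers = {}


def subscribe(event_type, callback, asynchronous=False):
    """Register a callback for a given event type."""
    _subscribers.setdefault(event_type, []).append((callback, asynchronous))


def emit(event):
    """Dispatch an event to all subscribers of its exact type."""
    for callback, asynchronous in _subscribers.get(type(event), []):
        if asynchronous:
            # Async subscriptions are dispatched on a daemon thread.
            threading.Thread(target=callback, args=(event,), daemon=True).start()
        else:
            callback(event)


@dataclass(frozen=True)
class StatusTransitionEvent:
    from_status: str
    to_status: str
    duration_seconds: float
```

A synchronous subscriber is then just `subscribe(StatusTransitionEvent, on_transition, asynchronous=False)`, which is how the metrics module hooks in.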

Metrics (instrumentation/metrics.py) — register_listener() subscribes to StatusTransitionEvent with asynchronous=False so the histogram record happens synchronously on the flush thread, then the OTel SDK batches the export independently.

Orchestrator simplified — the _transition_execution_status() helper is gone entirely. All call sites are plain execution.container_execution_status = ContainerExecutionStatus.X assignments.

@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 3 times, most recently from ecc6948 to d51d844 Compare April 8, 2026 20:32
@morgan-wowk morgan-wowk marked this pull request as draft April 8, 2026 20:35
@morgan-wowk morgan-wowk changed the base branch from master to graphite-base/174 April 8, 2026 22:40
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from d51d844 to ae305ae Compare April 8, 2026 22:40
@morgan-wowk morgan-wowk changed the base branch from graphite-base/174 to event-system April 8, 2026 22:40
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from ae305ae to 6632d71 Compare April 8, 2026 23:01
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 4 times, most recently from 7e42851 to 79b4de3 Compare April 9, 2026 00:12
@morgan-wowk morgan-wowk marked this pull request as ready for review April 9, 2026 00:35
@morgan-wowk morgan-wowk requested a review from Ark-kun April 9, 2026 00:35
Collaborator Author

@Ark-kun Okay, I've changed this around to be more event-based. Let me know how you like it.

Comment thread cloud_pipelines_backend/backend_types_sql.py Outdated
Comment thread cloud_pipelines_backend/backend_types_sql.py Outdated
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from b661857 to 747ee6f Compare April 9, 2026 21:26
@morgan-wowk morgan-wowk changed the base branch from event-system to graphite-base/174 April 13, 2026 21:07
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from 747ee6f to e425fd8 Compare April 13, 2026 21:07
@morgan-wowk morgan-wowk changed the base branch from graphite-base/174 to master April 13, 2026 21:07
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 2 times, most recently from 7037c83 to abca7d4 Compare April 13, 2026 21:31
@morgan-wowk morgan-wowk requested a review from Ark-kun April 13, 2026 21:35
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from abca7d4 to 83a6328 Compare April 13, 2026 21:40
Collaborator Author

@Ark-kun This has been simplified per our discussion.

Verified locally in Prometheus ✅

Changes:

  • Adds histogram measurement for execution node status duration without adding additional database load to the system
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from 83a6328 to 34dc0a9 Compare April 14, 2026 20:58
prev_time = datetime.datetime.fromisoformat(prev["first_observed_at"])
curr_time = datetime.datetime.fromisoformat(curr["first_observed_at"])
try:
    metrics.execution_status_transition_duration.record(
Collaborator Author


OTel is built to never throw a runtime exception, but in case it ever did, I don't want that to result in a rollback of the commit - so I've wrapped this with a try/except.

def _handle_container_execution_status_set(
    execution: backend_types_sql.ExecutionNode,
    value: typing.Any,
    _old_value: typing.Any,
Contributor

@Ark-kun Ark-kun Apr 17, 2026


Do we need to check whether the new value is different from the old value?
(If SQLAlchemy already checks that and does not fire the event when the value is the same, then no need to change anything.)

@sql_event.listens_for(backend_types_sql.ExecutionNode.container_execution_status, "set")
def _handle_container_execution_status_set(
    execution: backend_types_sql.ExecutionNode,
    value: typing.Any,
Contributor


Maybe let's add type for value?

Comment on lines +21 to +43
@sql_event.listens_for(backend_types_sql.ExecutionNode.container_execution_status, "set")
def _handle_container_execution_status_set(
    execution: backend_types_sql.ExecutionNode,
    value: typing.Any,
    _old_value: typing.Any,
    _initiator: typing.Any,
) -> None:
    if value is None:
        return
    if execution.extra_data is None:
        execution.extra_data = {}
    history: list = execution.extra_data.get(
        backend_types_sql.EXECUTION_NODE_EXTRA_DATA_STATUS_HISTORY_KEY, []
    )
    entry = {
        "status": value.value,
        "first_observed_at": datetime.datetime.now(datetime.timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }
    execution.extra_data = {
        **execution.extra_data,
        backend_types_sql.EXECUTION_NODE_EXTRA_DATA_STATUS_HISTORY_KEY: history + [entry],
    }
Contributor


Let's move this whole function to the orchestrator module. I think maintaining the status history becomes part of its job, so it belongs there. I'm kind of on the fence here, but modifications to ExecutionNode are probably the jurisdiction of the orchestrator.
This will also solve the issue of wiring it up automatically without relying on an import.

And _handle_before_commit function can go to some instrumentation module.

Contributor

@Ark-kun Ark-kun left a comment


Thank you. Approved.

But let's split the sql_event_listeners module (simple).
And let's check that we do not add history entries for duplicate status.
