
feat: Measure execution status duration#174

Open
morgan-wowk wants to merge 1 commit into master from task-status-duration

Conversation

@morgan-wowk
Collaborator

@morgan-wowk morgan-wowk commented Mar 18, 2026

Changes

  • Adds status_updated_at column to ExecutionNode table to track when execution status last changed
  • Implements SQLAlchemy event listener to automatically update status_updated_at timestamp when container_execution_status changes
  • Creates execution_status_transition_duration histogram metric to measure time spent in each execution status
  • Adds _transition_execution_status() helper function to centralize status updates and metric recording across all status transitions
  • Implements database migration logic to add the new status_updated_at column to existing tables
  • Replaces direct status assignments throughout the orchestrator with calls to the new transition helper function

Show of work

Note: Attribute names have since changed to execution_status_ prefix


Local smoke-test and verification completed ✅

Collaborator Author

morgan-wowk commented Mar 18, 2026

@morgan-wowk morgan-wowk marked this pull request as ready for review March 18, 2026 04:45
@morgan-wowk morgan-wowk requested a review from Ark-kun as a code owner March 18, 2026 04:45
Collaborator Author

Load Impact: feat: Measure execution status duration

The commit adds zero additional database calls during steady-state operation. status_updated_at is written as part of the existing UPDATE statement that already commits container_execution_status — SQLAlchemy batches both columns into one flush. Reading the previous status and timestamp before each transition is always an in-memory access from the already-loaded SQLAlchemy instance. The only one-time database cost is a single SELECT on every process restart to check for the new column, plus a single ALTER TABLE on first deployment. On the metrics side, record_status_transition appends to an in-memory histogram bucket and returns immediately; the OTel SDK exports observations to the collector in background batches, adding one additional metric stream with no synchronous I/O on the hot path.

| Area | Impact |
| --- | --- |
| DB calls at startup | +1 SELECT per restart; +1 ALTER TABLE on first deploy only |
| DB calls per transition | None — status_updated_at piggybacks on the existing UPDATE |
| DB calls for reading previous status | None — in-memory reads from the live SQLAlchemy instance |
| CPU per transition | Negligible — one datetime.now() + one histogram bucket increment |
| Memory | Negligible — one additional datetime field per loaded ExecutionNode |
| Network (OTel export) | One additional histogram metric stream, exported in background batches |

@morgan-wowk morgan-wowk requested a review from Volv-G March 18, 2026 05:03
Comment thread cloud_pipelines_backend/backend_types_sql.py Outdated
Comment thread cloud_pipelines_backend/database_ops.py Outdated
Comment thread cloud_pipelines_backend/database_ops.py Outdated
Comment thread cloud_pipelines_backend/database_ops.py Outdated
Comment thread cloud_pipelines_backend/database_ops.py Outdated
Comment thread cloud_pipelines_backend/orchestrator_sql.py Outdated
Comment thread cloud_pipelines_backend/orchestrator_sql.py Outdated
Comment thread cloud_pipelines_backend/orchestrator_sql.py Outdated
Comment thread cloud_pipelines_backend/instrumentation/metrics.py Outdated
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from c1a2020 to a53502b Compare March 20, 2026 18:50
@morgan-wowk morgan-wowk requested a review from yuechao-qin March 20, 2026 18:57
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from a53502b to 959b2f6 Compare March 20, 2026 19:00
Comment thread cloud_pipelines_backend/database_ops.py Outdated
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 2 times, most recently from 949e41c to c8256fc Compare March 24, 2026 01:16
Contributor

@Ark-kun Ark-kun left a comment


Hi, Morgan. Thank you for your patience.
Let's make several changes to improve this feature.

  1. Let's follow the Kubernetes example and record history and times of all status transitions.
    In some cases Kubernetes records "Last time resource was in state X" to compress the state history when states can be repeated hundreds of times. In our case, we don't expect so much status re-entrance, so let's just record the whole history.
    This will allow much richer and future-proof metrics.

Example:

container_execution_status_history:
  - status: PENDING
    first_observed_at: XXX
  - status: RUNNING
    first_observed_at: YYY
  2. Instead of creating a new DB column and adding a manually written migration, let's use extra_data. It's designed for extra data that does not fit into the current static DB schema. This brings SQL tables closer to the usability of document DBs.

  3. Instead of changing the orchestrator, let's set up a SQLAlchemy event listener for the ExecutionNode.container_execution_status change. This allows reliably intercepting all status changes, and it also keeps the orchestrator code simpler.

@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 2 times, most recently from ba1fae1 to 2fd4062 Compare April 8, 2026 19:53
@morgan-wowk
Collaborator Author

morgan-wowk commented Apr 8, 2026

Thanks for the thoughtful review, Alexey! I've reworked the implementation in this push.

What changed:

Status history in extra_data — the status_updated_at column and its migration are gone. Each status transition is now recorded as { "status": ..., "first_observed_at": ... } appended to extra_data["container_execution_status_history"].

SQLAlchemy event listeners (sql_event_listeners.py) — two thin hooks do the wiring:

  • A set listener on ExecutionNode.container_execution_status calls execution.stage_status_history_entry(value=value) to timestamp and stage the entry onto the instance.
  • A before_flush session listener calls obj.handle_before_flush() for any ExecutionNode in the dirty/new set.

All model-specific business logic lives on ExecutionNode itself:

  • stage_status_history_entry() — builds the timestamped entry and queues it in _pending_extra_data.
  • apply_status_history_entries() — merges pending entries into extra_data and emits transition events; only clears its own key from _pending_extra_data so other staged keys are left intact.
  • handle_before_flush() — entry point called by the session listener; delegates to apply_status_history_entries().
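The three methods above can be sketched as a plain-Python simulation of the staging flow — this is an illustration of the described design, not the real model, and the history key name is an assumption:

```python
import datetime

# Assumed key under extra_data (the real constant lives in backend_types_sql).
STATUS_HISTORY_KEY = "container_execution_status_history"


class ExecutionNode:
    """Simulation of the staging flow described above (not the real SQLAlchemy model)."""

    def __init__(self):
        self.extra_data = None
        self._pending_extra_data = {}

    def stage_status_history_entry(self, value):
        # Build a timestamped entry and queue it on the instance.
        entry = {
            "status": value,
            "first_observed_at": datetime.datetime.now(
                datetime.timezone.utc
            ).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
        self._pending_extra_data.setdefault(STATUS_HISTORY_KEY, []).append(entry)

    def apply_status_history_entries(self):
        # Merge pending entries into extra_data; clear only our own key so
        # other staged keys are left intact.
        pending = self._pending_extra_data.pop(STATUS_HISTORY_KEY, [])
        if not pending:
            return
        extra = dict(self.extra_data or {})
        extra[STATUS_HISTORY_KEY] = extra.get(STATUS_HISTORY_KEY, []) + pending
        self.extra_data = extra

    def handle_before_flush(self):
        # Entry point called by the before_flush session listener.
        self.apply_status_history_entries()
```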

Typed event system (event_listeners.py) — a small generic pub/sub module with subscribe(event_type, callback, asynchronous) and emit(event). StatusTransitionEvent is a frozen dataclass carrying from_status, to_status, and duration_seconds. Listeners can opt into async dispatch (daemon thread) or synchronous dispatch per subscription.
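A minimal sketch of such a pub/sub module — the names (`subscribe`, `emit`, `StatusTransitionEvent`) come from the PR text, but the internals are assumptions:

```python
import threading
from dataclasses import dataclass

# Registry mapping event type -> list of (callback, asynchronous) subscriptions.
_subscribers = {}


def subscribe(event_type, callback, asynchronous=False):
    """Register a callback for a given event type."""
    _subscribers.setdefault(event_type, []).append((callback, asynchronous))


def emit(event):
    """Dispatch an event to all subscribers of its exact type."""
    for callback, asynchronous in _subscribers.get(type(event), []):
        if asynchronous:
            # Async subscriptions are dispatched on a daemon thread.
            threading.Thread(target=callback, args=(event,), daemon=True).start()
        else:
            callback(event)


@dataclass(frozen=True)
class StatusTransitionEvent:
    from_status: str
    to_status: str
    duration_seconds: float
```

A synchronous subscriber is then just `subscribe(StatusTransitionEvent, on_transition, asynchronous=False)`, which is how the metrics module hooks in.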

Metrics (instrumentation/metrics.py) — register_listener() subscribes to StatusTransitionEvent with asynchronous=False so the histogram record happens synchronously on the flush thread, then the OTel SDK batches the export independently.

Orchestrator simplified — the _transition_execution_status() helper is gone entirely. All call sites are plain execution.container_execution_status = ContainerExecutionStatus.X assignments.

@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 3 times, most recently from ecc6948 to d51d844 Compare April 8, 2026 20:32
@morgan-wowk morgan-wowk marked this pull request as draft April 8, 2026 20:35
@morgan-wowk morgan-wowk changed the base branch from master to graphite-base/174 April 8, 2026 22:40
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from d51d844 to ae305ae Compare April 8, 2026 22:40
@morgan-wowk morgan-wowk changed the base branch from graphite-base/174 to event-system April 8, 2026 22:40
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from ae305ae to 6632d71 Compare April 8, 2026 23:01
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 4 times, most recently from 7e42851 to 79b4de3 Compare April 9, 2026 00:12
@morgan-wowk morgan-wowk marked this pull request as ready for review April 9, 2026 00:35
@morgan-wowk morgan-wowk requested a review from Ark-kun April 9, 2026 00:35
Collaborator Author

@Ark-kun Okay, I've changed this around to be more event-based. Let me know how you like it.

Comment thread cloud_pipelines_backend/backend_types_sql.py Outdated
Comment thread cloud_pipelines_backend/backend_types_sql.py Outdated
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from b661857 to 747ee6f Compare April 9, 2026 21:26
@morgan-wowk morgan-wowk changed the base branch from event-system to graphite-base/174 April 13, 2026 21:07
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from 747ee6f to e425fd8 Compare April 13, 2026 21:07
@morgan-wowk morgan-wowk changed the base branch from graphite-base/174 to master April 13, 2026 21:07
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch 2 times, most recently from 7037c83 to abca7d4 Compare April 13, 2026 21:31
@morgan-wowk morgan-wowk requested a review from Ark-kun April 13, 2026 21:35
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from abca7d4 to 83a6328 Compare April 13, 2026 21:40
Collaborator Author

@Ark-kun This has been simplified per our discussion.

Verified locally in Prometheus ✅

Changes:

  • Adds histogram measurement for execution node status duration without adding additional database load to the system
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from 83a6328 to 34dc0a9 Compare April 14, 2026 20:58
prev_time = datetime.datetime.fromisoformat(prev["first_observed_at"])
curr_time = datetime.datetime.fromisoformat(curr["first_observed_at"])
try:
    metrics.execution_status_transition_duration.record(
Collaborator Author


OTel is built to never throw a runtime exception, but in case it ever did, I don't want that to result in a rollback of the commit - so I've wrapped this with a try/except.

def _handle_container_execution_status_set(
    execution: backend_types_sql.ExecutionNode,
    value: typing.Any,
    _old_value: typing.Any,
Contributor

@Ark-kun Ark-kun Apr 17, 2026


Do we need to check whether the new value is different from the old value?
(If SQLAlchemy already checks that and does not fire the event when the value is the same, then no need to change anything.)

@sql_event.listens_for(backend_types_sql.ExecutionNode.container_execution_status, "set")
def _handle_container_execution_status_set(
    execution: backend_types_sql.ExecutionNode,
    value: typing.Any,
Contributor


Maybe let's add type for value?

Comment on lines +21 to +43
@sql_event.listens_for(backend_types_sql.ExecutionNode.container_execution_status, "set")
def _handle_container_execution_status_set(
    execution: backend_types_sql.ExecutionNode,
    value: typing.Any,
    _old_value: typing.Any,
    _initiator: typing.Any,
) -> None:
    if value is None:
        return
    if execution.extra_data is None:
        execution.extra_data = {}
    history: list = execution.extra_data.get(
        backend_types_sql.EXECUTION_NODE_EXTRA_DATA_STATUS_HISTORY_KEY, []
    )
    entry = {
        "status": value.value,
        "first_observed_at": datetime.datetime.now(datetime.timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }
    execution.extra_data = {
        **execution.extra_data,
        backend_types_sql.EXECUTION_NODE_EXTRA_DATA_STATUS_HISTORY_KEY: history + [entry],
    }
Contributor


Let's move this whole function to the orchestrator module. I think maintaining the status history becomes part of its job, so it belongs there. I'm kind of on the fence here, but modifications to ExecutionNode are probably the jurisdiction of the orchestrator.
This will also solve the issue of wiring it up automatically without relying on an import.

And _handle_before_commit function can go to some instrumentation module.

Contributor

@Ark-kun Ark-kun left a comment


Thank you. Approved.

But let's split the sql_event_listeners module (simple).
And let's check that we do not add history entries for duplicate status.
