
Fix start_date not restored for rescheduled tasks when scheduler queu…#64816

Open
peachchen0716 wants to merge 6 commits into apache:main from peachchen0716:fix/reschedule-start-date-restoration

Conversation

Contributor

@peachchen0716 peachchen0716 commented Apr 6, 2026

Problem

When a sensor runs in reschedule mode, the supervisor sends start_date=utcnow() on every poke. The ti_run execution API endpoint applied this value unconditionally, resetting start_date on each re-poke. This inflated the dagrun.first_task_scheduling_delay metric by including reschedule wait time.

A guard already existed for deferred tasks (if ti.next_kwargs: data.pop("start_date")), but no equivalent existed for rescheduled tasks.

Fix

Added a reschedule guard in ti_run (execution_api/routes/task_instances.py): when start_date is present in the update payload and the task has prior TaskReschedule records, the original start_date from the first reschedule entry is restored instead of accepting the supervisor's utcnow() value.

Also fixed _check_and_change_state_before_execution (used in test utilities only) to preserve start_date for rescheduled tasks the same way.
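The guard's logic can be sketched with plain dictionaries (illustrative only — the function name `restore_start_date` and the record shapes are stand-ins, not Airflow's actual API; the real code operates on the `ti_run` payload and `TaskReschedule` ORM rows):

```python
from datetime import datetime, timezone


def restore_start_date(payload: dict, reschedules: list[dict]) -> dict:
    """Sketch of the ti_run guard: if the task has prior TaskReschedule
    records, keep the start_date of the first one instead of the
    supervisor-supplied utcnow() value."""
    if "start_date" in payload and reschedules:
        # The first reschedule entry holds the original first-poke start_date.
        payload = {**payload, "start_date": reschedules[0]["start_date"]}
    return payload


# Second poke: the supervisor sends a fresh utcnow(), but one reschedule
# record already exists, so the original start_date is restored.
first_poke = datetime(2025, 7, 14, 10, 0, 0, tzinfo=timezone.utc)
second_poke = datetime(2025, 7, 14, 10, 0, 21, tzinfo=timezone.utc)
updated = restore_start_date(
    {"state": "running", "start_date": second_poke},
    [{"start_date": first_poke}],
)
print(updated["start_date"])  # → first_poke, not second_poke
```

With no reschedule records the payload passes through unchanged, mirroring how non-sensor tasks are unaffected.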

Testing

  • test_ti_run_restores_start_date_for_rescheduled_task — verifies the production path (ti_run) restores start_date from TaskReschedule on a subsequent poke

Breeze verification

Triggered verify_reschedule_start_date DAG (reschedule-mode PythonSensor, poke every 10 s) and observed ti.start_date across three pokes:

| Event | Before fix | After fix |
|---|---|---|
| Poke 1 | 2025-07-14T10:00:00Z | 2025-07-14T10:00:00Z |
| Poke 2 (+21 s) | 2025-07-14T10:00:21Z | 2025-07-14T10:00:00Z |
| Poke 3 (+42 s) | 2025-07-14T10:00:42Z | 2025-07-14T10:00:00Z |

Before the fix, start_date drifted on every poke; after the fix, it stays at the first-poke value.
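The inflation is easy to see with rough arithmetic (illustrative only — the metric is roughly "earliest task start_date minus the moment the run became eligible"; Airflow's exact formula differs, and `run_eligible_at` here is a made-up timestamp):

```python
from datetime import datetime, timezone

# Hypothetical moment the dagrun became eligible to start.
run_eligible_at = datetime(2025, 7, 14, 9, 59, 58, tzinfo=timezone.utc)

before_fix_start = datetime(2025, 7, 14, 10, 0, 42, tzinfo=timezone.utc)  # poke 3 drift
after_fix_start = datetime(2025, 7, 14, 10, 0, 0, tzinfo=timezone.utc)    # first poke

print((before_fix_start - run_eligible_at).total_seconds())  # 44.0 s — inflated by waits
print((after_fix_start - run_eligible_at).total_seconds())   # 2.0 s — true delay
```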


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Sonnet 4.6

Generated-by: Claude Sonnet 4.6 following the guidelines

peachchen0716 pushed a commit to peachchen0716/airflow that referenced this pull request Apr 6, 2026
Arthur Chen added 2 commits April 6, 2026 20:56
…es them

In _check_and_change_state_before_execution, the code that restores
TaskInstance.start_date to the original first-poke time was gated on
ti.state == UP_FOR_RESCHEDULE. In the normal scheduler flow the scheduler
advances state to QUEUED before the worker picks up the task, so
ti.refresh_from_db() returns QUEUED and the guard never fires. This causes
start_date to be reset to utcnow() on every re-execution, inflating the
dagrun.first_task_start_delay and dagrun.first_task_scheduling_delay metrics
by the full reschedule wait time.

Replace the state guard with an unconditional TaskReschedule lookup scoped
to the current try_number. The query returns None for non-rescheduled tasks
so behavior is unchanged in the normal case; for rescheduled tasks it
correctly restores start_date from the first poke regardless of whether
state is UP_FOR_RESCHEDULE or QUEUED at execution time.
The supervisor always sends start_date=utcnow() when calling ti_run to
mark a task as RUNNING. For sensors in reschedule mode this overwrote
the original start_date on every re-poke, inflating the
dagrun.first_task_scheduling_delay metric by the full reschedule wait.

The fix mirrors the existing deferral guard (next_kwargs): if a
TaskReschedule record exists for the TI, restore start_date from the
first record instead of accepting the supervisor's utcnow() value.

Also fix the newsfragment which referenced a non-existent metric name
(dagrun.first_task_start_delay) — the real metric is
dagrun.first_task_scheduling_delay.

Verified in Breeze: start_date stayed fixed across all reschedule pokes,
confirmed stable through to SUCCESS.
peachchen0716 force-pushed the fix/reschedule-start-date-restoration branch from 881928c to 868f047 on April 7, 2026 at 04:23