Skip to content

cache connections in OpenLineage SQL hook lineage#64843

Draft
mobuchowski wants to merge 3 commits intoapache:mainfrom
mobuchowski:mobuchowski/cache-connection
Draft

cache connections in OpenLineage SQL hook lineage#64843
mobuchowski wants to merge 3 commits intoapache:mainfrom
mobuchowski:mobuchowski/cache-connection

Conversation

@mobuchowski
Copy link
Copy Markdown
Contributor

@mobuchowski mobuchowski commented Apr 7, 2026

emit_lineage_from_sql_extras called hook.get_connection() twice per SQL extra — once in _resolve_namespace and once inside get_openlineage_facets_with_sql. For N extras from the same hook (common when a task runs multiple SQL statements) this is N×2 redundant round-trips that all return the same connection object. Each call hits SecretsManager (miss) then the Airflow API server, making it the dominant cost of SQL hook lineage processing.

Fix: build a conn_id → hook mapping before the loop, then define three @functools.cache-decorated local closures keyed by conn_id:

  • _get_connection — one hook.get_connection() call per unique conn_id
  • _get_database_info — derived from _get_connection, cached separately
  • _get_namespace — derived from _get_database_info, cached separately

Each concern is cached independently via functools.cache; no manual dict, no tuple packing. hook.get_connection() fires exactly once per unique conn_id regardless of how many SQL extras share it.

Also includes two related fixes:

  • extractors/manager.py: wrap get_hook_lineage() in try/except so an exception in SQL extras processing can't silently suppress the task-level COMPLETE event via @print_warning
  • plugins/listener.py: call logging.shutdown() before os._exit(0) in the fork child so buffered log messages (including failure warnings) are flushed before exit
Was generative AI tooling used to co-author this PR?
  • Yes - Claude Code

Generated-by: Claude Code following the guidelines

mobuchowski and others added 2 commits April 7, 2026 16:22
…tras

emit_lineage_from_sql_extras called hook.get_connection() twice per SQL
extra: once in _resolve_namespace and once inside
get_openlineage_facets_with_sql. For N extras from the same hook this
is N*2 round-trips (SecretsManager miss + Airflow API server hit each
time) all returning the same object.

Add _conn_cache dict keyed by conn_id so the connection, database_info,
and namespace are resolved exactly once per unique conn_id. Introduce
_resolve_connection_info() that returns all three from a single
get_connection() call. Add optional connection/database_info params to
get_openlineage_facets_with_sql so callers can pass pre-fetched values
and skip the internal lookup entirely.

Also fix two related issues found while investigating:

- extractors/manager.py: wrap get_hook_lineage() in try/except in the
  no-extractor path. An unhandled exception from emit_lineage_from_sql_extras
  -> _create_ol_event_pair propagated to @print_warning on on_success(),
  silently suppressing the task-level COMPLETE event.

- plugins/listener.py: call logging.shutdown() before os._exit(0) in the
  fork child. os._exit bypasses Python's stdio flush so any buffered log
  messages at exit (including failure warnings) were silently dropped,
  making fork failures invisible in task logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Maciej Obuchowski <maciej.obuchowski@datadoghq.com>
Replace the _conn_cache dict and _resolve_connection_info tuple helper
with three @functools.cache-decorated local functions keyed by conn_id:

  _get_connection     - one hook.get_connection() call per conn_id
  _get_database_info  - derived from _get_connection, cached separately
  _get_namespace      - derived from _get_database_info, cached separately

Each concern is cached independently. No tuple packing/unpacking.
_resolve_connection_info removed; _resolve_namespace restored to a
simple single-lookup implementation for external callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Maciej Obuchowski <maciej.obuchowski@datadoghq.com>
@mobuchowski mobuchowski changed the title Mobuchowski/cache connection cache connections in OpenLineage SQL hook lineage Apr 7, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Maciej Obuchowski <maciej.obuchowski@datadoghq.com>
# No extractor found — fall back to hook lineage. This call must be wrapped in
# try/except: it runs emit_lineage_from_sql_extras → _create_ol_event_pair which
# is not guarded internally. An uncaught exception here would propagate into
# on_success()'s @print_warning, silently suppressing the task-level COMPLETE event.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path is not only for COMPLETE events, so we should adjust comments and logging.

@@ -70,6 +71,27 @@ def emit_lineage_from_sql_extras(task_instance, sql_extras: list, is_successful:
events: list[RunEvent] = []
query_count = 0

# Build conn_id -> hook mapping before iterating. Hook instances are not hashable so
# conn_id (a plain string) is used as the @cache key throughout.
_hook_by_conn_id = {_get_hook_conn_id(e.context): e.context for e in sql_extras}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it enough to reduce hooks to using conn_id as key here? Maybe id(hook)? Wondering if it's possible for two hooks to send HLL, with the same db connection, but different params resulting in different database info ?

# logging so buffered records (including any warnings above) are flushed
# before the process exits. Without this, the final log lines are silently
# dropped, making failures invisible.
logging.shutdown()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if there are any implications of doing that in a provider, for other logs flowing in airflow. Can you double check that it's isolated here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're on our own process there, but will doublecheck.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants