cache connections in OpenLineage SQL hook lineage#64843
Draft
mobuchowski wants to merge 3 commits intoapache:mainfrom
Draft
cache connections in OpenLineage SQL hook lineage#64843mobuchowski wants to merge 3 commits intoapache:mainfrom
mobuchowski wants to merge 3 commits intoapache:mainfrom
Conversation
…tras emit_lineage_from_sql_extras called hook.get_connection() twice per SQL extra: once in _resolve_namespace and once inside get_openlineage_facets_with_sql. For N extras from the same hook this is N*2 round-trips (SecretsManager miss + Airflow API server hit each time) all returning the same object. Add _conn_cache dict keyed by conn_id so the connection, database_info, and namespace are resolved exactly once per unique conn_id. Introduce _resolve_connection_info() that returns all three from a single get_connection() call. Add optional connection/database_info params to get_openlineage_facets_with_sql so callers can pass pre-fetched values and skip the internal lookup entirely. Also fix two related issues found while investigating: - extractors/manager.py: wrap get_hook_lineage() in try/except in the no-extractor path. An unhandled exception from emit_lineage_from_sql_extras -> _create_ol_event_pair propagated to @print_warning on on_success(), silently suppressing the task-level COMPLETE event. - plugins/listener.py: call logging.shutdown() before os._exit(0) in the fork child. os._exit bypasses Python's stdio flush so any buffered log messages at exit (including failure warnings) were silently dropped, making fork failures invisible in task logs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Maciej Obuchowski <maciej.obuchowski@datadoghq.com>
Replace the _conn_cache dict and _resolve_connection_info tuple helper with three @functools.cache-decorated local functions keyed by conn_id: _get_connection - one hook.get_connection() call per conn_id _get_database_info - derived from _get_connection, cached separately _get_namespace - derived from _get_database_info, cached separately Each concern is cached independently. No tuple packing/unpacking. _resolve_connection_info removed; _resolve_namespace restored to a simple single-lookup implementation for external callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Maciej Obuchowski <maciej.obuchowski@datadoghq.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Maciej Obuchowski <maciej.obuchowski@datadoghq.com>
kacpermuda
reviewed
Apr 7, 2026
| # No extractor found — fall back to hook lineage. This call must be wrapped in | ||
| # try/except: it runs emit_lineage_from_sql_extras → _create_ol_event_pair which | ||
| # is not guarded internally. An uncaught exception here would propagate into | ||
| # on_success()'s @print_warning, silently suppressing the task-level COMPLETE event. |
Collaborator
There was a problem hiding this comment.
This path is not only for COMPLETE events, so we should adjust comments and logging.
| @@ -70,6 +71,27 @@ def emit_lineage_from_sql_extras(task_instance, sql_extras: list, is_successful: | |||
| events: list[RunEvent] = [] | |||
| query_count = 0 | |||
|
|
|||
| # Build conn_id -> hook mapping before iterating. Hook instances are not hashable so | |||
| # conn_id (a plain string) is used as the @cache key throughout. | |||
| _hook_by_conn_id = {_get_hook_conn_id(e.context): e.context for e in sql_extras} | |||
Collaborator
There was a problem hiding this comment.
Is it enough to reduce hooks to using conn_id as key here? Maybe id(hook)? Wondering if it's possible for two hooks to send HLL, with the same db connection, but different params resulting in different database info ?
| # logging so buffered records (including any warnings above) are flushed | ||
| # before the process exits. Without this, the final log lines are silently | ||
| # dropped, making failures invisible. | ||
| logging.shutdown() |
Collaborator
There was a problem hiding this comment.
I am not sure if there are any implications of doing that in a provider, for other logs flowing in airflow. Can you double check that it's isolated here?
Contributor
Author
There was a problem hiding this comment.
We're on our own process there, but will doublecheck.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
emit_lineage_from_sql_extrascalledhook.get_connection()twice per SQL extra — once in_resolve_namespaceand once insideget_openlineage_facets_with_sql. For N extras from the same hook (common when a task runs multiple SQL statements) this is N×2 redundant round-trips that all return the same connection object. Each call hits SecretsManager (miss) then the Airflow API server, making it the dominant cost of SQL hook lineage processing.Fix: build a
conn_id → hookmapping before the loop, then define three@functools.cache-decorated local closures keyed byconn_id:_get_connection— onehook.get_connection()call per uniqueconn_id_get_database_info— derived from_get_connection, cached separately_get_namespace— derived from_get_database_info, cached separatelyEach concern is cached independently via
functools.cache; no manual dict, no tuple packing.hook.get_connection()fires exactly once per uniqueconn_idregardless of how many SQL extras share it.Also includes two related fixes:
extractors/manager.py: wrapget_hook_lineage()in try/except so an exception in SQL extras processing can't silently suppress the task-level COMPLETE event via@print_warningplugins/listener.py: calllogging.shutdown()beforeos._exit(0)in the fork child so buffered log messages (including failure warnings) are flushed before exitWas generative AI tooling used to co-author this PR?
Generated-by: Claude Code following the guidelines