[Feature] Python connector: support pre-signed URL refresh for long-running queries #941

@pra91

Summary

The Python Delta Sharing connector currently has no mechanism to refresh
pre-signed URLs that expire mid-query. Snapshot reads of large tables that
take longer than the URL TTL (typically 1 hour) fail with HTTP 400 /
403 errors when the reader reaches a file whose URL has expired.

The Spark connector solves this with a PreSignedUrlCache plus a background
refresh thread driven by the refreshToken returned in the
endStreamAction (see #383, #69). The Python connector implements
neither half: it does not parse expirationTimestamp from add actions
and it does not parse the endStreamAction / refreshToken at all.

I'd like to add equivalent support to the Python connector. Filing this
issue per CONTRIBUTING.md since the change is >100 LOC and alters
user-visible behaviour; I'd appreciate maintainer guidance on the
preferred shape before I open a PR.

Motivation

  • Real-world shares can have tens of thousands of files; a single
    load_as_pandas() call against such a table reliably exceeds the 1h
    pre-signed URL TTL on commodity hardware / network.
  • The PROTOCOL.md already documents expirationTimestamp (per file) and
    endStreamAction with refreshToken and minUrlExpirationTimestamp,
    so this is purely a client-side gap.
  • Recipients today have to chunk reads themselves with limit= /
    start_at= (which is itself incomplete; see #114) or accept periodic
    failures on long reads.

Today's behaviour (gap analysis)

  • delta_sharing/protocol.py::AddFile.from_json does not extract
    expirationTimestamp.
  • delta_sharing/rest_client.py::list_files_in_table does not look for
    an endStreamAction line and discards refreshToken /
    minUrlExpirationTimestamp.
  • delta_sharing/reader.py:
    • The legacy parquet path _to_pandas reads each action.url directly
      via fsspec / pyarrow with no awareness of expiry.
    • The kernel path __to_pandas_kernel writes response.lines to a
      temp delta log and hands it to delta-kernel-rust-sharing-wrapper,
      which has no way to ask the client for fresh URLs.

So both read paths can fail, and the legacy path will fail
deterministically on any read whose wall-clock time exceeds the TTL.

Proposed approach

I'd like to mirror the Spark design at a high level, adapted to Python:

1. Protocol parsing (small, mechanical)

  • Add expiration_timestamp: Optional[int] = None to FileAction /
    AddFile and parse it in from_json.
  • Add a new EndStreamAction dataclass (refresh_token,
    next_page_token, min_url_expiration_timestamp) and parse the
    trailing endStreamAction line in list_files_in_table /
    list_table_changes.
  • Extend ListFilesInTableResponse with
    refresh_token: Optional[str] and
    min_url_expiration_timestamp: Optional[int].

This part is a no-op for callers and required regardless of what we do
about refresh, so it could ship as a standalone PR if preferred.
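To make the parsing half concrete, here is a minimal sketch of the new
EndStreamAction dataclass and a helper that splits the trailing
endStreamAction line off a streamed NDJSON response. Field and function
names are my proposal, not existing delta_sharing API; the camelCase keys
match what PROTOCOL.md documents.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndStreamAction:
    """Trailing action on a streamed NDJSON response (proposed, not shipped)."""
    refresh_token: Optional[str] = None
    next_page_token: Optional[str] = None
    min_url_expiration_timestamp: Optional[int] = None

    @staticmethod
    def from_json(obj) -> "EndStreamAction":
        if isinstance(obj, (str, bytes)):
            obj = json.loads(obj)
        return EndStreamAction(
            refresh_token=obj.get("refreshToken"),
            next_page_token=obj.get("nextPageToken"),
            min_url_expiration_timestamp=obj.get("minUrlExpirationTimestamp"),
        )


def split_end_stream_action(lines):
    """Separate a trailing endStreamAction line (if any) from other actions."""
    actions, end = [], None
    for line in lines:
        parsed = json.loads(line)
        if "endStreamAction" in parsed:
            end = EndStreamAction.from_json(parsed["endStreamAction"])
        else:
            actions.append(parsed)
    return actions, end
```

list_files_in_table would call something like split_end_stream_action on
the response body and stash the result on ListFilesInTableResponse.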

2. URL cache + background refresh

  • New module delta_sharing/url_cache.py with a CachedTable that
    holds the most recent add_files indexed by id, the
    refresh_token, and the next refresh deadline.
  • A module-level CachedTableManager (singleton) running a daemon
    thread that wakes every check_interval_seconds (default 60s),
    iterates registered tables, and re-issues list_files_in_table with
    the stored refresh_token when min_url_expiration_timestamp is
    within refresh_threshold_seconds (default 600s) of now().
  • Same identity guarantees as the Spark side: refreshed responses are
    matched to existing entries by file id, and any file whose id is
    not present in the refresh response keeps its old URL (we don't
    invent files, we don't drop in-flight ones).
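A sketch of the cache and the refresh pass, with the clock and the rest
client injectable so the loop can be driven synchronously in tests. All
names here (CachedTable, CachedTableManager, tick, RefreshResult) are
illustrative; the daemon thread would simply call tick() every
check_interval_seconds.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RefreshResult:
    """What a refreshed list_files_in_table call would yield (sketch)."""
    urls: Dict[str, str]              # file id -> fresh pre-signed URL
    refresh_token: Optional[str]
    min_url_expiration_ms: int


@dataclass
class CachedTable:
    urls: Dict[str, str]              # file id -> current pre-signed URL
    refresh_token: Optional[str]
    min_url_expiration_ms: int        # earliest expiry across cached URLs
    # Stand-in for rest_client.list_files_in_table(..., refresh_token=...)
    refresher: Callable[[Optional[str]], RefreshResult]


class CachedTableManager:
    def __init__(self, refresh_threshold_s: int = 600, clock=time.time):
        self._tables: Dict[str, CachedTable] = {}
        self._threshold_ms = refresh_threshold_s * 1000
        self._clock = clock               # injectable for tests

    def register(self, key: str, table: CachedTable) -> None:
        self._tables[key] = table

    def get_url(self, key: str, file_id: str) -> str:
        return self._tables[key].urls[file_id]

    def tick(self) -> None:
        """One pass of the (eventual) daemon loop: refresh tables near expiry."""
        now_ms = int(self._clock() * 1000)
        for table in self._tables.values():
            if table.min_url_expiration_ms - now_ms > self._threshold_ms:
                continue  # not within the refresh threshold yet
            result = table.refresher(table.refresh_token)
            # Match by file id; ids absent from the response keep old URLs.
            for file_id, url in result.urls.items():
                if file_id in table.urls:
                    table.urls[file_id] = url
            table.refresh_token = result.refresh_token
            table.min_url_expiration_ms = result.min_url_expiration_ms
```

The match-by-id loop encodes the identity guarantee above: no files are
invented, and in-flight files whose id is missing from the refresh
response are left untouched.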

3. Read-path integration

  • Legacy path (_to_pandas): replace direct action.url access with
    a small _resolve_url(action) helper that returns the live URL from
    the cache (falling back to action.url when refresh isn't enabled or
    not supported by the server).

  • Kernel path (__to_pandas_kernel): trickier because the wrapper
    consumes raw JSON lines. Two options I'd like guidance on:

    • (a) Re-materialise the temp delta log on a refresh tick (cheap,
      works today, but only helps queries that haven't yet started the
      Rust scan).
    • (b) Add a callback/refresh hook to
      delta-kernel-rust-sharing-wrapper so the Rust side can ask Python
      for the latest URL by id. Cleaner, but cross-repo and requires a
      wrapper API change.

    I'd start with (a) for parity with the legacy path and file a
    follow-up for (b).

4. Configuration

Three knobs, mirroring the duckdb/Spark naming where reasonable:

  • delta_sharing.url_refresh.enabled (env var
    DELTA_SHARING_URL_REFRESH_ENABLED), default True.
  • delta_sharing.url_refresh.threshold_seconds, default 600.
  • delta_sharing.url_refresh.check_interval_seconds, default 60.

Configurable via either env vars or a small configure_url_refresh()
helper on the public API.
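A sketch of the config surface. Only DELTA_SHARING_URL_REFRESH_ENABLED is
named above; the other two env var names are my guesses at the obvious
spelling and are up for bikeshedding.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRefreshConfig:
    """Proposed knobs with the defaults from this issue."""
    enabled: bool = True
    threshold_seconds: int = 600
    check_interval_seconds: int = 60


def config_from_env(env=os.environ) -> UrlRefreshConfig:
    """Build the config from env vars, falling back to the defaults.
    (Env var names beyond *_ENABLED are assumed, not agreed.)"""
    return UrlRefreshConfig(
        enabled=env.get("DELTA_SHARING_URL_REFRESH_ENABLED", "true").lower()
        not in ("0", "false", "no"),
        threshold_seconds=int(
            env.get("DELTA_SHARING_URL_REFRESH_THRESHOLD_SECONDS", "600")),
        check_interval_seconds=int(
            env.get("DELTA_SHARING_URL_REFRESH_CHECK_INTERVAL_SECONDS", "60")),
    )
```

configure_url_refresh() would just replace a module-level
UrlRefreshConfig instance with explicit values.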

5. Tests

  • Unit tests for EndStreamAction.from_json and the new AddFile
    field.
  • Unit tests for CachedTableManager with a fake clock and a fake
    rest client (no network, no real threads — drive the loop manually).
  • An opt-in integration test against the public reference share that
    artificially shortens the refresh threshold so the refresh path is
    exercised in a sub-minute test run.
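The fake-clock pattern the unit tests would lean on, sketched with a
trivial due_for_refresh predicate (hypothetical name) so the timing logic
is testable without threads or sleeps:

```python
class FakeClock:
    """Controllable time source for driving the refresh loop in tests."""

    def __init__(self, start: float = 0.0):
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def due_for_refresh(min_url_expiration_ms: int, threshold_s: int, clock) -> bool:
    """True when the earliest URL expiry is within threshold_s of 'now'."""
    return min_url_expiration_ms - int(clock() * 1000) <= threshold_s * 1000
```

Tests inject the FakeClock where the manager would otherwise use
time.time, advance it past the threshold, and call the loop body once.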

Scope / non-goals

  • Not changing the Rust wrapper in this issue (see option (a) above).
  • Not implementing pagination (nextPageToken) here — happy to do it
    in a follow-up since it shares plumbing with the
    endStreamAction parser.
  • Not addressing the historical-version refresh edge case described
    in #383; the same caveat applies as for the Spark connector.

Open questions for maintainers

  1. Are you happy with this overall shape, or would you prefer the
    refresh logic to live behind an explicit opt-in (e.g. a flag on
    SharingClient) rather than on-by-default?
  2. Preference between a single PR (~500–700 LOC incl. tests) vs.
    splitting into (a) protocol parsing, (b) cache + legacy path,
    (c) kernel path?
  3. Any objection to introducing a daemon thread inside the library, or
    would you rather the refresh be driven lazily on each file open?
    (Lazy is simpler but loses the "refresh before expiry" property and
    adds latency to the first read after expiry.)
  4. Is the delta_sharing.url_refresh.* config namespace acceptable, or
    should I put the knobs on DeltaSharingProfile instead?

Happy to take this on if there's appetite. Marking as a draft proposal
until I hear back.


Related: #383, #69, #114.
