[Feature] Python connector: support pre-signed URL refresh for long-running queries #941

@pra91

Summary

The Python Delta Sharing connector currently has no mechanism to refresh
pre-signed URLs that expire mid-query. Snapshot reads of large tables that
take longer than the URL TTL (typically 1 hour) fail with HTTP 400 /
403 errors when the reader reaches a file whose URL has expired.

The Spark connector solves this with a PreSignedUrlCache plus a background
refresh thread driven by the refreshToken returned in the
endStreamAction (see #383, #69). The Python connector implements
neither half: it does not parse expirationTimestamp from add actions
and it does not parse the endStreamAction / refreshToken at all.

I'd like to add equivalent support to the Python connector. Filing this
issue per CONTRIBUTING.md since the change is >100 LOC and alters
user-visible behaviour; I'd appreciate maintainer guidance on the
preferred shape before I open a PR.

Motivation

  • Real-world shares can have tens of thousands of files; a single
    load_as_pandas() call against such a table reliably exceeds the 1h
    pre-signed URL TTL on commodity hardware / network.
  • The PROTOCOL.md already documents expirationTimestamp (per file) and
    endStreamAction with refreshToken and minUrlExpirationTimestamp,
    so this is purely a client-side gap.
  • Recipients today have to chunk reads themselves with limit= /
    start_at= (which is itself incomplete; see #114) or accept periodic
    failures on long reads.

Today's behaviour (gap analysis)

  • delta_sharing/protocol.py::AddFile.from_json does not extract
    expirationTimestamp.
  • delta_sharing/rest_client.py::list_files_in_table does not look for
    an endStreamAction line and discards refreshToken /
    minUrlExpirationTimestamp.
  • delta_sharing/reader.py:
    • The legacy parquet path _to_pandas reads each action.url directly
      via fsspec / pyarrow with no awareness of expiry.
    • The kernel path __to_pandas_kernel writes response.lines to a
      temp delta log and hands it to delta-kernel-rust-sharing-wrapper,
      which has no way to ask the client for fresh URLs.

So both read paths can fail, and the legacy path will fail
deterministically on any read whose wall-clock time exceeds the TTL.

Proposed approach

I'd like to mirror the Spark design at a high level, adapted to Python:

1. Protocol parsing (small, mechanical)

  • Add expiration_timestamp: Optional[int] = None to FileAction /
    AddFile and parse it in from_json.
  • Add a new EndStreamAction dataclass (refresh_token,
    next_page_token, min_url_expiration_timestamp) and parse the
    trailing endStreamAction line in list_files_in_table /
    list_table_changes.
  • Extend ListFilesInTableResponse with
    refresh_token: Optional[str] and
    min_url_expiration_timestamp: Optional[int].

This part is a no-op for callers and required regardless of what we do
about refresh, so it could ship as a standalone PR if preferred.
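To make the parsing half concrete, here is a minimal sketch of the new
EndStreamAction dataclass and a helper that splits the trailing
endStreamAction line off a streamed NDJSON response. Field and function
names are my proposal, not existing delta_sharing API; the camelCase keys
match what PROTOCOL.md documents.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndStreamAction:
    """Trailing action on a streamed NDJSON response (proposed, not shipped)."""
    refresh_token: Optional[str] = None
    next_page_token: Optional[str] = None
    min_url_expiration_timestamp: Optional[int] = None

    @staticmethod
    def from_json(obj) -> "EndStreamAction":
        if isinstance(obj, (str, bytes)):
            obj = json.loads(obj)
        return EndStreamAction(
            refresh_token=obj.get("refreshToken"),
            next_page_token=obj.get("nextPageToken"),
            min_url_expiration_timestamp=obj.get("minUrlExpirationTimestamp"),
        )


def split_end_stream_action(lines):
    """Separate a trailing endStreamAction line (if any) from other actions."""
    actions, end = [], None
    for line in lines:
        parsed = json.loads(line)
        if "endStreamAction" in parsed:
            end = EndStreamAction.from_json(parsed["endStreamAction"])
        else:
            actions.append(parsed)
    return actions, end
```

list_files_in_table would call something like split_end_stream_action on
the response body and stash the result on ListFilesInTableResponse.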

2. URL cache + background refresh

  • New module delta_sharing/url_cache.py with a CachedTable that
    holds the most recent add_files indexed by id, the
    refresh_token, and the next refresh deadline.
  • A module-level CachedTableManager (singleton) running a daemon
    thread that wakes every check_interval_seconds (default 60s),
    iterates registered tables, and re-issues list_files_in_table with
    the stored refresh_token when min_url_expiration_timestamp is
    within refresh_threshold_seconds (default 600s) of now().
  • Same identity guarantees as the Spark side: refreshed responses are
    matched to existing entries by file id, and any file whose id is
    not present in the refresh response keeps its old URL (we don't
    invent files, we don't drop in-flight ones).
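A sketch of the cache and the refresh pass, with the clock and the rest
client injectable so the loop can be driven synchronously in tests. All
names here (CachedTable, CachedTableManager, tick, RefreshResult) are
illustrative; the daemon thread would simply call tick() every
check_interval_seconds.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RefreshResult:
    """What a refreshed list_files_in_table call would yield (sketch)."""
    urls: Dict[str, str]              # file id -> fresh pre-signed URL
    refresh_token: Optional[str]
    min_url_expiration_ms: int


@dataclass
class CachedTable:
    urls: Dict[str, str]              # file id -> current pre-signed URL
    refresh_token: Optional[str]
    min_url_expiration_ms: int        # earliest expiry across cached URLs
    # Stand-in for rest_client.list_files_in_table(..., refresh_token=...)
    refresher: Callable[[Optional[str]], RefreshResult]


class CachedTableManager:
    def __init__(self, refresh_threshold_s: int = 600, clock=time.time):
        self._tables: Dict[str, CachedTable] = {}
        self._threshold_ms = refresh_threshold_s * 1000
        self._clock = clock               # injectable for tests

    def register(self, key: str, table: CachedTable) -> None:
        self._tables[key] = table

    def get_url(self, key: str, file_id: str) -> str:
        return self._tables[key].urls[file_id]

    def tick(self) -> None:
        """One pass of the (eventual) daemon loop: refresh tables near expiry."""
        now_ms = int(self._clock() * 1000)
        for table in self._tables.values():
            if table.min_url_expiration_ms - now_ms > self._threshold_ms:
                continue  # not within the refresh threshold yet
            result = table.refresher(table.refresh_token)
            # Match by file id; ids absent from the response keep old URLs.
            for file_id, url in result.urls.items():
                if file_id in table.urls:
                    table.urls[file_id] = url
            table.refresh_token = result.refresh_token
            table.min_url_expiration_ms = result.min_url_expiration_ms
```

The match-by-id loop encodes the identity guarantee above: no files are
invented, and in-flight files whose id is missing from the refresh
response are left untouched.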

3. Read-path integration

  • Legacy path (_to_pandas): replace direct action.url access with
    a small _resolve_url(action) helper that returns the live URL from
    the cache (falling back to action.url when refresh isn't enabled or
    not supported by the server).

  • Kernel path (__to_pandas_kernel): trickier because the wrapper
    consumes raw JSON lines. Two options I'd like guidance on:

    • (a) Re-materialise the temp delta log on a refresh tick (cheap,
      works today, but only helps queries that haven't yet started the
      Rust scan).
    • (b) Add a callback/refresh hook to
      delta-kernel-rust-sharing-wrapper so the Rust side can ask Python
      for the latest URL by id. Cleaner, but cross-repo and requires a
      wrapper API change.

    I'd start with (a) for parity with the legacy path and file a
    follow-up for (b).

4. Configuration

Three knobs, mirroring the duckdb/Spark naming where reasonable:

  • delta_sharing.url_refresh.enabled (env var
    DELTA_SHARING_URL_REFRESH_ENABLED), default True.
  • delta_sharing.url_refresh.threshold_seconds, default 600.
  • delta_sharing.url_refresh.check_interval_seconds, default 60.

Configurable via either env vars or a small configure_url_refresh()
helper on the public API.
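A sketch of the config surface. Only DELTA_SHARING_URL_REFRESH_ENABLED is
named above; the other two env var names are my guesses at the obvious
spelling and are up for bikeshedding.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRefreshConfig:
    """Proposed knobs with the defaults from this issue."""
    enabled: bool = True
    threshold_seconds: int = 600
    check_interval_seconds: int = 60


def config_from_env(env=os.environ) -> UrlRefreshConfig:
    """Build the config from env vars, falling back to the defaults.
    (Env var names beyond *_ENABLED are assumed, not agreed.)"""
    return UrlRefreshConfig(
        enabled=env.get("DELTA_SHARING_URL_REFRESH_ENABLED", "true").lower()
        not in ("0", "false", "no"),
        threshold_seconds=int(
            env.get("DELTA_SHARING_URL_REFRESH_THRESHOLD_SECONDS", "600")),
        check_interval_seconds=int(
            env.get("DELTA_SHARING_URL_REFRESH_CHECK_INTERVAL_SECONDS", "60")),
    )
```

configure_url_refresh() would just replace a module-level
UrlRefreshConfig instance with explicit values.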

5. Tests

  • Unit tests for EndStreamAction.from_json and the new AddFile
    field.
  • Unit tests for CachedTableManager with a fake clock and a fake
    rest client (no network, no real threads — drive the loop manually).
  • An opt-in integration test against the public reference share that
    artificially shortens the refresh threshold so the refresh path is
    exercised in a sub-minute test run.
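The fake-clock pattern the unit tests would lean on, sketched with a
trivial due_for_refresh predicate (hypothetical name) so the timing logic
is testable without threads or sleeps:

```python
class FakeClock:
    """Controllable time source for driving the refresh loop in tests."""

    def __init__(self, start: float = 0.0):
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def due_for_refresh(min_url_expiration_ms: int, threshold_s: int, clock) -> bool:
    """True when the earliest URL expiry is within threshold_s of 'now'."""
    return min_url_expiration_ms - int(clock() * 1000) <= threshold_s * 1000
```

Tests inject the FakeClock where the manager would otherwise use
time.time, advance it past the threshold, and call the loop body once.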

Scope / non-goals

  • Not changing the Rust wrapper in this issue (see option (a) above).
  • Not implementing pagination (nextPageToken) here — happy to do it
    in a follow-up since it shares plumbing with the
    endStreamAction parser.
  • Not addressing the historical-version refresh edge case described
    in #383; the same caveat applies as for the Spark connector.

Open questions for maintainers

  1. Are you happy with this overall shape, or would you prefer the
    refresh logic to live behind an explicit opt-in (e.g. a flag on
    SharingClient) rather than on-by-default?
  2. Preference between a single PR (~500–700 LOC incl. tests) vs.
    splitting into (a) protocol parsing, (b) cache + legacy path,
    (c) kernel path?
  3. Any objection to introducing a daemon thread inside the library, or
    would you rather the refresh be driven lazily on each file open?
    (Lazy is simpler but loses the "refresh before expiry" property and
    adds latency to the first read after expiry.)
  4. Is the delta_sharing.url_refresh.* config namespace acceptable, or
    should I put the knobs on DeltaSharingProfile instead?

Happy to take this on if there's appetite. Marking as a draft proposal
until I hear back.


Related: #383, #69, #114.
