The Python Delta Sharing connector currently has no mechanism to refresh
pre-signed URLs that expire mid-query. Snapshot reads of large tables that
take longer than the URL TTL (typically 1 hour) fail with HTTP 400/403 errors when the reader reaches a file whose URL has expired.
The Spark connector solves this with a PreSignedUrlCache plus a background
refresh thread driven by the refreshToken returned in the endStreamAction (see #383, #69). The Python connector implements
neither half: it does not parse expirationTimestamp from add actions
and it does not parse the endStreamAction / refreshToken at all.
I'd like to add equivalent support to the Python connector. Filing this
issue per CONTRIBUTING.md since the change is >100 LOC and changes
user-visible behaviour, and would appreciate maintainer guidance on the
preferred shape before I open a PR.
Motivation
Real-world shares can have tens of thousands of files; a single load_as_pandas() call against such a table reliably exceeds the 1h
pre-signed URL TTL on commodity hardware and networks.
The PROTOCOL.md already documents expirationTimestamp (per file) and endStreamAction with refreshToken and minUrlExpirationTimestamp,
so this is purely a client-side gap.
Recipients today have to chunk reads themselves with limit= / start_at= (which itself is incomplete — see Pagination in load_as_pandas for large tables #114) or accept periodic
failures on long reads.
Today's behaviour (gap analysis)
delta_sharing/protocol.py::AddFile.from_json does not extract expirationTimestamp.
delta_sharing/rest_client.py::list_files_in_table does not look for
an endStreamAction line and discards refreshToken / minUrlExpirationTimestamp.
delta_sharing/reader.py:
The legacy parquet path _to_pandas reads each action.url directly
via fsspec / pyarrow with no awareness of expiry.
The kernel path __to_pandas_kernel writes response.lines to a
temp delta log and hands it to delta-kernel-rust-sharing-wrapper,
which has no way to ask the client for fresh URLs.
So both read paths can fail, and the legacy path fails deterministically on any read whose wall-clock duration exceeds the TTL.
Proposed approach
I'd like to mirror the Spark design at a high level, adapted to Python:
1. Protocol parsing (small, mechanical)
Add expiration_timestamp: Optional[int] = None to FileAction / AddFile and parse it in from_json.
Add a new EndStreamAction dataclass (refresh_token, next_page_token, min_url_expiration_timestamp) and parse the
trailing endStreamAction line in list_files_in_table / list_table_changes.
Extend ListFilesInTableResponse with refresh_token: Optional[str] and min_url_expiration_timestamp: Optional[int].
This part is a no-op for callers and required regardless of what we do
about refresh, so it could ship as a standalone PR if preferred.
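To make the parsing concrete, here is a rough sketch of the new dataclass and the response splitting (names follow this proposal; none of this exists in the connector today, and the helper name split_lines is illustrative):

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EndStreamAction:
    """Trailing endStreamAction line from a query response (proposed)."""

    refresh_token: Optional[str] = None
    next_page_token: Optional[str] = None
    min_url_expiration_timestamp: Optional[int] = None

    @staticmethod
    def from_json(obj: dict) -> "EndStreamAction":
        return EndStreamAction(
            refresh_token=obj.get("refreshToken"),
            next_page_token=obj.get("nextPageToken"),
            min_url_expiration_timestamp=obj.get("minUrlExpirationTimestamp"),
        )


def split_lines(lines: List[str]) -> Tuple[List[dict], Optional[EndStreamAction]]:
    """Separate file actions from the (optional) trailing endStreamAction."""
    actions: List[dict] = []
    end_action: Optional[EndStreamAction] = None
    for line in lines:
        parsed = json.loads(line)
        if "endStreamAction" in parsed:
            end_action = EndStreamAction.from_json(parsed["endStreamAction"])
        else:
            actions.append(parsed)
    return actions, end_action
```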
2. URL cache + background refresh
New module delta_sharing/url_cache.py with a CachedTable that
holds the most recent add_files indexed by id, the refresh_token, and the next refresh deadline.
A module-level CachedTableManager (singleton) running a daemon
thread that wakes every check_interval_seconds (default 60s),
iterates registered tables, and re-issues list_files_in_table with
the stored refresh_token when min_url_expiration_timestamp is
within refresh_threshold_seconds (default 600s) of now().
Same identity guarantees as the Spark side: refreshed responses are
matched to existing entries by file id, and any file whose id is
not present in the refresh response keeps its old URL (we don't
invent files, we don't drop in-flight ones).
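A minimal sketch of the cache and manager described above, with the clock and the refresh call injected so the loop can be driven manually in tests. Class, method, and parameter names are this proposal's, not existing API:

```python
import threading
import time
from typing import Callable, Dict, Optional


class CachedTable:
    """Most recent pre-signed URLs for one registered table (proposed)."""

    def __init__(self, urls_by_id: Dict[str, str],
                 refresh_token: Optional[str],
                 min_url_expiration_ts: Optional[int]):
        self.urls_by_id = urls_by_id
        self.refresh_token = refresh_token
        self.min_url_expiration_ts = min_url_expiration_ts  # epoch millis


class CachedTableManager:
    """Refreshes registered tables whose URLs are close to expiry."""

    def __init__(self, refresh_threshold_seconds: int = 600,
                 check_interval_seconds: int = 60,
                 clock: Callable[[], float] = time.time):
        self._tables: Dict[str, CachedTable] = {}
        self._refresh: Dict[str, Callable[[Optional[str]], CachedTable]] = {}
        self._threshold_ms = refresh_threshold_seconds * 1000
        self._interval = check_interval_seconds
        self._clock = clock
        self._lock = threading.Lock()

    def register(self, key: str, table: CachedTable,
                 refresher: Callable[[Optional[str]], CachedTable]) -> None:
        with self._lock:
            self._tables[key] = table
            self._refresh[key] = refresher

    def get_url(self, key: str, file_id: str) -> Optional[str]:
        with self._lock:
            table = self._tables.get(key)
            return table.urls_by_id.get(file_id) if table else None

    def tick(self) -> None:
        """One pass of the refresh loop (the daemon thread calls this)."""
        now_ms = self._clock() * 1000
        with self._lock:
            for key, table in list(self._tables.items()):
                exp = table.min_url_expiration_ts
                if exp is None or exp - now_ms >= self._threshold_ms:
                    continue
                fresh = self._refresh[key](table.refresh_token)
                # Identity guarantee: keep old URLs for ids the refresh
                # response did not return; never invent or drop files.
                merged = dict(table.urls_by_id)
                merged.update(fresh.urls_by_id)
                fresh.urls_by_id = merged
                self._tables[key] = fresh

    def start(self) -> threading.Thread:
        def loop() -> None:
            while True:
                time.sleep(self._interval)
                self.tick()

        thread = threading.Thread(target=loop, daemon=True,
                                  name="delta-sharing-url-refresh")
        thread.start()
        return thread
```

Injecting clock and refresher is what makes the "no network, no real threads" unit tests in section 5 possible.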
3. Read-path integration
Legacy path (_to_pandas): replace direct action.url access with
a small _resolve_url(action) helper that returns the live URL from
the cache (falling back to action.url when refresh isn't enabled or
not supported by the server).
Kernel path (__to_pandas_kernel): trickier because the wrapper
consumes raw JSON lines. Two options I'd like guidance on:
(a) Re-materialise the temp delta log on a refresh tick (cheap,
works today, but only helps queries that haven't yet started the
Rust scan).
(b) Add a callback/refresh hook to delta-kernel-rust-sharing-wrapper so the Rust side can ask Python
for the latest URL by id. Cleaner, but cross-repo and requires a
wrapper API change.
I'd start with (a) for parity with the legacy path and file a
follow-up for (b).
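Option (a) amounts to rewriting the staged JSON lines with the cache's latest URLs before handing them to the wrapper. A sketch, assuming the wire format nests file actions under a file key with id/url fields as documented in PROTOCOL.md (the function name is illustrative):

```python
import json
from typing import Dict, List


def rematerialize_lines(lines: List[str], urls_by_id: Dict[str, str]) -> List[str]:
    """Swap fresh pre-signed URLs into raw response lines, matched by file id."""
    out: List[str] = []
    for line in lines:
        parsed = json.loads(line)
        file_action = parsed.get("file")
        if isinstance(file_action, dict) and file_action.get("id") in urls_by_id:
            file_action["url"] = urls_by_id[file_action["id"]]
        # Non-file lines (protocol, metaData, ...) pass through untouched.
        out.append(json.dumps(parsed))
    return out
```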
4. Configuration
Three knobs, mirroring the duckdb/Spark naming where reasonable:
delta_sharing.url_refresh.enabled (env var DELTA_SHARING_URL_REFRESH_ENABLED), default True.
delta_sharing.url_refresh.threshold_seconds, default 600.
delta_sharing.url_refresh.check_interval_seconds, default 60.
Configurable via either env vars or a small configure_url_refresh()
helper on the public API.
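For illustration, the layering (defaults, then env vars, then the programmatic helper) could look roughly like this; the function names are the proposed ones and nothing here exists in the library yet:

```python
import os
from typing import Any, Dict

# Proposed defaults for the delta_sharing.url_refresh.* namespace.
_DEFAULTS: Dict[str, Any] = {
    "enabled": True,
    "threshold_seconds": 600,
    "check_interval_seconds": 60,
}

_overrides: Dict[str, Any] = {}


def configure_url_refresh(**overrides: Any) -> None:
    """Programmatic knobs; take precedence over environment variables."""
    unknown = set(overrides) - set(_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown url_refresh option(s): {sorted(unknown)}")
    _overrides.update(overrides)


def url_refresh_config() -> Dict[str, Any]:
    """Resolve config: defaults < env vars < configure_url_refresh()."""
    cfg = dict(_DEFAULTS)
    env = os.environ.get("DELTA_SHARING_URL_REFRESH_ENABLED")
    if env is not None:
        cfg["enabled"] = env.strip().lower() in ("1", "true", "yes")
    cfg.update(_overrides)
    return cfg
```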
5. Tests
Unit tests for EndStreamAction.from_json and the new AddFile
field.
Unit tests for CachedTableManager with a fake clock and a fake
rest client (no network, no real threads — drive the loop manually).
An opt-in integration test against the public reference share that
artificially shortens the refresh threshold so the refresh path is
exercised in a sub-minute test run.
Scope / non-goals
Not changing the Rust wrapper in this issue (see option (a) above).
Not implementing pagination (nextPageToken) here — happy to do it
in a follow-up since it shares plumbing with the endStreamAction parser.
Not handling the table-version caveat from Delta Sharing client needs to refresh urls for the correct table version #383; the same caveat applies as for the Spark connector.
Open questions for maintainers
Are you happy with this overall shape, or would you prefer the
refresh logic to live behind an explicit opt-in (e.g. a flag on SharingClient) rather than on-by-default?
Do you prefer a single PR (~500–700 LOC incl. tests) or
splitting into (a) protocol parsing, (b) cache + legacy path,
(c) kernel path?
Any objection to introducing a daemon thread inside the library, or
would you rather the refresh be driven lazily on each file open?
(Lazy is simpler but loses the "refresh before expiry" property and
adds latency to the first read after expiry.)
Is the delta_sharing.url_refresh.* config namespace acceptable, or
should I put the knobs on DeltaSharingProfile instead?
Happy to take this on if there's appetite. Marking as a draft proposal
until I hear back.
Related: #383, #69, #114.