
[Python] Add table handles and snapshot object model #862

Open

zacdav-db wants to merge 6 commits into delta-io:main from zacdav-db:codex/pr1-python-table-snapshot

Conversation

@zacdav-db commented Mar 11, 2026

Summary:
Add the core object model for Python snapshot reads.

This PR introduces SharingClient.table, DeltaSharingTable, TableSnapshot, table.snapshot, and table.to_pandas as a pure full-snapshot materializer.
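A rough sketch of how the pieces named above could fit together. These are illustrative stand-ins only, not the actual delta-sharing implementation; the signatures and return values here are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins for the object model described in the PR summary;
# the real delta-sharing classes differ in signature and behavior.

@dataclass(frozen=True)
class TableSnapshot:
    url: str
    version: Optional[int] = None
    timestamp: Optional[str] = None

    def to_pandas(self):
        # The real method materializes the full snapshot as a DataFrame;
        # stubbed here so the sketch runs standalone.
        return {"url": self.url, "version": self.version}

@dataclass(frozen=True)
class DeltaSharingTable:
    url: str

    def snapshot(self, version=None, timestamp=None):
        # A snapshot pins an optional version/timestamp for later reads.
        return TableSnapshot(self.url, version, timestamp)

class SharingClient:
    def __init__(self, profile):
        self.profile = profile

    def table(self, url):
        # Returns a lightweight handle; no server calls happen here.
        return DeltaSharingTable(url)
```

The handle/snapshot split keeps configuration (version, timestamp) separate from materialization (`to_pandas`), which is the "pure full-snapshot materializer" framing in the summary.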

Scope:

  • table handles
  • snapshot configuration
  • pandas materialization
  • legacy parity coverage for load_as_pandas vs table.snapshot(...).to_pandas
  • minimal pandas quickstart and README updates

Not included:

  • Arrow APIs
  • lazy batch readers
  • object-based CDF APIs
  • DuckDB examples

Testing:

  • py_compile on touched Python files
  • focused pytest subset for table creation, snapshot wiring, parity, and direct real-table snapshot reads

Part of #860.

@zacdav-db (Author)

@PatrickJin-db this is the first smaller PR in sequence for changes to add clearer UI/UX + arrow. Referencing goals of #860

@linzhou-db (Collaborator) left a comment

Looking good!
Need @PatrickJin-db to add a unit test on a real table.

Or could you try to add one based on existing tables, and let Patrick run the test locally?

Comment thread python/delta_sharing/delta_sharing.py Outdated
).to_pandas()


class DeltaSharingSnapshot:
Collaborator

What if we call it TableSnapshot?

Author

sgtm, will make this change.

@zacdav-db commented Apr 10, 2026

Renamed DeltaSharingSnapshot to TableSnapshot as requested.

Also added a test.

@PatrickJin-db (Collaborator) left a comment

overall direction looks good. this same interface can also be used for polars in the future.

will try running the tests locally tomorrow.

Comment thread python/delta_sharing/delta_sharing.py Outdated
version: Optional[int] = None,
timestamp: Optional[str] = None,
use_delta_format: Optional[bool] = None,
convert_in_batches: bool = False,
Collaborator

I took a look at the larger PR (#861) and it seems like convert_in_batches is only used by to_pandas and not to_arrow. If you don't plan to use convert_in_batches in to_arrow, then I think it makes more sense to have it be an argument of to_pandas rather than a field of TableSnapshot.
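A sketch of the suggested shape, with convert_in_batches passed where it is consumed instead of stored on the snapshot. This is an illustrative stub, not the real reader:

```python
# Illustrative stub of the reviewer's suggestion: convert_in_batches is an
# argument of to_pandas rather than a field of TableSnapshot, since only
# the pandas path consumes it.

class TableSnapshot:
    def __init__(self, url):
        self.url = url

    def to_pandas(self, convert_in_batches: bool = False):
        # The real implementation would convert parquet data to pandas
        # batch-by-batch when convert_in_batches=True; stubbed here.
        mode = "batched" if convert_in_batches else "whole-table"
        return {"url": self.url, "mode": mode}
```

This keeps TableSnapshot free of options that only one materializer uses, so to_arrow does not inherit a parameter it ignores.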

Comment thread examples/python/quickstart_pandas.py Outdated
# Fetch 10 rows from a table and convert it to a Pandas DataFrame. This can be used to read sample data from a table that cannot fit in the memory.
print("########### Loading 10 rows from delta_sharing.default.owid-covid-data as a Pandas DataFrame #############")
data = delta_sharing.load_as_pandas(table_url, limit=10)
# Configure a scan and fetch 10 rows from a table as a Pandas DataFrame.
Collaborator

nit: preserve the original comment

Comment thread python/delta_sharing/delta_sharing.py Outdated
convert_in_batches=self._convert_in_batches,
)

def to_pandas(self) -> pd.DataFrame:
Collaborator

let's also add to_spark
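One way a to_spark method could be wired up is by delegating to the existing module-level loader (delta_sharing.load_as_spark in the real package). The loader injection below is purely so the sketch runs without Spark; the actual wiring is an assumption:

```python
# Sketch of a TableSnapshot.to_spark that delegates to a module-level
# loader. In the real package that loader would be
# delta_sharing.load_as_spark; here it is injected so the sketch is
# testable without a SparkSession.

class TableSnapshot:
    def __init__(self, url, version=None, timestamp=None, loader=None):
        self._url = url
        self._version = version
        self._timestamp = timestamp
        self._loader = loader

    def to_spark(self):
        # Forward the pinned snapshot configuration to the loader.
        return self._loader(
            self._url, version=self._version, timestamp=self._timestamp
        )
```

Delegating keeps the snapshot object a thin configuration holder, so adding to_spark (or a future to_polars) is one small method per target library.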

@zacdav-db (Author)

> overall direction looks good. this same interface can also be used for polars in the future.
>
> will try running the tests locally tomorrow.

Thanks. I'll move some of the changes in secondary PRs into this one based on the feedback.

Zac Davies added 4 commits April 10, 2026 14:11
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
(cherry picked from commit 4df37d7)
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
@zacdav-db commented Apr 10, 2026

Given the request for more of the to_<library> methods, I've pulled forward the to_arrow code and tests. CDF is still in the next PR.

@zacdav-db (Author)

@PatrickJin-db let me know if you've had time to test locally or if there is anything I can do to help move things along.

@PatrickJin-db (Collaborator) commented May 5, 2026

@zacdav-db Sorry for the wait. A few general asks I have are:

  1. Can we keep the interfaces changes separate from the implementation of to_arrow? My recommendation is a) new TableSnapshot interface (perhaps leave out to_arrow for now), with some basic tests ensuring the new to_pandas and to_spark methods work, followed by b) implementing to_arrow, to_record_batches, etc with both unit and integration tests making sure it works end-to-end.
  2. We also recently migrated this repo to use uv for dependency management. Can you add any required dependencies to pyproject.toml?

Also, you should be able to run unit tests locally. I am only required to run tests if they are integration tests (those marked by SKIP_INTEGRATION).

captured["convert_in_batches"] = self._convert_in_batches
return expected

monkeypatch.setattr("delta_sharing.delta_sharing.DeltaSharingReader.to_arrow", fake_to_arrow)
Collaborator

doesn't this monkeypatch kind of defeat the purpose of this test?

captured["convert_in_batches"] = self._convert_in_batches
return expected

monkeypatch.setattr("delta_sharing.delta_sharing.DeltaSharingReader.to_pandas", fake_to_pandas)
Collaborator

same here. In general I'd prefer not using monkeypatch for unit tests, and keeping the server response and parquet file data as the only things we mock.
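A minimal, generic illustration of the concern (not the actual test code): when the method under test is itself replaced, the assertion only exercises the fake.

```python
# If the method under test is patched, the test can pass even though the
# real implementation is broken -- the fake is what runs, not the code path.

class Reader:
    def to_pandas(self):
        raise RuntimeError("real implementation never runs")

def fake_to_pandas(self):
    return "expected"

# Analogous to monkeypatch.setattr("...DeltaSharingReader.to_pandas", ...)
Reader.to_pandas = fake_to_pandas

result = Reader().to_pandas()
# This passes but proves nothing about the real to_pandas. Mocking only
# the server response and parquet file data would let the real conversion
# path execute and be verified.
```
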
