
[Python] Add table-handle, Arrow, and CDF object APIs #861

Closed
zacdav-db wants to merge 20 commits into delta-io:main from zacdav-db:main

@zacdav-db commented Mar 8, 2026

Summary

This PR adds an object-based Python API for Delta Sharing reads, purely as an addition to the existing surface, and extends that object model to change data feed (CDF).

It introduces:

  • SharingClient.table("share.schema.table")
  • DeltaSharingTable
  • DeltaSharingScan
  • DeltaSharingChanges
  • load_as_arrow(...)
  • table.to_arrow(...)
  • table.to_record_batches(...)
  • table.to_record_batch_reader(...)
  • table.changes(...).to_pandas()
  • table.changes(...).to_arrow()
  • table.changes(...).to_record_batches()
  • table.changes(...).to_record_batch_reader()

The existing URL-based pandas and CDF APIs remain supported.

Closes #860

Motivation

The current Python API requires callers to build "<profile>#<share>.<schema>.<table>" strings even when they already hold a SharingClient and think in terms of tables. That string assembly is awkward, and it makes Arrow-native consumers harder to support cleanly.
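
To make the string handling concrete, here is a minimal sketch of the assembly that callers currently do by hand. TableRef, parse_table_name and build_legacy_url are hypothetical helper names used only for illustration; they are not part of delta_sharing. The table name and profile path are the ones used in the benchmark comment in this thread.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """Hypothetical container for the three parts of a table name."""

    share: str
    schema: str
    table: str


def parse_table_name(name: str) -> TableRef:
    """Split 'share.schema.table' into its parts (illustrative helper)."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'share.schema.table', got {name!r}")
    return TableRef(*parts)


def build_legacy_url(profile_path: str, ref: TableRef) -> str:
    """Assemble the '<profile>#<share>.<schema>.<table>' string."""
    return f"{profile_path}#{ref.share}.{ref.schema}.{ref.table}"


ref = parse_table_name("delta_sharing.default.nyctaxi_2019")
url = build_legacy_url("examples/open-datasets.share", ref)
print(url)  # examples/open-datasets.share#delta_sharing.default.nyctaxi_2019
```

With a table handle, the profile stays with the client and only the "share.schema.table" part is passed around, so this formatting step is no longer needed on the caller's side.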

CDF also remained outside the new object model. This PR brings snapshot and CDF reads under the same table-oriented surface while preserving existing legacy behavior.

What Changed

Python API

Added:

  • SharingClient.table(...)
  • DeltaSharingTable.scan(...)
  • DeltaSharingTable.changes(...)
  • DeltaSharingTable.to_pandas(...)
  • DeltaSharingTable.to_arrow(...)
  • DeltaSharingTable.to_record_batches(...)
  • DeltaSharingTable.to_record_batch_reader(...)
  • DeltaSharingChanges.to_pandas()
  • DeltaSharingChanges.to_arrow()
  • DeltaSharingChanges.to_record_batches()
  • DeltaSharingChanges.to_record_batch_reader()
  • load_as_arrow(...)

Compatibility

Kept unchanged:

  • load_as_pandas("<profile>#<share>.<schema>.<table>")
  • load_table_changes_as_pandas(...)
  • existing URL parsing behavior
  • existing SharingClient.list_* behavior

Added regression coverage to verify:

  • legacy load_as_pandas(...) matches client.table(...).to_pandas(...)
  • legacy load_table_changes_as_pandas(...) matches client.table(...).changes(...).to_pandas(...)
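
As a rough illustration of the parity these regression tests check, the sketch below uses in-memory stand-ins rather than the real connector. FakeClient, FakeTable, to_rows and legacy_load are hypothetical names, and having the URL-based function delegate to the handle-based path is an assumption about one way to keep the two in step, not a description of how the PR wires them together.

```python
from typing import Dict, List, Optional

Row = Dict[str, int]

# In-memory stand-in for a sharing server, keyed by "share.schema.table".
_FAKE_TABLES: Dict[str, List[Row]] = {
    "share.schema.table": [{"id": 1}, {"id": 2}, {"id": 3}],
}


class FakeTable:
    """Stand-in for a table handle; not the PR's DeltaSharingTable."""

    def __init__(self, name: str) -> None:
        self.name = name

    def to_rows(self, limit: Optional[int] = None) -> List[Row]:
        rows = _FAKE_TABLES[self.name]
        return list(rows if limit is None else rows[:limit])


class FakeClient:
    """Stand-in for a client that is bound to one profile."""

    def __init__(self, profile: str) -> None:
        self.profile = profile

    def table(self, name: str) -> FakeTable:
        return FakeTable(name)


def legacy_load(url: str, limit: Optional[int] = None) -> List[Row]:
    """URL-based entry point that delegates to the handle-based path."""
    profile, _, name = url.rpartition("#")
    return FakeClient(profile).table(name).to_rows(limit=limit)


new_rows = FakeClient("profile.share").table("share.schema.table").to_rows(limit=2)
old_rows = legacy_load("profile.share#share.schema.table", limit=2)
assert old_rows == new_rows  # the parity property under test
```

When both entry points share one code path like this, a parity test mostly guards against the URL parsing and argument forwarding drifting apart.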

Reader internals

  • Added Arrow table reads
  • Added lazy RecordBatch iteration
  • Added RecordBatchReader support
  • Refactored CDF reads onto a shared Arrow-style stream/materialization path
  • Preserved legacy CDF format semantics: delta format is only used when explicitly requested
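
The relationship between lazy batch iteration and materialized reads can be sketched with plain Python generators. This is only an illustration of the general pattern under stated assumptions: Batch is a list of dicts standing in for an Arrow RecordBatch, and read_file, iter_batches and materialize are hypothetical names, not functions from reader.py.

```python
from typing import Dict, Iterable, Iterator, List

Batch = List[Dict[str, object]]  # stand-in for an Arrow RecordBatch

fetched: List[str] = []  # records which files were actually read


def read_file(name: str) -> Batch:
    """Pretend to download and decode one data file into a batch."""
    fetched.append(name)
    return [{"file": name, "row": i} for i in range(3)]


def iter_batches(files: Iterable[str]) -> Iterator[Batch]:
    """Lazily yield one batch per file; nothing is read until iteration."""
    for name in files:
        yield read_file(name)


def materialize(batches: Iterable[Batch]) -> Batch:
    """Drain a batch stream into a single in-memory list of rows."""
    return [row for batch in batches for row in batch]


files = ["part-0", "part-1", "part-2"]
stream = iter_batches(files)
reads_before_iteration = len(fetched)  # creating the stream reads nothing
first = next(stream)                   # reads only the first file
reads_after_first = len(fetched)
rest = materialize(stream)             # drains the remaining two files
```

In this shape the materializing call is just a consumer of the same stream, which is the sense in which snapshot and CDF reads can share one stream/materialization path.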

Examples and docs

  • Updated examples/python/quickstart_pandas.py to lead with the new syntax and keep the older syntax as a compatibility example
  • Added examples/python/quickstart_arrow.py
  • Updated examples/README.md
  • Updated python/README.md

Testing

Ran:

uv run --python 3.10 --with ./python --with pytest python -m pytest \
  python/delta_sharing/tests/test_reader.py \
  -k 'table_changes or to_arrow or to_record_batch'

uv run --python 3.10 --with ./python --with pytest python -m pytest \
  python/delta_sharing/tests/test_delta_sharing.py \
  -k 'sharing_client_table or delta_sharing_table_changes or load_as_arrow or load_table_changes'

python3 -m py_compile \
  python/delta_sharing/delta_sharing.py \
  python/delta_sharing/reader.py \
  python/delta_sharing/__init__.py \
  python/delta_sharing/tests/test_delta_sharing.py \
  python/delta_sharing/tests/test_reader.py

Local results:

  • CDF/snapshot reader subsets: passing
  • public API subset: passing
  • integration-gated CDF tests: skipped in local non-integration environment
  • Black check: passing

Notes

This PR is intentionally additive. It does not remove or deprecate the existing URL-based snapshot or CDF APIs.

Zac Davies added 5 commits March 8, 2026 16:33
Add an additive object-based Python API for Delta Sharing reads.
This introduces SharingClient.table(...), DeltaSharingTable,
DeltaSharingScan, load_as_arrow(...), and lazy Arrow batch surfaces
via to_record_batches(...) and to_record_batch_reader(...).

Keep the legacy URL-based pandas interface intact and add regression
coverage to ensure the new table-handle pandas path matches the
existing load_as_pandas(...) behavior.

Update examples and docs to show the new table-handle API, add an
Arrow + DuckDB quickstart, and document the extra duckdb example
dependency.

Closes delta-io#860

Signed-off-by: Zac Davies <zac@databricks.com>
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
Zac Davies added 9 commits March 8, 2026 17:50
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
@zacdav-db
Author

To assess the performance of these changes, I ran benchmarks on delta_sharing.default.nyctaxi_2019 (examples/open-datasets.share), comparing the new methods against the existing to_pandas() method, which is labelled legacy_pandas.

Limit 10,000

| Method | Completed | Mean (s) | Median (s) | Stddev (s) | Min (s) | Max (s) | Mean Rows/s | % diff vs table_pandas |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| record_batches | 10 | 3.37 | 3.30 | 0.19 | 3.20 | 3.84 | 2,972.49 | -24.26% |
| duckdb_record_batch_reader | 10 | 3.44 | 3.47 | 0.20 | 3.12 | 3.85 | 2,912.15 | -22.66% |
| table_arrow | 10 | 3.54 | 3.42 | 0.26 | 3.25 | 4.07 | 2,836.53 | -20.48% |
| table_pandas | 10 | 4.45 | 4.40 | 0.28 | 4.06 | 4.97 | 2,253.36 | +0.00% |
| legacy_pandas | 10 | 4.47 | 4.43 | 0.09 | 4.35 | 4.62 | 2,238.45 | +0.34% |

Limit 1,000,000

| Method | Completed | Mean (s) | Median (s) | Stddev (s) | Min (s) | Max (s) | Mean Rows/s | % diff vs table_pandas |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| record_batches | 10 | 11.31 | 11.30 | 0.26 | 10.89 | 11.81 | 88,455.29 | -30.82% |
| table_arrow | 10 | 11.39 | 11.38 | 0.31 | 10.89 | 11.82 | 87,840.25 | -30.33% |
| duckdb_record_batch_reader | 10 | 11.49 | 11.46 | 0.43 | 10.75 | 12.06 | 87,144.39 | -29.73% |
| table_pandas | 10 | 16.35 | 16.41 | 0.40 | 15.48 | 16.90 | 61,194.63 | +0.00% |
| legacy_pandas | 10 | 16.36 | 16.44 | 0.27 | 15.88 | 16.72 | 61,123.83 | +0.08% |
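
For anyone wanting to reproduce numbers of this shape, the following is a minimal sketch of a timing harness, not the script used for the results above. time_runs, pct_diff and fake_read are hypothetical names, the use of the sample standard deviation is an assumption, and the reading of the last column as the relative difference in mean seconds against table_pandas is inferred from the reported values rather than stated in the comment.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(read_fn: Callable[[], int], runs: int = 10) -> Dict[str, float]:
    """Call read_fn `runs` times; read_fn returns the number of rows read."""
    seconds: List[float] = []
    rows_per_s: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        rows = read_fn()
        # Clamp to a tiny positive value so the rate never divides by zero.
        elapsed = max(time.perf_counter() - start, 1e-9)
        seconds.append(elapsed)
        rows_per_s.append(rows / elapsed)
    return {
        "completed": float(len(seconds)),
        "mean_s": statistics.mean(seconds),
        "median_s": statistics.median(seconds),
        "stddev_s": statistics.stdev(seconds),  # sample stddev (assumption)
        "min_s": min(seconds),
        "max_s": max(seconds),
        "mean_rows_per_s": statistics.mean(rows_per_s),
    }


def pct_diff(mean_s: float, baseline_mean_s: float) -> float:
    """Relative difference in mean seconds versus a baseline, in percent."""
    return (mean_s - baseline_mean_s) / baseline_mean_s * 100.0


def fake_read() -> int:
    """Placeholder workload standing in for a real table read."""
    return len([i for i in range(50_000)])


summary = time_runs(fake_read, runs=5)
diff_1m = pct_diff(11.31, 16.35)  # record_batches vs table_pandas, limit 1,000,000
```

Computed from the rounded means in the table, diff_1m lands within a few hundredths of the reported -30.82%; the small gap is expected if the reported figure was derived from unrounded timings.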

@zacdav-db changed the title from "[Python] Add table-handle and Arrow-native read APIs" to "[Python] Add table-handle, Arrow, and CDF object APIs" on Mar 8, 2026
Zac Davies added 2 commits March 9, 2026 00:20
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
@zacdav-db
Author

@PatrickJin-db can you please help review?

@linzhou-db
Collaborator

Is there a chance to split this into a couple of smaller PRs, to make the review easier and the unit test coverage easier to understand?

Zac Davies added 2 commits March 11, 2026 11:43
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
@zacdav-db
Author

I can give it a go. Do you have any preference for how we should split it?

Zac Davies added 2 commits March 11, 2026 11:55
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
@zacdav-db
Author

Closing in favour of smaller PRs, starting with #862

Development

Successfully merging this pull request may close these issues.

Python connector: add first-class table handles and object-based Arrow/CDF read APIs
