[Python] Add table handles and snapshot object model #862
zacdav-db wants to merge 6 commits into delta-io:main
Conversation
@PatrickJin-db this is the first, smaller PR in a sequence of changes to add a clearer UI/UX + Arrow support. Referencing the goals of #860.
linzhou-db left a comment:
Looking good!
Need @PatrickJin-db to add a unit test on a real table. Or could you try to add one based on existing tables, and let Patrick run the test locally?
    ).to_pandas()

    class DeltaSharingSnapshot:
What if we call it TableSnapshot?
Addressed the changes. Also added a test.
PatrickJin-db left a comment:
Overall direction looks good. This same interface can also be used for Polars in the future.
Will try running the tests locally tomorrow.
    version: Optional[int] = None,
    timestamp: Optional[str] = None,
    use_delta_format: Optional[bool] = None,
    convert_in_batches: bool = False,
I took a look at the larger PR (#861) and it seems like convert_in_batches is only used by to_pandas and not to_arrow. If you don't plan to use convert_in_batches in to_arrow, then I think it makes more sense to have it be an argument of to_pandas rather than a field of TableSnapshot.
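The reviewer's suggestion can be sketched as follows. This is a hypothetical illustration of the proposed signature change, not the PR's actual code: the method bodies are stubs that just make the call shape visible, and the field names are copied from the diff above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableSnapshot:
    # Snapshot-level options stay on the class...
    version: Optional[int] = None
    timestamp: Optional[str] = None
    use_delta_format: Optional[bool] = None
    # ...but there is no convert_in_batches field here.

    def to_pandas(self, convert_in_batches: bool = False) -> dict:
        # The pandas-only option travels as a method argument instead.
        # A real implementation would fetch files and build a DataFrame;
        # this stub just echoes the flag so the call shape is visible.
        return {"convert_in_batches": convert_in_batches}

    def to_arrow(self) -> str:
        # to_arrow never used the flag, so it takes no such argument.
        return "arrow-table-placeholder"


snapshot = TableSnapshot(version=3)
print(snapshot.to_pandas(convert_in_batches=True))
```

This keeps `TableSnapshot` format-agnostic: options that only affect one materializer live on that materializer's method.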
    # Fetch 10 rows from a table and convert it to a Pandas DataFrame. This can be used to read sample data from a table that cannot fit in the memory.
    print("########### Loading 10 rows from delta_sharing.default.owid-covid-data as a Pandas DataFrame #############")
    data = delta_sharing.load_as_pandas(table_url, limit=10)
    # Configure a scan and fetch 10 rows from a table as a Pandas DataFrame.
nit: preserve the original comment
        convert_in_batches=self._convert_in_batches,
    )

    def to_pandas(self) -> pd.DataFrame:
let's also add to_spark
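One possible shape for the requested to_spark method, sketched under the assumption that it delegates to an active SparkSession via the "deltaSharing" data source, the way delta-sharing's existing load_as_spark helper does. The class below is a stand-in, not the merged implementation, and `table_url` is an assumed attribute.

```python
class SnapshotWithSpark:
    """Hypothetical sketch; table_url is an assumed attribute name."""

    def __init__(self, table_url: str):
        self.table_url = table_url

    def to_spark(self):
        # pyspark is an optional dependency, so import it lazily.
        try:
            from pyspark.sql import SparkSession
        except ImportError as e:
            raise ImportError("to_spark requires pyspark to be installed") from e
        spark = SparkSession.getActiveSession()
        if spark is None:
            raise RuntimeError(
                "No active SparkSession; to_spark must run inside a Spark job"
            )
        # Same reader path load_as_spark uses: the deltaSharing data source.
        return spark.read.format("deltaSharing").load(self.table_url)
```

Outside a Spark job this raises rather than silently creating a session, mirroring load_as_spark's behavior.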
Thanks. I'll move some of the changes in secondary PRs into this one based on the feedback.
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
(cherry picked from commit 4df37d7)
Signed-off-by: Zac Davies <zachary.davies+data@databricks.com>
Given the request for more of the …
@PatrickJin-db let me know if you've had time to test locally or if there is anything I can do to help move things along.
@zacdav-db Sorry for the wait. A few general asks I have are:
Also, you should be able to run the unit tests locally. I am only required to run tests if they are integration tests (those marked by SKIP_INTEGRATION).
    captured["convert_in_batches"] = self._convert_in_batches
    return expected

    monkeypatch.setattr("delta_sharing.delta_sharing.DeltaSharingReader.to_arrow", fake_to_arrow)
doesn't this monkeypatch kind of defeat the purpose of this test?
    captured["convert_in_batches"] = self._convert_in_batches
    return expected

    monkeypatch.setattr("delta_sharing.delta_sharing.DeltaSharingReader.to_pandas", fake_to_pandas)
same here. In general I'd prefer not using monkeypatch for unit tests, and keeping the server response and parquet file data as the only things we mock.
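The distinction the reviewer is drawing can be shown with a toy example. Everything here is hypothetical (a stand-in server and reader, not the real DeltaSharingReader internals): the point is that only the lowest-level dependency is stubbed, so the method under test still runs its own logic instead of being replaced wholesale by a monkeypatch.

```python
class FakeServer:
    """Stands in for a mocked Delta Sharing server response (hypothetical)."""

    def list_files(self, table: str):
        return [{"url": "part-0.parquet"}, {"url": "part-1.parquet"}]


class ToyReader:
    """Toy analogue of a reader. Only the injected server is fake, so
    to_pandas's own logic is still exercised by the test."""

    def __init__(self, server):
        self.server = server

    def to_pandas(self, table: str):
        files = self.server.list_files(table)
        # Real code would download and concatenate the parquet files here.
        return [f["url"] for f in files]


reader = ToyReader(FakeServer())
assert reader.to_pandas("share.schema.t") == ["part-0.parquet", "part-1.parquet"]
```

Patching to_pandas itself would make the assertion pass even if the conversion logic were broken; stubbing the server response keeps the code path under test honest.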
Summary:
Add the core object model for Python snapshot reads.
This PR introduces SharingClient.table, DeltaSharingTable, TableSnapshot, table.snapshot, and table.to_pandas as a pure full-snapshot materializer.
Scope:
Not included:
Testing:
Part of #860.
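The names in the summary compose roughly like this. A toy sketch with stand-in classes: the real implementation fetches data over the sharing protocol, and the method bodies below are placeholders that only show how the handles chain together.

```python
class TableSnapshot:
    def __init__(self, url: str, version=None):
        self.url = url
        self.version = version

    def to_pandas(self):
        # Full-snapshot materialization would happen here.
        return f"DataFrame({self.url}@{self.version})"


class DeltaSharingTable:
    def __init__(self, url: str):
        self.url = url

    def snapshot(self, version=None) -> TableSnapshot:
        # table.snapshot pins a version (or the latest, when None).
        return TableSnapshot(self.url, version)

    def to_pandas(self):
        # Convenience path: materialize the latest snapshot.
        return self.snapshot().to_pandas()


class SharingClient:
    def table(self, url: str) -> DeltaSharingTable:
        # SharingClient.table returns a handle without fetching data.
        return DeltaSharingTable(url)


client = SharingClient()
result = client.table("share.schema.table").to_pandas()
```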