[python] Support row-level Blob access by XiaoHongbo-Hope · Pull Request #7891 · apache/paimon

XiaoHongbo-Hope · 2026-05-18T13:02:47Z

Purpose

Tests

Revert the read-path changes from 4cf7b23 that always resolved blob descriptors to data. Restore the master behavior, in which the blob-as-descriptor flag is respected. Add Blob.from_bytes(data, file_io) as a unified entry point (equivalent to Java's Blob.fromBytes) that auto-dispatches to BlobData or BlobRef based on the bytes content.

- Add Optional[bytes] type annotation for data parameter
- Raise ValueError when file_io is None but bytes are BlobDescriptor
- Add unit tests for from_bytes covering all branches

- from_bytes(None) returns None (not empty BlobData), matching Java
- Add allow_blob_data parameter (Java's 4th arg): when False, always interpret bytes as descriptor
- Integrate into read path: _blob_cell_to_data now delegates to Blob.from_bytes instead of duplicating dispatch logic
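
The dispatch rules listed in these commit messages can be illustrated with a small standalone sketch. This is not the pypaimon implementation: the DESCRIPTOR_MAGIC value, the class bodies, and the use of a plain callable as a stand-in for file_io are all assumptions made for illustration only.

```python
from typing import Callable, Optional, Union

# Hypothetical marker; the real descriptor encoding is not shown in this PR text.
DESCRIPTOR_MAGIC = b"BLOBDESC"


class BlobData:
    """Blob whose payload is held inline."""

    def __init__(self, data: bytes):
        self._data = data

    def to_data(self) -> bytes:
        return self._data


class BlobRef:
    """Blob that points at external storage and is read lazily."""

    def __init__(self, descriptor: bytes, file_io: Callable[[bytes], bytes]):
        self._descriptor = descriptor
        self._file_io = file_io  # stand-in: a callable mapping descriptor -> payload

    def to_data(self) -> bytes:
        # The payload is fetched only when the caller asks for it.
        return self._file_io(self._descriptor)


def from_bytes(
    data: Optional[bytes],
    file_io: Optional[Callable[[bytes], bytes]] = None,
    allow_blob_data: bool = True,
) -> Optional[Union[BlobData, BlobRef]]:
    if data is None:
        return None  # None in, None out (not an empty BlobData)
    if allow_blob_data and not data.startswith(DESCRIPTOR_MAGIC):
        return BlobData(data)  # inline payload
    if file_io is None:
        raise ValueError("file_io is required to resolve a blob descriptor")
    return BlobRef(data, file_io)  # descriptor: resolve lazily
```

With allow_blob_data=False the inline branch is skipped entirely, so every non-None input is treated as a descriptor.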

Add get_blob(pos) to InternalRow and to_blob_iterator() to TableRead, enabling lazy Blob access with streaming support.

JingsongLi

Review: [python] Support row-level Blob access

Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.

1. Shared mutable state in to_blob_iterator (correctness / thread-safety)

In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:

Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.

A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.
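
A minimal sketch of both patterns follows, using hypothetical MutatingRead and OverridingRead classes rather than the real TableRead. The option key string and the (row, flag) tuples are illustrative assumptions; the point is only to show when a generator's finally block actually runs.

```python
from typing import Dict, Iterator, List, Optional, Tuple

BLOB_AS_DESCRIPTOR = "blob-as-descriptor"  # illustrative option key


class MutatingRead:
    """Anti-pattern: flips a shared option and restores it in a generator's finally."""

    def __init__(self, options: Dict[str, bool], rows: List[str]):
        self.options = options
        self.rows = rows

    def to_blob_iterator(self) -> Iterator[str]:
        original = self.options.get(BLOB_AS_DESCRIPTOR, False)
        self.options[BLOB_AS_DESCRIPTOR] = True  # every user of `options` sees this now

        def generate() -> Iterator[str]:
            try:
                yield from self.rows
            finally:
                # Runs only once the generator is exhausted or closed.
                self.options[BLOB_AS_DESCRIPTOR] = original

        return generate()


class OverridingRead:
    """Safer: the override travels as a parameter; shared options are never written."""

    def __init__(self, options: Dict[str, bool], rows: List[str]):
        self.options = options
        self.rows = rows

    def _read(self, blob_as_descriptor: Optional[bool] = None) -> Iterator[Tuple[str, bool]]:
        effective = self.options.get(BLOB_AS_DESCRIPTOR, False)
        if blob_as_descriptor is not None:
            effective = blob_as_descriptor
        for row in self.rows:
            # A real reader would consult `effective` when decoding the row.
            yield row, effective

    def to_blob_iterator(self) -> Iterator[Tuple[str, bool]]:
        return self._read(blob_as_descriptor=True)
```

In the first class, a caller that takes one row and then drops the iterator without closing it leaves the shared flag set until the generator is garbage collected or closed explicitly. The second class has no such window because nothing shared is written.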

2. OffsetRow.get_blob() when no blob context is set

OffsetRow.get_blob() will raise an AttributeError on NoneType if called on a row not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider checking whether self._file_io is None and raising a clear error in that case.
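
One way the suggested guard could look, sketched with simplified stand-ins for InternalRow and OffsetRow. The constructor signature, the callable file_io, and the choice of NotImplementedError (to mirror the base class) are assumptions, not the PR's actual code.

```python
from typing import Callable, List, Optional


class InternalRow:
    def get_blob(self, pos: int) -> bytes:
        raise NotImplementedError(f"{type(self).__name__} does not support get_blob")


class OffsetRow(InternalRow):
    def __init__(self, values: List[bytes], file_io: Optional[Callable[[bytes], bytes]] = None):
        self._values = values
        self._file_io = file_io  # stand-in: a callable that resolves stored bytes

    def get_blob(self, pos: int) -> bytes:
        if self._file_io is None:
            # Fail with a clear message instead of an AttributeError on NoneType.
            raise NotImplementedError(
                "get_blob requires a blob context; this row has no file_io set"
            )
        return self._file_io(self._values[pos])
```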

3. Blob.from_bytes with allow_blob_data=False edge case

When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path which will fail with an opaque error. Raise a ValueError explicitly instead.
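
A sketch of the explicit check, assuming a hypothetical DESCRIPTOR_MAGIC prefix and a hypothetical parse_descriptor helper; the real descriptor layout and deserialization code are not shown in this conversation.

```python
# Hypothetical marker; the real descriptor magic is not shown in this PR text.
DESCRIPTOR_MAGIC = b"BLOBDESC"


def parse_descriptor(data: bytes) -> bytes:
    """Return the descriptor body, or fail early with a readable error."""
    if not data.startswith(DESCRIPTOR_MAGIC):
        raise ValueError(
            "expected a blob descriptor (allow_blob_data=False), but the bytes "
            f"do not start with the descriptor magic; got prefix {data[:8]!r}"
        )
    return data[len(DESCRIPTOR_MAGIC):]
```

Checking the prefix before deserializing means the caller sees which input was rejected and why, rather than a low-level decoding failure.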

4. Minor: return type annotation

to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].

5. Minor: redundant None check in data_file_batch_reader.py

After the refactor, blob = Blob.from_bytes(value, self.file_io) is followed by blob.to_data() if blob is not None else None. At this point value is guaranteed non-None, so the ternary is dead code.

Nice work on the Blob.from_bytes() unification and the lazy-access pattern.

to_iterator() now yields rows whose BLOB columns are raw stored bytes (descriptor or inline) rather than eagerly resolved payload bytes; row.get_blob(pos) returns a Blob for descriptor cells and raises on inline cells, matching ColumnarRow.getBlob in Java.

RecordBatchReader.read_batch() takes no arguments; file_io is held on the reader and threaded onto the row at iterator construction.

Removed read-time descriptor-to-data conversion paths (_convert_descriptor_stored_blob_columns, _blob_cell_to_data, BlobDescriptorConvertReader wrapping). FormatBlobReader emits descriptor bytes regardless of the blob_as_descriptor option.

to_blob_iterator() is now an alias for to_iterator(); deprecated kwargs on DataFileBatchReader / FormatBlobReader and the OffsetRow with_blob_context shim are kept for one release and will be removed in a follow-up.

Tests that asserted on resolved blob payload bytes from to_iterator() / to_arrow() must now resolve descriptor cells via Blob.from_bytes(bytes, file_io, allow_blob_data=False).to_data() (see the _resolve_blobs helper added in blob_test.py / blob_table_test.py).

This reverts commit 151fa9c.

Drop the BLOB-specific parameters on RecordBatchReader.read_batch() to match Java RecordReader.readBatch(); inject file_io directly onto OffsetRow at iterator construction (mirroring Java ColumnarRow.setFileIO).

OffsetRow.get_blob(pos) becomes a one-liner (Blob.from_bytes(value, file_io)), dropping the _blob_field_indices defensive check Java has no counterpart for. to_blob_iterator() becomes a thin alias of to_iterator().

The read-path behaviour (BLOB_AS_DESCRIPTOR option consumption, eager descriptor resolution when the option is false) is preserved; this PR only aligns the row/iterator API shape, not the read-path semantics.

Compatibility: OffsetRow.with_blob_context(file_io, ...) is kept as a thin alias forwarding to set_file_io(file_io) for one release. Existing callers of read.to_iterator() / read.to_arrow() see no behaviour change.
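
The compatibility alias described in this commit message could be sketched as below. The class is a simplified stand-in for OffsetRow; returning self from set_file_io and emitting a DeprecationWarning are assumptions for illustration, since the commit message only states that the alias forwards to set_file_io(file_io).

```python
import warnings
from typing import Any


class OffsetRow:
    def __init__(self) -> None:
        self._file_io: Any = None

    def set_file_io(self, file_io: Any) -> "OffsetRow":
        self._file_io = file_io
        return self  # assumption: fluent return, for call chaining

    def with_blob_context(self, file_io: Any, *args: Any, **kwargs: Any) -> "OffsetRow":
        # Extra positional and keyword arguments are accepted and ignored so
        # that existing call sites keep working for one release.
        warnings.warn(
            "with_blob_context is deprecated; use set_file_io",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.set_file_io(file_io)
```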

XiaoHongbo-Hope added 3 commits May 18, 2026 20:52

[python] Always resolve blob to actual data on read regardless of blob-as-descriptor setting

4cf7b23

[python] Fix Blob.from_bytes type annotation and add tests

a610d97

- Add Optional[bytes] type annotation for data parameter
- Raise ValueError when file_io is None but bytes are BlobDescriptor
- Add unit tests for from_bytes covering all branches

XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 09:45

XiaoHongbo-Hope marked this pull request as draft May 19, 2026 10:00

XiaoHongbo-Hope added 2 commits May 19, 2026 18:09

[python] Fix flake8 lint errors

5a13539

XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 14:10

XiaoHongbo-Hope changed the title ~~[python] Support transparent blob resolution on read~~ [python] Add Blob.from_bytes unified API May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add Blob.from_bytes unified API~~ [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add Blob.from_bytes to support interpreting blob bytes as Blob object~~ [python] Support unified blob reads May 19, 2026

XiaoHongbo-Hope marked this pull request as draft May 19, 2026 16:00

XiaoHongbo-Hope changed the title ~~[python] Support unified blob reads~~ [python] Add unified Blob.from_bytes resolver May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add unified Blob.from_bytes resolver~~ [python] Support row-level Blob access May 21, 2026

[python] Support row-level Blob access aligned with Java getBlob

3a2a85d

Add get_blob(pos) to InternalRow and to_blob_iterator() to TableRead, enabling lazy Blob access with streaming support.

JingsongLi reviewed May 23, 2026


XiaoHongbo-Hope added 3 commits May 24, 2026 16:20

Revert "[python] Align BLOB read path with Java getBlob semantics"

5983598

This reverts commit 151fa9c.

XiaoHongbo-Hope wants to merge 9 commits into apache:master from XiaoHongbo-Hope:inline_blob
