[python] Support row-level Blob access#7891

Draft
XiaoHongbo-Hope wants to merge 9 commits into
apache:masterfrom
XiaoHongbo-Hope:inline_blob

Conversation

@XiaoHongbo-Hope
Contributor

Purpose

Tests

Revert the read-path changes from 4cf7b23 that always resolved blob
descriptors to data. Restore the master behavior, where the
blob-as-descriptor flag is respected. Add Blob.from_bytes(data, file_io)
as a unified entry point (equivalent to Java's Blob.fromBytes) that
auto-dispatches to BlobData or BlobRef based on the bytes content.
- Add Optional[bytes] type annotation for data parameter
- Raise ValueError when file_io is None but bytes are BlobDescriptor
- Add unit tests for from_bytes covering all branches
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 09:45
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft May 19, 2026 10:00
- from_bytes(None) returns None (not empty BlobData), matching Java
- Add allow_blob_data parameter (Java's 4th arg): when False, always
  interpret bytes as descriptor
- Integrate into read path: _blob_cell_to_data now delegates to
  Blob.from_bytes instead of duplicating dispatch logic
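
The dispatch rules listed in these commits can be sketched as below. This is a minimal stand-in, not the code in this PR: the `DESCRIPTOR_MAGIC` value, the `is_descriptor` helper, and the `file_io.read(...)` call are hypothetical, since the real descriptor encoding and FileIO interface are not shown here.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union

# Hypothetical marker; the real descriptor encoding is not shown in this PR text.
DESCRIPTOR_MAGIC = b"BLOBDESC"


@dataclass
class BlobData:
    """Blob whose payload is stored inline."""
    data: bytes

    def to_data(self) -> bytes:
        return self.data


@dataclass
class BlobRef:
    """Blob that points at external storage through a serialized descriptor."""
    descriptor: bytes
    file_io: Any

    def to_data(self) -> bytes:
        # Assumption: file_io exposes some way to read the referenced bytes.
        return self.file_io.read(self.descriptor)


def is_descriptor(data: bytes) -> bool:
    return data.startswith(DESCRIPTOR_MAGIC)


def from_bytes(
    data: Optional[bytes],
    file_io: Any = None,
    allow_blob_data: bool = True,
) -> Optional[Union[BlobData, BlobRef]]:
    # None in, None out (not an empty BlobData).
    if data is None:
        return None
    # Inline payloads are only recognised when allow_blob_data is True.
    if allow_blob_data and not is_descriptor(data):
        return BlobData(data)
    # Everything else is treated as a descriptor, which needs file_io.
    if file_io is None:
        raise ValueError("file_io is required to resolve a blob descriptor")
    return BlobRef(data, file_io)
```

With `allow_blob_data=False` the sketch never returns `BlobData`, which mirrors the "always interpret bytes as descriptor" rule above.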
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 14:10
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Support transparent blob resolution on read [python] Add Blob.from_bytes unified API May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add Blob.from_bytes unified API [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object [python] Support unified blob reads May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft May 19, 2026 16:00
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Support unified blob reads [python] Add unified Blob.from_bytes resolver May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add unified Blob.from_bytes resolver [python] Support row-level Blob access May 21, 2026
Add get_blob(pos) to InternalRow and to_blob_iterator() to TableRead,
enabling lazy Blob access with streaming support.
Contributor

@JingsongLi JingsongLi left a comment

Review: [python] Support row-level Blob access

Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.

1. Shared mutable state in to_blob_iterator (correctness / thread-safety)

In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:

  • Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
  • Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.

A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.
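
A rough sketch of that pattern, using a stand-in `TableRead` rather than the real class. The `_read` helper, the `"desc:"` prefix, and the option key string are assumptions for illustration only:

```python
from typing import Dict, Iterator, List

BLOB_AS_DESCRIPTOR = "blob-as-descriptor"  # assumed option key


class TableRead:
    """Stand-in reader; only the option handling is of interest here."""

    def __init__(self, options: Dict[str, str], rows: List[bytes]):
        self.options = options  # shared with every other read on the table
        self._rows = rows

    def _read(self, options: Dict[str, str]) -> Iterator[bytes]:
        # Hypothetical helper: the effective options arrive as an argument.
        as_descriptor = options.get(BLOB_AS_DESCRIPTOR) == "true"
        for row in self._rows:
            yield (b"desc:" + row) if as_descriptor else row

    def to_iterator(self) -> Iterator[bytes]:
        return self._read(self.options)

    def to_blob_iterator(self) -> Iterator[bytes]:
        # Build a per-call copy instead of mutating self.options, so there is
        # nothing to restore and nothing for a concurrent read to observe.
        overridden = {**self.options, BLOB_AS_DESCRIPTOR: "true"}
        return self._read(overridden)
```

Because the override lives in a local copy, abandoning the generator halfway leaves the shared options untouched.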

2. OffsetRow.get_blob() when no blob context is set

OffsetRow.get_blob() will raise an AttributeError on NoneType if called on a row that was not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider adding an explicit check for self._file_io being None and raising a descriptive error in that case.
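
Something along these lines, as a sketch only. The `OffsetRow` here is a simplified stand-in, `file_io.resolve(...)` is a hypothetical call, and raising NotImplementedError (to match the base class) is one possible choice:

```python
from typing import Any, Optional, Sequence


class OffsetRow:
    """Simplified stand-in for the real row class."""

    def __init__(self, values: Sequence[Optional[bytes]], file_io: Any = None):
        self._values = values
        self._file_io = file_io

    def get_blob(self, pos: int) -> Any:
        # Fail early with a clear message instead of an AttributeError
        # surfacing later from a None file_io.
        if self._file_io is None:
            raise NotImplementedError(
                "get_blob() needs a row with file_io set; "
                "this row was created without blob context"
            )
        # Hypothetical resolution step; the real code goes through Blob.from_bytes.
        return self._file_io.resolve(self._values[pos])
```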

3. Blob.from_bytes with allow_blob_data=False edge case

When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path, which fails with an opaque error. Raise a ValueError explicitly instead.
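
For example, a check like the following could run before deserialization. The `DESCRIPTOR_MAGIC` value and the `require_descriptor` name are made up for this sketch; the actual prefix is whatever the descriptor format defines:

```python
DESCRIPTOR_MAGIC = b"BLOBDESC"  # hypothetical prefix, for illustration only


def require_descriptor(data: bytes) -> bytes:
    """Return the descriptor body, or raise if the magic prefix is missing."""
    if not data.startswith(DESCRIPTOR_MAGIC):
        raise ValueError(
            "expected a serialized blob descriptor, but the magic prefix is "
            "missing; inline blob data is not allowed when allow_blob_data=False"
        )
    return data[len(DESCRIPTOR_MAGIC):]
```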

4. Minor: return type annotation

to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].

5. Minor: redundant None check in data_file_batch_reader.py

After the refactor, blob = Blob.from_bytes(value, self.file_io) is followed by blob.to_data() if blob is not None else None. At this point value is guaranteed to be non-None, so from_bytes cannot return None and the ternary is dead code.

Nice work on the Blob.from_bytes() unification and the lazy-access pattern.

to_iterator() now yields rows whose BLOB columns are raw stored bytes
(descriptor or inline) rather than eagerly resolved payload bytes;
row.get_blob(pos) returns a Blob for descriptor cells and raises on
inline cells, matching ColumnarRow.getBlob in Java.

RecordBatchReader.read_batch() takes no arguments; file_io is held on
the reader and threaded onto the row at iterator construction.

Removed read-time descriptor-to-data conversion paths
(_convert_descriptor_stored_blob_columns, _blob_cell_to_data,
BlobDescriptorConvertReader wrapping). FormatBlobReader emits descriptor
bytes regardless of blob_as_descriptor option.

to_blob_iterator() is now an alias for to_iterator(); deprecated kwargs
on DataFileBatchReader / FormatBlobReader and the OffsetRow
with_blob_context shim are kept for one release and will be removed in a
follow-up.

Tests that asserted on resolved blob payload bytes from to_iterator() /
to_arrow() must now resolve descriptor cells via
Blob.from_bytes(bytes, file_io, allow_blob_data=False).to_data() (see
the _resolve_blobs helper added in blob_test.py / blob_table_test.py).
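
A rough sketch of what such a test helper could look like. The actual _resolve_blobs in the test files is not shown here, so the signature below is hypothetical; `resolve` stands in for a callable like `lambda b: Blob.from_bytes(b, file_io, allow_blob_data=False).to_data()`.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, Optional[bytes]]


def resolve_blobs(
    rows: Iterable[Row],
    blob_columns: Iterable[str],
    resolve: Callable[[bytes], bytes],
) -> List[Row]:
    """Return copies of rows with descriptor cells replaced by payload bytes."""
    columns = list(blob_columns)
    resolved: List[Row] = []
    for row in rows:
        out = dict(row)  # leave the caller's rows untouched
        for col in columns:
            cell = out.get(col)
            if cell is not None:  # NULL blob cells stay NULL
                out[col] = resolve(cell)
        resolved.append(out)
    return resolved
```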
Drop the BLOB-specific parameters on RecordBatchReader.read_batch() to
match Java RecordReader.readBatch(); inject file_io directly onto
OffsetRow at iterator construction (mirroring Java ColumnarRow.setFileIO).

OffsetRow.get_blob(pos) becomes a one-liner (Blob.from_bytes(value,
file_io)), dropping the _blob_field_indices defensive check, which has
no Java counterpart.

to_blob_iterator() becomes a thin alias of to_iterator(). The read-path
behaviour (BLOB_AS_DESCRIPTOR option consumption, eager descriptor
resolution when the option is false) is preserved. This PR only aligns
the row/iterator API shape, not the read-path semantics.

Compatibility: OffsetRow.with_blob_context(file_io, ...) is kept as a
thin alias forwarding to set_file_io(file_io) for one release. Existing
callers of read.to_iterator() / read.to_arrow() see no behaviour change.
2 participants