[python] Support row-level Blob access #7891
…b-as-descriptor setting
Revert the read-path changes from 4cf7b23 that always resolved blob descriptors to data. Restore master behavior where blob-as-descriptor flag is respected. Add Blob.from_bytes(data, file_io) as a unified entry point (equivalent to Java's Blob.fromBytes) that auto-dispatches BlobData vs BlobRef based on the bytes content.
- Add Optional[bytes] type annotation for data parameter
- Raise ValueError when file_io is None but bytes are BlobDescriptor
- Add unit tests for from_bytes covering all branches
- from_bytes(None) returns None (not empty BlobData), matching Java
- Add allow_blob_data parameter (Java's 4th arg): when False, always interpret bytes as descriptor
- Integrate into read path: _blob_cell_to_data now delegates to Blob.from_bytes instead of duplicating dispatch logic
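The dispatch described in these commits can be sketched roughly as follows. This is a minimal, self-contained illustration and not the pypaimon implementation: the `MAGIC` prefix, the stand-in `BlobData` and `BlobRef` classes, and the `looks_like_descriptor` helper are hypothetical names introduced for the example, and the real descriptor format is assumed rather than taken from the PR.

```python
from typing import Any, Optional

# Hypothetical marker; the real BlobDescriptor serialization is an assumption here.
MAGIC = b"BLOBDESC"


class BlobData:
    """Stand-in for an inline blob that holds its payload bytes directly."""

    def __init__(self, data: bytes):
        self._data = data

    def to_data(self) -> bytes:
        return self._data


class BlobRef:
    """Stand-in for a blob that points at external storage via descriptor bytes."""

    def __init__(self, file_io: Any, descriptor: bytes):
        self.file_io = file_io
        self.descriptor = descriptor


def looks_like_descriptor(data: bytes) -> bool:
    # Assumed detection rule: descriptor bytes start with a magic prefix.
    return data.startswith(MAGIC)


def from_bytes(data: Optional[bytes], file_io: Any = None,
               allow_blob_data: bool = True):
    # None stays None instead of becoming an empty BlobData.
    if data is None:
        return None
    # Inline payload is only considered when allow_blob_data is True.
    if allow_blob_data and not looks_like_descriptor(data):
        return BlobData(data)
    # Everything else is treated as a descriptor and needs a file_io to resolve.
    if file_io is None:
        raise ValueError("file_io is required when bytes are a blob descriptor")
    return BlobRef(file_io, data)
```

With `allow_blob_data=False` the sketch always takes the descriptor branch, which mirrors the "always interpret bytes as descriptor" wording in the commit message.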
Add get_blob(pos) to InternalRow and to_blob_iterator() to TableRead, enabling lazy Blob access with streaming support.
JingsongLi left a comment
Review: [python] Support row-level Blob access
Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.
1. Shared mutable state in to_blob_iterator (correctness / thread-safety)
In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:
- Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
- Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.
A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.
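To make the concern concrete, here is a small standalone sketch of both patterns. The `Table` class, the `"BLOB_AS_DESCRIPTOR"` dictionary key, and both function names are hypothetical stand-ins for illustration; they are not the actual code in table_read.py.

```python
from typing import Dict, Iterator, List, Tuple


class Table:
    """Hypothetical stand-in for a table with shared, mutable options."""

    def __init__(self) -> None:
        self.options: Dict[str, bool] = {"BLOB_AS_DESCRIPTOR": False}


def rows_mutating(table: Table, rows: List[int]) -> Iterator[int]:
    """Problematic pattern: mutate now, restore only in the generator's finally."""
    original = table.options["BLOB_AS_DESCRIPTOR"]
    table.options["BLOB_AS_DESCRIPTOR"] = True  # visible to every other reader

    def generate() -> Iterator[int]:
        try:
            for row in rows:
                yield row
        finally:
            # Runs only once the generator is exhausted or closed.
            table.options["BLOB_AS_DESCRIPTOR"] = original

    return generate()


def rows_with_override(table: Table, rows: List[int],
                       blob_as_descriptor: bool = True) -> Iterator[Tuple[int, bool]]:
    """Safer pattern: build a private copy of the options; never touch the table."""
    local_options = {**table.options, "BLOB_AS_DESCRIPTOR": blob_as_descriptor}
    for row in rows:
        yield row, local_options["BLOB_AS_DESCRIPTOR"]


table = Table()
leaky = rows_mutating(table, [1, 2, 3])
first = next(leaky)                                   # partially consumed
leaked = table.options["BLOB_AS_DESCRIPTOR"]          # still True here
leaky.close()                                         # only now is it restored
restored = table.options["BLOB_AS_DESCRIPTOR"]

safe = rows_with_override(table, [1, 2, 3])
safe_first = next(safe)
untouched = table.options["BLOB_AS_DESCRIPTOR"]       # shared state never changed
```

In the first variant the shared option stays flipped for as long as the caller holds a partially consumed generator; in the second, the override lives only in a local copy.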
2. OffsetRow.get_blob() when no blob context is set
OffsetRow.get_blob() will raise an AttributeError on NoneType if it is called on a row that was not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider checking whether self._file_io is None and raising an explicit error in that case.
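One possible shape for that guard, as a sketch only: the classes below are simplified stand-ins rather than the real InternalRow and OffsetRow, the constructor signature is assumed, and raising NotImplementedError (to match the base class) is just one reasonable choice of exception.

```python
from typing import Any, List, Optional, Tuple


class InternalRow:
    """Simplified stand-in for the base row class."""

    def get_blob(self, pos: int) -> Any:
        raise NotImplementedError("get_blob() is not supported by this row type")


class OffsetRow(InternalRow):
    """Simplified stand-in; the real class carries more state."""

    def __init__(self, values: List[bytes], file_io: Optional[Any] = None):
        self._values = values
        self._file_io = file_io

    def get_blob(self, pos: int) -> Tuple[Any, bytes]:
        # Fail early with a clear message instead of an AttributeError on None.
        if self._file_io is None:
            raise NotImplementedError(
                "get_blob() requires a blob context; "
                "obtain rows via to_blob_iterator()"
            )
        # Placeholder for the real Blob construction.
        return self._file_io, self._values[pos]
```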
3. Blob.from_bytes with allow_blob_data=False edge case
When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path which will fail with an opaque error. Raise a ValueError explicitly instead.
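A sketch of the explicit check, assuming descriptors can be recognized by a magic prefix. `MAGIC` and `should_parse_as_descriptor` are hypothetical names, and the actual prefix and detection logic in Blob.from_bytes may differ.

```python
# Hypothetical marker; the real descriptor prefix is an assumption here.
MAGIC = b"BLOBDESC"


def should_parse_as_descriptor(data: bytes, allow_blob_data: bool) -> bool:
    """Return True for descriptor bytes, False for inline data, or raise."""
    if data.startswith(MAGIC):
        return True
    if allow_blob_data:
        return False
    # Raw bytes were supplied although only descriptors are acceptable.
    raise ValueError(
        "allow_blob_data=False but bytes do not start with the "
        "blob descriptor magic prefix"
    )
```

Raising here keeps the failure close to the cause, rather than surfacing later as a deserialization error.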
4. Minor: return type annotation
to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].
5. Minor: redundant None check in data_file_batch_reader.py
After the refactor, blob = Blob.from_bytes(value, self.file_io) is followed by blob.to_data() if blob is not None else None. At this point value is guaranteed non-None, so from_bytes cannot return None and the ternary is dead code.
Nice work on the Blob.from_bytes() unification and the lazy-access pattern.
- to_iterator() now yields rows whose BLOB columns are raw stored bytes (descriptor or inline) rather than eagerly resolved payload bytes; row.get_blob(pos) returns a Blob for descriptor cells and raises on inline cells, matching ColumnarRow.getBlob in Java.
- RecordBatchReader.read_batch() takes no arguments; file_io is held on the reader and threaded onto the row at iterator construction.
- Removed read-time descriptor-to-data conversion paths (_convert_descriptor_stored_blob_columns, _blob_cell_to_data, BlobDescriptorConvertReader wrapping).
- FormatBlobReader emits descriptor bytes regardless of blob_as_descriptor option.
- to_blob_iterator() is now an alias for to_iterator(); deprecated kwargs on DataFileBatchReader / FormatBlobReader and the OffsetRow with_blob_context shim are kept for one release and will be removed in a follow-up.
- Tests that asserted on resolved blob payload bytes from to_iterator() / to_arrow() must now resolve descriptor cells via Blob.from_bytes(bytes, file_io, allow_blob_data=False).to_data() (see the _resolve_blobs helper added in blob_test.py / blob_table_test.py).
This reverts commit 151fa9c.
- Drop the BLOB-specific parameters on RecordBatchReader.read_batch() to match Java RecordReader.readBatch(); inject file_io directly onto OffsetRow at iterator construction (mirroring Java ColumnarRow.setFileIO).
- OffsetRow.get_blob(pos) becomes a one-liner (Blob.from_bytes(value, file_io)), dropping the _blob_field_indices defensive check Java has no counterpart for.
- to_blob_iterator() becomes a thin alias of to_iterator().
The read-path behaviour (BLOB_AS_DESCRIPTOR option consumption, eager descriptor resolution when the option is false) is preserved: this PR only aligns the row/iterator API shape, not the read-path semantics.
Compatibility: OffsetRow.with_blob_context(file_io, ...) is kept as a thin alias forwarding to set_file_io(file_io) for one release. Existing callers of read.to_iterator() / read.to_arrow() see no behaviour change.
Purpose
Tests