Route Arrow - DType extension conversion through the registry#7592
Route Arrow - DType extension conversion through the registry#7592
Conversation
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Merging this PR will degrade performance by 10.6%
Performance Changes
Comparing Footnotes
|
| /// Vortex-internal extension ids and Arrow canonical extension names. Canonical extensions | ||
| /// serialize metadata as raw UTF-8 (typically JSON) rather than base64-wrapped bytes. | ||
| const CANONICAL_ALIASES: &[(&str, &str)] = | ||
| &[("vortex.fixed_shape_tensor", "arrow.fixed_shape_tensor")]; |
There was a problem hiding this comment.
Whoops I forgot to change vortex.fixed_shape_tensor to vortex.tensor.fixed_shape_tensor... Can you add that to this PR? I could do it myself but then these won't be synced. Maybe the str ID should just be a constant and both call sites use that?
|
ok I have thoughts but I'm going to head out soon and this required more thinking, I'll try and post a review first thing tomorrow @palaska |
| const ARROW_EXT_NAME_VARIANT: &str = "arrow.parquet.variant"; | ||
|
|
||
| /// `(vortex_id, arrow_canonical_name)` pairs — single source of truth for bijection between | ||
| /// Vortex-internal extension ids and Arrow canonical extension names. Canonical extensions | ||
| /// serialize metadata as raw UTF-8 (typically JSON) rather than base64-wrapped bytes. | ||
| const CANONICAL_ALIASES: &[(&str, &str)] = | ||
| &[("vortex.fixed_shape_tensor", "arrow.fixed_shape_tensor")]; |
There was a problem hiding this comment.
Just putting what we talked about over slack here - I think we should avoid hard-coding extensions here, this whole part of the code shouldn't know about them and while we can have some best-practices/defaults or assumed semantics for some of these, it should be done externally and allow for customization by the user.
| /// resolve `ARROW:extension:name` metadata into [`DType::Extension`] values. | ||
| /// | ||
| /// Unregistered or malformed extension metadata falls back to the storage dtype. | ||
| pub trait FromArrowWithSession<T>: Sized { |
There was a problem hiding this comment.
Should this be a function with a default impl on TryFromArrowType?
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
| #[derive(Debug)] | ||
| pub struct DTypeSession { | ||
| registry: ExtDTypeRegistry, | ||
| arrow_canonical: RwLock<ArrowCanonicalAliases>, |
There was a problem hiding this comment.
We cannot use a RwLock, it has very high contention. It rarely written to?
There was a problem hiding this comment.
Claude recommended ArcSwap<HashMap>, lock-free reads and atomic write swaps
There was a problem hiding this comment.
Yes, I think wrap it in a newtype just like #7588
| //! Test extension types for exercising the [`ExtVTable`] contract. | ||
|
|
||
| mod divisible_int; | ||
| pub(crate) mod divisible_int; |
There was a problem hiding this comment.
for the newly added extension_roundtrip tests, i needed a test extension type with metadata. What is your comment here? 😄
There was a problem hiding this comment.
It was do we need to do this
There was a problem hiding this comment.
Why remove this in this PR
There was a problem hiding this comment.
arrow requires metadata to be UTF-8 json so i swapped, do we want to keep the proto around?
There was a problem hiding this comment.
But in general JSON is less space efficient than proto
There was a problem hiding this comment.
I think it's fine so that we are compat with arrow
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
| match field.extension_type_metadata() { | ||
| None | Some("") => Ok(Cow::Borrowed(&[])), | ||
| Some(s) if is_canonical => Ok(Cow::Borrowed(s.as_bytes())), | ||
| Some(s) => BASE64_STANDARD |
There was a problem hiding this comment.
@joseph-isaacs should I get rid of this and assert valid utf 8 on extension metadata serde? Then this codepath will be simpler (String::from_utf8) and we can make serialize_metadata return VortexResult<String> instead of bytes.
There was a problem hiding this comment.
Do we store BASE_64 json in the vortex file for this type? Surely that is a waste of space
There was a problem hiding this comment.
this is only the arrow serde path. Metadata for the fixed shape tensor is utf8 encoded json bytes, no base64. I guess its still larger than proto but it's metadata at the end of the day, your call :)
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
…ndary Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
arrow.fixed_shape_tensorwith JSON metadata, so pyarrow sees it as a first-class FixedShapeTensorArray.