Route Arrow - DType extension conversion through the registry#7592

Open
palaska wants to merge 30 commits into develop from bp/arrow-ext

Contributor

@palaska palaska commented Apr 22, 2026

  • Arrow <-> DType extension conversion now consults the DTypeSession extension registry.
  • Any registered extension round-trips through Arrow by default, not just temporal types and Parquet Variant.
  • Extensions can register an Arrow canonical alias; FixedShapeTensor uses this to emit arrow.fixed_shape_tensor with JSON metadata, so pyarrow sees it as a first-class FixedShapeTensorArray.
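
The alias mechanism described in the bullets above can be pictured as a two-way lookup table. The following is a minimal sketch under stated assumptions, not the PR's implementation: `CanonicalAliases`, `register`, `arrow_name`, and `vortex_id` are hypothetical names, and falling back to the Vortex id for unaliased extensions is an assumption about how "round-trips by default" might work.

```rust
use std::collections::HashMap;

/// Hypothetical two-way table between Vortex extension ids and Arrow
/// canonical extension names. Illustrative only, not the PR's actual type.
#[derive(Debug, Default)]
pub struct CanonicalAliases {
    vortex_to_arrow: HashMap<String, String>,
    arrow_to_vortex: HashMap<String, String>,
}

impl CanonicalAliases {
    /// Registers a pair. Returns `false` and changes nothing if either side
    /// is already mapped, so the table always stays a bijection.
    pub fn register(&mut self, vortex_id: &str, arrow_name: &str) -> bool {
        if self.vortex_to_arrow.contains_key(vortex_id)
            || self.arrow_to_vortex.contains_key(arrow_name)
        {
            return false;
        }
        self.vortex_to_arrow
            .insert(vortex_id.to_owned(), arrow_name.to_owned());
        self.arrow_to_vortex
            .insert(arrow_name.to_owned(), vortex_id.to_owned());
        true
    }

    /// Name to write on export: the alias if one exists, otherwise the
    /// Vortex id unchanged (an assumption made for this sketch).
    pub fn arrow_name<'a>(&'a self, vortex_id: &'a str) -> &'a str {
        self.vortex_to_arrow
            .get(vortex_id)
            .map(String::as_str)
            .unwrap_or(vortex_id)
    }

    /// Inverse lookup used on import.
    pub fn vortex_id<'a>(&'a self, arrow_name: &'a str) -> &'a str {
        self.arrow_to_vortex
            .get(arrow_name)
            .map(String::as_str)
            .unwrap_or(arrow_name)
    }
}
```

Keeping both directions in one structure means an import and an export can never disagree about which Arrow name belongs to which Vortex id.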

@palaska palaska changed the title from "Route Arrow ↔ DType extension conversion through the registry" to "Route Arrow - DType extension conversion through the registry" Apr 22, 2026
palaska added 2 commits April 22, 2026 16:53
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>

codspeed-hq Bot commented Apr 22, 2026

Merging this PR will degrade performance by 10.6%

⚡ 24 improved benchmarks
❌ 1 regressed benchmark
✅ 1105 untouched benchmarks
⏩ 33 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | take_search[(0.005, 0.05)] | 167.6 µs | 130.7 µs | +28.18% |
| Simulation | take_search[(0.005, 0.1)] | 319.7 µs | 246.5 µs | +29.72% |
| Simulation | take_search[(0.005, 0.5)] | 1.5 ms | 1.2 ms | +31.07% |
| Simulation | take_search[(0.005, 1.0)] | 3.1 ms | 2.3 ms | +31.28% |
| Simulation | take_search[(0.01, 0.1)] | 340.6 µs | 267.4 µs | +27.39% |
| Simulation | take_search[(0.01, 0.05)] | 178.7 µs | 141.8 µs | +25.97% |
| Simulation | take_search[(0.01, 0.5)] | 1.6 ms | 1.3 ms | +28.61% |
| Simulation | take_search[(0.01, 1.0)] | 3.3 ms | 2.5 ms | +28.79% |
| Simulation | take_search[(0.1, 0.05)] | 248.4 µs | 211.6 µs | +17.42% |
| Simulation | take_search[(0.1, 0.1)] | 458.1 µs | 384.8 µs | +19.06% |
| Simulation | take_search[(0.1, 1.0)] | 4.3 ms | 3.5 ms | +20.65% |
| Simulation | take_search_chunked[(0.005, 0.1)] | 383.3 µs | 321.1 µs | +19.37% |
| Simulation | take_search_chunked[(0.005, 0.5)] | 1.9 ms | 1.5 ms | +20.08% |
| Simulation | take_search_chunked[(0.01, 0.05)] | 212.6 µs | 181.4 µs | +17.23% |
| Simulation | take_search[(0.1, 0.5)] | 2.2 ms | 1.8 ms | +20.43% |
| Simulation | take_search_chunked[(0.005, 1.0)] | 3.7 ms | 3.1 ms | +20.18% |
| Simulation | take_search_chunked[(0.005, 0.05)] | 199.9 µs | 168.6 µs | +18.54% |
| Simulation | take_search_chunked[(0.01, 0.5)] | 2 ms | 1.7 ms | +18.64% |
| Simulation | take_search_chunked[(0.01, 0.1)] | 407.9 µs | 345.7 µs | +17.99% |
| Simulation | take_search_chunked[(0.1, 0.05)] | 278.6 µs | 247.3 µs | +12.66% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing bp/arrow-ext (fdaf5d7) with develop (140eec6)

Footnotes

  1. 33 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@palaska palaska added the changelog/feature A new feature label Apr 22, 2026
@palaska palaska marked this pull request as ready for review April 22, 2026 15:59
@palaska palaska requested a review from connortsui20 April 22, 2026 15:59
@connortsui20 connortsui20 requested a review from AdamGS April 22, 2026 16:09
Comment thread vortex-array/src/dtype/arrow.rs Outdated
/// Vortex-internal extension ids and Arrow canonical extension names. Canonical extensions
/// serialize metadata as raw UTF-8 (typically JSON) rather than base64-wrapped bytes.
const CANONICAL_ALIASES: &[(&str, &str)] =
    &[("vortex.fixed_shape_tensor", "arrow.fixed_shape_tensor")];
Contributor

Whoops I forgot to change vortex.fixed_shape_tensor to vortex.tensor.fixed_shape_tensor... Can you add that to this PR? I could do it myself but then these won't be synced. Maybe the str ID should just be a constant and both call sites use that?
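
One way to read the suggestion above is to define the id once and reference it from both call sites. This is a sketch under assumptions: the constant names `FIXED_SHAPE_TENSOR_ID` and `ARROW_FIXED_SHAPE_TENSOR` and the function `extension_id` are hypothetical, and the id value is the renamed one proposed in the comment.

```rust
/// Hypothetical shared constant: the single definition of the Vortex id.
pub const FIXED_SHAPE_TENSOR_ID: &str = "vortex.tensor.fixed_shape_tensor";

/// Arrow canonical extension name, as it appears in the diff hunk above.
pub const ARROW_FIXED_SHAPE_TENSOR: &str = "arrow.fixed_shape_tensor";

/// Call site 1: the alias table refers to the constant, not a string literal.
pub const CANONICAL_ALIASES: &[(&str, &str)] =
    &[(FIXED_SHAPE_TENSOR_ID, ARROW_FIXED_SHAPE_TENSOR)];

/// Call site 2 (hypothetical): whatever reports the extension's own id
/// returns the same constant, so a rename cannot desynchronize the two.
pub fn extension_id() -> &'static str {
    FIXED_SHAPE_TENSOR_ID
}
```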

Contributor

AdamGS commented Apr 22, 2026

ok I have thoughts but I'm going to head out soon and this requires more thinking, I'll try and post a review first thing tomorrow @palaska

Comment thread vortex-array/src/dtype/arrow.rs Outdated
Comment on lines +56 to +62
const ARROW_EXT_NAME_VARIANT: &str = "arrow.parquet.variant";

/// `(vortex_id, arrow_canonical_name)` pairs — single source of truth for bijection between
/// Vortex-internal extension ids and Arrow canonical extension names. Canonical extensions
/// serialize metadata as raw UTF-8 (typically JSON) rather than base64-wrapped bytes.
const CANONICAL_ALIASES: &[(&str, &str)] =
    &[("vortex.fixed_shape_tensor", "arrow.fixed_shape_tensor")];
Contributor

@AdamGS AdamGS Apr 23, 2026

Just putting what we talked about over Slack here: I think we should avoid hard-coding extensions here. This whole part of the code shouldn't know about them, and while we can have some best practices/defaults or assumed semantics for some of these, that should be done externally and allow for customization by the user.

Comment thread vortex-array/src/dtype/arrow.rs Outdated
/// resolve `ARROW:extension:name` metadata into [`DType::Extension`] values.
///
/// Unregistered or malformed extension metadata falls back to the storage dtype.
pub trait FromArrowWithSession<T>: Sized {
Contributor

Should this be a function with a default impl on TryFromArrowType?
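
The question above could be sketched as a session-aware method with a default body on the existing trait. Everything here is a simplified assumption: the real `TryFromArrowType` signature is not shown in this thread, and `Session`, `ArrowField`, `DType`, and `StorageName` are hypothetical stand-ins. The fallback to the storage dtype mirrors the doc comment in the hunk.

```rust
/// Hypothetical session that only knows a list of registered extension names.
pub struct Session {
    pub registered: Vec<&'static str>,
}

/// Hypothetical, heavily simplified Arrow field.
pub struct ArrowField {
    pub storage: String,
    pub extension_name: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum DType {
    Storage(String),
    Extension { id: String, storage: Box<DType> },
}

/// Assumed shape of the trait; the real one may differ.
pub trait TryFromArrowType<T>: Sized {
    fn try_from_arrow(value: T) -> Result<Self, String>;

    /// Default: ignore the session and use the plain conversion.
    fn try_from_arrow_with_session(value: T, _session: &Session) -> Result<Self, String> {
        Self::try_from_arrow(value)
    }
}

impl<'a> TryFromArrowType<&'a ArrowField> for DType {
    fn try_from_arrow(field: &'a ArrowField) -> Result<Self, String> {
        Ok(DType::Storage(field.storage.clone()))
    }

    /// Override: resolve registered extension names, otherwise fall back
    /// to the storage dtype.
    fn try_from_arrow_with_session(
        field: &'a ArrowField,
        session: &Session,
    ) -> Result<Self, String> {
        let storage = Self::try_from_arrow(field)?;
        match &field.extension_name {
            Some(name) if session.registered.iter().any(|r| *r == name.as_str()) => {
                Ok(DType::Extension { id: name.clone(), storage: Box::new(storage) })
            }
            _ => Ok(storage),
        }
    }
}

/// A type that does not care about extensions keeps the default method.
#[derive(Debug, PartialEq)]
pub struct StorageName(pub String);

impl<'a> TryFromArrowType<&'a ArrowField> for StorageName {
    fn try_from_arrow(field: &'a ArrowField) -> Result<Self, String> {
        Ok(StorageName(field.storage.clone()))
    }
}
```

The trade-off in this shape is that implementors who need the session override one method, while everyone else gets session-aware call sites for free.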

Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
@palaska palaska requested review from AdamGS and connortsui20 April 24, 2026 12:21
Comment thread vortex-array/src/dtype/session.rs Outdated
#[derive(Debug)]
pub struct DTypeSession {
    registry: ExtDTypeRegistry,
    arrow_canonical: RwLock<ArrowCanonicalAliases>,
Contributor

We cannot use a RwLock; it has very high contention. Is it rarely written to?

Contributor Author

Claude recommended ArcSwap<HashMap>, lock-free reads and atomic write swaps

Contributor

Yes, I think wrap it in a newtype just like #7588
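
The newtype idea from this thread can be sketched with the standard library alone. Note the loud assumption: this is a stand-in, not `ArcSwap`. It still takes a brief read lock to clone an `Arc`, whereas the `arc-swap` crate would make that read lock-free. `AliasTable`, `snapshot`, and `insert` are hypothetical names. What the sketch does show is the snapshot plus copy-on-write pattern that suits a rarely written table.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical newtype hiding the synchronization strategy, so it can be
/// swapped (for example to `ArcSwap`) without touching callers.
#[derive(Debug, Default)]
pub struct AliasTable(RwLock<Arc<HashMap<String, String>>>);

impl AliasTable {
    /// Returns an immutable snapshot. The lock is held only long enough to
    /// clone the `Arc`; lookups then run on the snapshot without any lock.
    pub fn snapshot(&self) -> Arc<HashMap<String, String>> {
        let guard = self.0.read().expect("alias table lock poisoned");
        Arc::clone(&*guard)
    }

    /// Copy-on-write insert: clone the map, modify the copy, then publish it.
    /// Snapshots taken earlier keep seeing the old map.
    pub fn insert(&self, vortex_id: &str, arrow_name: &str) {
        let mut guard = self.0.write().expect("alias table lock poisoned");
        let mut next: HashMap<String, String> = (**guard).clone();
        next.insert(vortex_id.to_owned(), arrow_name.to_owned());
        *guard = Arc::new(next);
    }
}
```

Writes cost a full map clone, which is acceptable only because registration is expected to be rare compared with lookups.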

Comment thread vortex-array/src/dtype/session.rs Outdated
//! Test extension types for exercising the [`ExtVTable`] contract.

- mod divisible_int;
+ pub(crate) mod divisible_int;
Contributor

?

Contributor Author

for the newly added extension_roundtrip tests, i needed a test extension type with metadata. What is your comment here? 😄

Contributor

It was: do we need to do this?

Comment thread vortex-tensor/src/types/fixed_shape/mod.rs Outdated
Contributor

Why remove this in this PR?

Contributor Author

Arrow requires metadata to be UTF-8 JSON, so I swapped. Do we want to keep the proto around?

Contributor

@connortsui20 is this okay?

Contributor

But in general JSON is less space efficient than proto

Contributor

I think it's fine so that we are compat with arrow

Comment thread vortex-tensor/src/lib.rs Outdated
Comment thread vortex-tensor/src/types/vector/vtable.rs Outdated
palaska added 5 commits April 27, 2026 09:33
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
@palaska palaska requested a review from joseph-isaacs April 27, 2026 11:26
palaska added 2 commits April 27, 2026 12:41
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Comment thread vortex-array/src/dtype/arrow.rs Outdated
match field.extension_type_metadata() {
    None | Some("") => Ok(Cow::Borrowed(&[])),
    Some(s) if is_canonical => Ok(Cow::Borrowed(s.as_bytes())),
    Some(s) => BASE64_STANDARD
Contributor Author

@palaska palaska Apr 27, 2026

@joseph-isaacs should I get rid of this and assert valid UTF-8 on extension metadata serde? Then this codepath will be simpler (String::from_utf8) and we can make serialize_metadata return VortexResult<String> instead of bytes.

Contributor

@joseph-isaacs joseph-isaacs Apr 27, 2026

Do we store base64 JSON in the Vortex file for this type? Surely that is a waste of space.

Contributor Author

This is only the Arrow serde path. Metadata for the fixed shape tensor is UTF-8 encoded JSON bytes, no base64. I guess it's still larger than proto, but it's metadata at the end of the day. Your call :)
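
The simplification proposed earlier in this thread (require extension metadata to be valid UTF-8 so the base64 branch disappears) might look roughly like the sketch below. Both function names are hypothetical, the error type is the standard library's rather than `VortexResult`, and the JSON in the usage note is only illustrative.

```rust
use std::string::FromUtf8Error;

/// Hypothetical export path: serialized extension metadata must already be
/// valid UTF-8, so it can be stored directly as Arrow field metadata.
pub fn metadata_to_arrow_string(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Hypothetical import path: missing or empty metadata becomes an empty
/// slice; anything else is passed through as raw UTF-8 bytes, with no
/// base64 decoding step.
pub fn metadata_from_arrow_string(metadata: Option<&str>) -> &[u8] {
    let empty: &[u8] = &[];
    match metadata {
        None | Some("") => empty,
        Some(s) => s.as_bytes(),
    }
}
```

Under this scheme a JSON payload such as `{"shape":[2,3]}` passes through unchanged, while metadata that is not valid UTF-8 is rejected at serialization time instead of being wrapped in base64.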

palaska added 18 commits April 27, 2026 15:55
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
…ndary

Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Labels

changelog/feature A new feature

4 participants