feat(datasource): add Java-implemented data sources #65

Merged
andygrove merged 18 commits into apache:main from andygrove:feat/columnar-value-udf on May 19, 2026

andygrove (Member) commented May 18, 2026

Which issue does this PR close?

Rationale for this change

Java users have no way to expose custom in-process tables (JDBC scans, in-memory
collections, custom file formats, etc.) to DataFusion. This adds a minimal
DataSource interface and the JNI wiring to register it on a SessionContext.
The implementation mirrors the existing scalar-UDF JNI pattern.

What changes are included in this PR?

  • New public DataSource interface in org.apache.datafusion with
    Schema schema() and ArrowReader scan(BufferAllocator).
  • SessionContext.registerDataSource(name, source) registers a Java-backed
    table; schema is captured at registration time.
  • JniBridge.invokeDataSourceScan exports the user's ArrowReader through
    the Arrow C Data Interface (zero-copy).
  • Native: JavaDataSource (implementing TableProvider) and JavaScanExec
    (implementing ExecutionPlan) in native/src/data_source.rs, plus the JNI
    entry point.
  • Shared jthrowable_to_string helper lifted into native/src/jni_util.rs
    so the UDF and data-source paths share Java-exception formatting.
  • New JdbcExample in the examples module demonstrating an end-to-end
    JDBC-backed DataSource: populates an H2 in-memory table, wraps a JDBC
    query in a JdbcDataSource, registers it, and runs an aggregation query.
    Streams batches via arrow-jdbc's ArrowVectorIterator wrapped in a small
    ArrowReader subclass — no IPC re-serialisation. Adds arrow-jdbc and
    com.h2database:h2 as examples-module deps.
  • v1 scope: single partition, no projection or filter pushdown into Java
    (DataFusion projects/filters on top), no deregisterTable. Multi-partition,
    pushdown, and deregistration are listed as follow-ups in the user guide.
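
The registration contract described in the list above can be modelled with a small JDK-only sketch. It is not the PR's code: the type parameters `S`, `A`, and `R` stand in for Arrow's `Schema`, `BufferAllocator`, and `ArrowReader` so the example needs no dependencies, and `ToyContext` and `CountingSource` are hypothetical names introduced for illustration. It shows the two behaviours the description states or implies: `schema()` is read once at registration, while `scan()` is presumably called once per scan of the table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy, JDK-only model of the registration contract; not the PR's actual classes. */
final class DataSourceContractSketch {

    /**
     * Mirrors the shape of the PR's DataSource. S, A and R are stand-ins
     * (assumptions) for Arrow's Schema, BufferAllocator and ArrowReader.
     */
    interface DataSource<S, A, R> {
        S schema();
        R scan(A allocator);
    }

    /** Hypothetical stand-in for SessionContext: captures the schema once, scans on demand. */
    static final class ToyContext<S, A, R> {
        private final Map<String, S> schemas = new HashMap<>();
        private final Map<String, DataSource<S, A, R>> sources = new HashMap<>();

        void registerDataSource(String name, DataSource<S, A, R> source) {
            schemas.put(name, source.schema()); // schema() is called here and not again
            sources.put(name, source);
        }

        S schemaOf(String name) {
            return schemas.get(name);
        }

        R scan(String name, A allocator) {
            return sources.get(name).scan(allocator);
        }
    }

    /** Example source that counts how often each method is invoked. */
    static final class CountingSource implements DataSource<List<String>, Object, List<long[]>> {
        int schemaCalls;
        int scanCalls;

        @Override
        public List<String> schema() {
            schemaCalls++;
            return List.of("id", "amount");
        }

        @Override
        public List<long[]> scan(Object allocator) {
            scanCalls++;
            List<long[]> rows = new ArrayList<>();
            rows.add(new long[] {1L, 100L});
            rows.add(new long[] {2L, 250L});
            return rows;
        }
    }

    public static void main(String[] args) {
        ToyContext<List<String>, Object, List<long[]>> ctx = new ToyContext<>();
        CountingSource source = new CountingSource();
        ctx.registerDataSource("orders", source);
        // Two scans of one table, as a UNION ALL over the same table would need.
        ctx.scan("orders", new Object());
        ctx.scan("orders", new Object());
        System.out.println(source.schemaCalls + " schema call(s), "
                + source.scanCalls + " scan call(s)");
    }
}
```

A real implementation would return an `ArrowReader` whose buffers come from the allocator passed to `scan`, rather than a plain list.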

Run the JDBC example with:

./mvnw -pl examples exec:exec -Dexec.mainClass=org.apache.datafusion.examples.JdbcExample

Are these changes tested?

Yes — eight integration tests in
`core/src/test/java/org/apache/datafusion/DataSourceTest.java`:

  • `SELECT *` happy path
  • `UNION ALL` over the same registered table (multi-scan)
  • Empty stream
  • Column projection through DataFusion
  • Two registered tables joinable in one query
  • Schema-mismatch surfaces a readable error
  • `scan()` throwing propagates the Java exception class and message
  • `scan()` returning null is rejected with `IllegalStateException`
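
The last two cases in the list above describe an error contract that can be sketched as a guard around the user's callback. This is a JDK-only illustration under assumptions: the PR does not say whether the null check lives on the Java or the native side, and both the `checkedScan` helper and the `class: message` format of `describe` are hypothetical, not the output of the PR's `jthrowable_to_string`.

```java
import java.util.function.Function;

/** Toy model of the scan error contract; names and message format are assumptions. */
final class ScanGuardSketch {

    /** Calls the user's scan function and rejects a null reader. */
    static <A, R> R checkedScan(Function<A, R> scan, A allocator) {
        R reader = scan.apply(allocator); // exceptions thrown by user code propagate unchanged
        if (reader == null) {
            throw new IllegalStateException("scan() returned null");
        }
        return reader;
    }

    /** Hypothetical "class: message" rendering of a Java exception. */
    static String describe(Throwable t) {
        return t.getClass().getName() + ": " + t.getMessage();
    }

    public static void main(String[] args) {
        try {
            checkedScan(allocator -> null, new Object());
        } catch (IllegalStateException e) {
            System.out.println(describe(e));
        }
        try {
            checkedScan(allocator -> {
                throw new UnsupportedOperationException("no JDBC driver");
            }, new Object());
        } catch (UnsupportedOperationException e) {
            System.out.println(describe(e));
        }
    }
}
```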

The JDBC example is exercised end-to-end manually (output verified: aggregation
produces `alice → 119.99` and `bob → 7.50` over the H2 fixture). It compiles
as part of the standard `mvn package` build alongside the other example
classes (`AddOneExample`, `DataFrameExample`, etc.) — none of which carry
JUnit tests, by convention.

Are there any user-facing changes?

Yes — new public `DataSource` interface and `SessionContext.registerDataSource`
method, plus a new user-guide page at `docs/source/user-guide/data-source.md`
covering the API, contract, threading, errors, and v1 limitations. The
runnable `JdbcExample` shows the API in action against an embedded H2.

andygrove added 16 commits May 18, 2026 14:28
Arrow Java's Data.exportArrayStream requires the reader's buffers to share
the same allocator root as the export allocator. The previous workaround
re-serialised every batch through IPC bytes, defeating zero-copy.

The correct fix is to require DataSource.scan to accept a BufferAllocator
argument (the framework's own ALLOCATOR) and allocate its reader's buffers
from it. This mirrors the ScalarFunction.evaluate(BufferAllocator, ...) API.
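
The shared-root constraint in this commit message can be pictured with a toy allocator tree. The sketch below is JDK-only and does not use Arrow's `BufferAllocator`; `ToyAllocator` and `sharesRoot` are hypothetical names, and the check only models the rule as the message states it.

```java
/** Toy model of the "same allocator root" rule; not Arrow's allocator API. */
final class AllocatorRootSketch {

    /** Every allocator is either a root or a child of another allocator. */
    static final class ToyAllocator {
        private final ToyAllocator parent;

        ToyAllocator() {
            this(null);
        }

        private ToyAllocator(ToyAllocator parent) {
            this.parent = parent;
        }

        ToyAllocator newChild() {
            return new ToyAllocator(this);
        }

        ToyAllocator root() {
            return parent == null ? this : parent.root();
        }
    }

    /** Models the constraint: export needs both allocators to share one root. */
    static boolean sharesRoot(ToyAllocator readerAllocator, ToyAllocator exportAllocator) {
        return readerAllocator.root() == exportAllocator.root();
    }

    public static void main(String[] args) {
        ToyAllocator framework = new ToyAllocator();
        ToyAllocator userOwned = new ToyAllocator();
        // Reader built from the allocator handed to scan(): same root, export is possible.
        System.out.println(sharesRoot(framework.newChild(), framework));
        // Reader built from an unrelated allocator: different root.
        System.out.println(sharesRoot(userOwned.newChild(), framework));
    }
}
```

Passing the framework's allocator into `scan` puts the reader in the first situation by construction, which is why no IPC round trip is needed.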

@andygrove (Member, Author)

@pgwhalen could you review?

* is closed.
* @throws RuntimeException if native registration fails.
*/
public void registerDataSource(String name, DataSource source) {

Since this is basically a simplified API on top of the Rust SessionContext::register_table function, what if we gave the Java function the same name (registerTable) and made the interface it accepts TableProvider?

I get that this PR is basically barebones support for custom table registration in java, and that data_source.rs is handling a lot so the java user gets a simple scan() callback. I think only providing that for now makes sense as a first step (and will always be useful for simple cases), but I'd like to make sure this can evolve towards all the flexibility of the TableProvider trait that interacts with ExecutionPlan and ultimately an ArrowReader. The LiteralGuaranteeTest from my bindings demonstrates what this could look like and what it enables (filter pushdown).

To keep things minimal for this PR, maybe we could just

  • rename registerDataSource to registerTable
  • rename the DataSource interface to TableProvider
  • provide a simple implementation of TableProvider that just holds what the current DataSource does - not sure about a name for that, but maybe like SimpleTableProvider or FullScanTableProvider or something

Then we can make TableProvider more featured over time. Totally open to other ideas too.

Part of my motivation in renaming is that in the back of my head I'm thinking about eventual support for the separate DataSource, so don't want to clash on naming.

@andygrove (Member, Author)

Thanks @pgwhalen. I have addressed your feedback.

andygrove added 2 commits May 19, 2026 08:38
Address PR apache#65 review: align Java-side naming with DataFusion's Rust
TableProvider trait and free up the DataSource name for the separate
datafusion-datasource concept in the future. Add SimpleTableProvider
as a convenience wrapper for the (schema, scan-fn) case.

- DataSource -> TableProvider (Java interface)
- SessionContext.registerDataSource -> registerTable
- JniBridge.invokeDataSourceScan -> invokeTableScan
- Native JavaDataSource struct + module renamed to JavaTableProvider /
  table_provider.rs; JNI entry point + signature updated accordingly
- New SimpleTableProvider class wraps a Schema and a
  Function<BufferAllocator, ArrowReader> for the common no-pushdown case
- Test, example, and user-guide docs updated to match
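
The wrapper described in this commit can be sketched in a few lines. This is a JDK-only approximation, not the merged class: the type parameters `S`, `A`, and `R` stand in for Arrow's `Schema`, `BufferAllocator`, and `ArrowReader`, and the null checks in the constructor are an assumption rather than documented behaviour.

```java
import java.util.Objects;
import java.util.function.Function;

/** Toy model of the TableProvider / SimpleTableProvider pair; not the merged code. */
final class SimpleTableProviderSketch {

    /** Same shape as the renamed interface; S, A, R are stand-ins for Arrow types. */
    interface TableProvider<S, A, R> {
        S schema();
        R scan(A allocator);
    }

    /** Holds a fixed schema and delegates each scan to a function. */
    static final class SimpleTableProvider<S, A, R> implements TableProvider<S, A, R> {
        private final S schema;
        private final Function<A, R> scanFn;

        SimpleTableProvider(S schema, Function<A, R> scanFn) {
            // Null checks are an assumption made for this sketch.
            this.schema = Objects.requireNonNull(schema, "schema");
            this.scanFn = Objects.requireNonNull(scanFn, "scanFn");
        }

        @Override
        public S schema() {
            return schema;
        }

        @Override
        public R scan(A allocator) {
            return scanFn.apply(allocator);
        }
    }

    public static void main(String[] args) {
        TableProvider<String, Object, int[]> numbers =
                new SimpleTableProvider<>("n: Int32", allocator -> new int[] {1, 2, 3});
        System.out.println(numbers.schema() + " -> "
                + numbers.scan(new Object()).length + " rows");
    }
}
```

Keeping `TableProvider` as the interface and `SimpleTableProvider` as one implementation leaves room for richer implementations later, such as the filter pushdown raised in the review.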

pgwhalen left a comment

Looks good, thanks!

andygrove merged commit 89d5496 into apache:main on May 19, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Add support for Java data sources
