feat(dataframe): expose sort and repartition by LantaoJin · Pull Request #66 · apache/datafusion-java

LantaoJin · 2026-05-19T04:13:16Z

Which issue does this PR close?

Closes #42 (feat(dataframe): expose sort and repartition).

Rationale for this change

Two ordering / layout primitives have been missing from the Java DataFrame API: sort (no way to order without dropping to SQL) and repartition (no way to control parallelism / partitioning). Both are first-class on the upstream Rust DataFrame, in the default feature set, with no Cargo flag impact. This PR exposes them additively.

What changes are included in this PR?

SortExpr -- new value class. Final class with static factories SortExpr.asc(String) / SortExpr.desc(String) and a fluent nullsFirst(boolean) setter. Mirrors DataFusion's expr::Sort{ expr, asc, nulls_first }. Defaults match upstream: ASC → NULLs last, DESC → NULLs first.
DataFrame.sort(SortExpr...) -- ordering. Empty array is a no-op (matches DataFrame::sort(vec![])); each SortExpr is null-checked Java-side; the receiver remains usable.
DataFrame.repartitionRoundRobin(int) -- maps to Partitioning::RoundRobinBatch(usize). Java validates numPartitions > 0.
DataFrame.repartitionHash(int, String...) -- maps to Partitioning::Hash(Vec<Expr>, usize). Column-name keys for v1; the native handler translates each name through datafusion::logical_expr::col(...). Java validates numPartitions > 0, columns non-null/non-empty, no null elements.
native/src/lib.rs -- three JNI handlers (sortRows, repartitionRoundRobinRows, repartitionHashRows) using the existing try_unwrap_or_throw plumbing. Boolean arrays are decoded via JBooleanArray + get_boolean_array_region (jni 0.21).
Imports added: datafusion::logical_expr::{col, Partitioning, SortExpr}, jni::objects::JBooleanArray.
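As a rough illustration of the SortExpr contract described above, the sketch below models a value class with the stated defaults (ASC → NULLs last, DESC → NULLs first). The names SortExprSketch and SortArgsSketch are hypothetical stand-ins, not the PR's source. Two things are assumptions: that the fluent nullsFirst(boolean) setter returns a modified copy, and that sort keys cross JNI as parallel arrays. The description says only that boolean arrays are decoded natively.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for org.apache.datafusion.SortExpr; not the PR's actual source.
final class SortExprSketch {
    private final String column;
    private final boolean ascending;
    private final boolean nullsFirst;

    private SortExprSketch(String column, boolean ascending, boolean nullsFirst) {
        this.column = Objects.requireNonNull(column, "column");
        this.ascending = ascending;
        this.nullsFirst = nullsFirst;
    }

    // Defaults stated in the PR: ASC puts NULLs last, DESC puts NULLs first.
    static SortExprSketch asc(String column) {
        return new SortExprSketch(column, true, false);
    }

    static SortExprSketch desc(String column) {
        return new SortExprSketch(column, false, true);
    }

    // Assumption: the fluent setter returns a modified copy rather than mutating.
    SortExprSketch nullsFirst(boolean value) {
        return new SortExprSketch(column, ascending, value);
    }

    String column() { return column; }
    boolean isAscending() { return ascending; }
    boolean isNullsFirst() { return nullsFirst; }
}

// Assumed JNI argument layout: one entry per sort key in three parallel arrays.
final class SortArgsSketch {
    final String[] columns;
    final boolean[] ascending;
    final boolean[] nullsFirst;

    SortArgsSketch(List<SortExprSketch> exprs) {
        int n = exprs.size();
        columns = new String[n];
        ascending = new boolean[n];
        nullsFirst = new boolean[n];
        for (int i = 0; i < n; i++) {
            // Each element is null-checked on the Java side before crossing JNI.
            SortExprSketch e = Objects.requireNonNull(exprs.get(i), "exprs[" + i + "]");
            columns[i] = e.column();
            ascending[i] = e.isAscending();
            nullsFirst[i] = e.isNullsFirst();
        }
    }
}
```

With this shape, an empty list yields three zero-length arrays, which lines up with the no-op behaviour described for an empty sort.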

Why typed SortExpr instead of the SQL-string flavour the issue suggests as option 1: DataFrame::parse_sql_expr parses a single expression, not an ORDER BY list, and DataFusion 53.1 has no parse_sort_exprs helper. The string flavour would force hand-rolled SQL parsing on the native side. The issue authorises starting at option 2; the SQL-string flavour can be layered on later if/when an Expr builder lands.

Out of scope (for follow-ups):

SQL-string sort flavour (df.sort("a ASC, b DESC NULLS FIRST")).
Sort-key complex expressions (SortExpr.asc("a + b")). The field is named column (not expr) to make this contract enforceable.
Partitioning::DistributeBy and Partitioning::Hash with arbitrary expressions.
Partition-count assertions in tests -- the binding does not yet expose collect_partitioned. Tests assert the row-preservation invariant only.

Are these changes tested?

Yes -- 20 new tests across SortExprTest and DataFrameTransformationsTest, plus six new lines extending the existing close/collect coverage.

Are there any user-facing changes?

Yes -- purely additive. New public API:

org.apache.datafusion.SortExpr (value class)
DataFrame.sort(SortExpr...) → DataFrame
DataFrame.repartitionRoundRobin(int) → DataFrame
DataFrame.repartitionHash(int, String...) → DataFrame

No API removals, no deprecations, no behaviour change for existing callers. No Cargo feature changes; binary size is unchanged.
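The Java-side argument checks listed for the two repartition methods (numPartitions > 0, columns non-null and non-empty, no null elements) could look roughly like the sketch below. RepartitionArgsSketch and its method names are hypothetical, and the exception types are an assumption, since the description does not say which exceptions the binding throws.

```java
import java.util.Objects;

// Hypothetical helper; the PR's actual validation code and exception types may differ.
final class RepartitionArgsSketch {
    private RepartitionArgsSketch() {}

    // Applies to both the round-robin and hash variants: the partition count must be positive.
    static int checkNumPartitions(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException(
                "numPartitions must be > 0, got " + numPartitions);
        }
        return numPartitions;
    }

    // Hash variant only: needs at least one column name and no null entries.
    static String[] checkHashColumns(String... columns) {
        Objects.requireNonNull(columns, "columns");
        if (columns.length == 0) {
            throw new IllegalArgumentException("columns must not be empty");
        }
        for (int i = 0; i < columns.length; i++) {
            Objects.requireNonNull(columns[i], "columns[" + i + "]");
        }
        // Defensive copy so later caller-side mutation cannot affect the native call.
        return columns.clone();
    }
}
```

Rejecting bad arguments before the JNI call keeps failures as ordinary Java exceptions with a clear message, rather than errors surfaced from the native handler.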

Add DataFrame.sort(SortExpr...), DataFrame.repartitionRoundRobin(int), and DataFrame.repartitionHash(int, String...). SortExpr is a small value class with static asc/desc factories and a fluent nullsFirst setter, mirroring DataFusion's expr::Sort. The SQL-string sort flavour the issue lists as option 1 is deferred: DataFusion 53.1 has no parse_sort_exprs helper on DataFrame, so the string flavour would force hand-rolled ORDER BY parsing. The typed SortExpr API is the same shape the issue authorises in option 2. repartitionHash takes column-name keys for v1 and translates each through col(...) in the native handler. Expression keys are deferred until a Java-side Expr builder lands.


LantaoJin added 2 commits May 19, 2026 04:02

Merge remote-tracking branch 'upstream/main' into feat/dataframe-sort-repartition

04cb4fd

LantaoJin force-pushed the feat/dataframe-sort-repartition branch from b237591 to 04cb4fd Compare May 22, 2026 03:20

LantaoJin wants to merge 2 commits into apache:main from LantaoJin:feat/dataframe-sort-repartition
