feat(dataframe): expose sort and repartition #66

Open
LantaoJin wants to merge 2 commits into apache:main from LantaoJin:feat/dataframe-sort-repartition

@LantaoJin LantaoJin commented May 19, 2026

Which issue does this PR close?

Rationale for this change

Two ordering and layout primitives have been missing from the Java DataFrame API: sort (there is no way to order rows without dropping to SQL) and repartition (there is no way to control parallelism or partitioning). Both are first-class on the upstream Rust DataFrame and are part of the default feature set, so exposing them needs no Cargo feature changes. This PR exposes them additively.

What changes are included in this PR?

  • SortExpr -- new value class. Final class with static factories SortExpr.asc(String) / SortExpr.desc(String) and a fluent nullsFirst(boolean) setter. Mirrors DataFusion's expr::Sort { expr, asc, nulls_first }. Defaults match upstream: ASC → NULLs last, DESC → NULLs first.
  • DataFrame.sort(SortExpr...) -- ordering. Empty array is a no-op (matches DataFrame::sort(vec![])); each SortExpr is null-checked Java-side; the receiver remains usable.
  • DataFrame.repartitionRoundRobin(int) -- maps to Partitioning::RoundRobinBatch(usize). Java validates numPartitions > 0.
  • DataFrame.repartitionHash(int, String...) -- maps to Partitioning::Hash(Vec<Expr>, usize). Column-name keys for v1; the native handler translates each name through datafusion::logical_expr::col(...). Java validates numPartitions > 0, columns non-null/non-empty, no null elements.
  • native/src/lib.rs -- three JNI handlers (sortRows, repartitionRoundRobinRows, repartitionHashRows) using the existing try_unwrap_or_throw plumbing. Boolean arrays are decoded via JBooleanArray + get_boolean_array_region (jni 0.21).
  • Imports added: datafusion::logical_expr::{col, Partitioning, SortExpr}, jni::objects::JBooleanArray.
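
To make the SortExpr contract above concrete, here is a minimal, self-contained sketch of a value class with the described factories and defaults. It is not the source from this PR: the class name SortExprSketch, the accessor names, the choice to return a copy from nullsFirst(boolean), and the flattening helpers (one plausible way to hand columns and flags to a JNI handler as parallel arrays) are all assumptions made for illustration.

```java
import java.util.Objects;

/** Hypothetical sketch of a SortExpr-style value class; not the PR's actual source. */
final class SortExprSketch {
    private final String column;
    private final boolean ascending;
    private final boolean nullsFirst;

    private SortExprSketch(String column, boolean ascending, boolean nullsFirst) {
        this.column = Objects.requireNonNull(column, "column");
        this.ascending = ascending;
        this.nullsFirst = nullsFirst;
    }

    /** Ascending order; NULLs sort last by default, as the PR describes. */
    static SortExprSketch asc(String column) {
        return new SortExprSketch(column, true, false);
    }

    /** Descending order; NULLs sort first by default, as the PR describes. */
    static SortExprSketch desc(String column) {
        return new SortExprSketch(column, false, true);
    }

    /** Overrides NULL placement. Returning a copy (immutability) is an assumption. */
    SortExprSketch nullsFirst(boolean nullsFirst) {
        return new SortExprSketch(column, ascending, nullsFirst);
    }

    String column() { return column; }
    boolean isAscending() { return ascending; }
    boolean isNullsFirst() { return nullsFirst; }

    /** Assumed marshalling step: column names as one array for the native call. */
    static String[] columns(SortExprSketch... exprs) {
        String[] out = new String[exprs.length];
        for (int i = 0; i < exprs.length; i++) {
            out[i] = Objects.requireNonNull(exprs[i], "sort expression " + i).column;
        }
        return out;
    }

    /** Assumed marshalling step: ascending flags as a parallel boolean[]. */
    static boolean[] ascendingFlags(SortExprSketch... exprs) {
        boolean[] out = new boolean[exprs.length];
        for (int i = 0; i < exprs.length; i++) {
            out[i] = Objects.requireNonNull(exprs[i], "sort expression " + i).ascending;
        }
        return out;
    }

    /** Assumed marshalling step: NULL-placement flags as a parallel boolean[]. */
    static boolean[] nullsFirstFlags(SortExprSketch... exprs) {
        boolean[] out = new boolean[exprs.length];
        for (int i = 0; i < exprs.length; i++) {
            out[i] = Objects.requireNonNull(exprs[i], "sort expression " + i).nullsFirst;
        }
        return out;
    }
}
```

With the real API, the equivalent call would presumably look like df.sort(SortExpr.asc("a"), SortExpr.desc("b").nullsFirst(false)); an empty argument list yields empty arrays here, which lines up with the no-op behaviour described for DataFrame.sort.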

Why typed SortExpr instead of the SQL-string flavour the issue suggests as option 1: DataFrame::parse_sql_expr parses a single expression, not an ORDER BY list, and DataFusion 53.1 has no parse_sort_exprs helper. The string flavour would force hand-rolled SQL parsing on the native side. The issue authorises starting at option 2; the SQL-string flavour can be layered on later if/when an Expr builder lands.

Out of scope (for follow-ups):

  • SQL-string sort flavour (df.sort("a ASC, b DESC NULLS FIRST")).
  • Complex sort-key expressions (SortExpr.asc("a + b")). The field is named column (not expr) to make this contract enforceable.
  • Partitioning::DistributeBy and Partitioning::Hash with arbitrary expressions.
  • Partition-count assertions in tests -- the binding does not yet expose collect_partitioned. Tests assert the row-preservation invariant only.

Are these changes tested?

Yes -- 20 new tests across SortExprTest and DataFrameTransformationsTest, plus six new lines extending the existing close/collect coverage.

Are there any user-facing changes?

Yes -- purely additive. New public API:

  • org.apache.datafusion.SortExpr (value class)
  • DataFrame.sort(SortExpr...) → DataFrame
  • DataFrame.repartitionRoundRobin(int) → DataFrame
  • DataFrame.repartitionHash(int, String...) → DataFrame

No API removals, no deprecations, no behaviour change for existing callers. No Cargo feature changes; binary size is unchanged.
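
For callers, the Java-side checks listed for the repartition methods mean that bad arguments fail before any native call is made. The sketch below shows one way those checks could look. The class and method names are hypothetical, and the exception types (IllegalArgumentException for bad counts and empty column lists, NullPointerException for nulls) are assumptions, since the PR description does not say which exceptions are thrown.

```java
import java.util.Objects;

/** Hypothetical sketch of the argument checks described for the repartition methods. */
final class RepartitionArgsSketch {
    private RepartitionArgsSketch() {}

    /** Rejects zero and negative partition counts (exception type is an assumption). */
    static int checkPartitionCount(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException(
                "numPartitions must be > 0, got " + numPartitions);
        }
        return numPartitions;
    }

    /** Rejects a null array, an empty array, and null elements; returns a defensive copy. */
    static String[] checkHashColumns(String... columns) {
        Objects.requireNonNull(columns, "columns");
        if (columns.length == 0) {
            throw new IllegalArgumentException("columns must not be empty");
        }
        for (int i = 0; i < columns.length; i++) {
            if (columns[i] == null) {
                throw new NullPointerException("columns[" + i + "] is null");
            }
        }
        return columns.clone();
    }
}
```

Validating on the Java side keeps error messages in Java terms and avoids crossing the JNI boundary only to have the native handler reject the input.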

LantaoJin added 2 commits May 19, 2026 04:02
Add DataFrame.sort(SortExpr...), DataFrame.repartitionRoundRobin(int),
and DataFrame.repartitionHash(int, String...). SortExpr is a small value
class with static asc/desc factories and a fluent nullsFirst setter,
mirroring DataFusion's expr::Sort.

The SQL-string sort flavour the issue lists as option 1 is deferred:
DataFusion 53.1 has no parse_sort_exprs helper on DataFrame, so the
string flavour would force hand-rolled ORDER BY parsing. The typed
SortExpr API is the same shape the issue authorises in option 2.

repartitionHash takes column-name keys for v1 and translates each
through col(...) in the native handler. Expression keys are deferred
until a Java-side Expr builder lands.
LantaoJin force-pushed the feat/dataframe-sort-repartition branch from b237591 to 04cb4fd on May 22, 2026 03:20

Development

Successfully merging this pull request may close these issues.

feat(dataframe): expose sort and repartition
