Is your feature request related to a problem or challenge?
DataFusion's DataFrame API offers eight set-operation methods — union,
intersect, except, and their *_by_name / *_distinct variants — and
none of them are reachable from Java today.
Describe the solution you'd like
Expose the following on DataFrame, each taking another DataFrame:
union(DataFrame other) — by-position, keeps duplicates
unionDistinct(DataFrame other) — by-position, deduplicated
unionByName(DataFrame other) — by-name, keeps duplicates
unionByNameDistinct(DataFrame other) — by-name, deduplicated
intersect(DataFrame other) — INTERSECT ALL
intersectDistinct(DataFrame other) — INTERSECT
except(DataFrame other) — EXCEPT ALL
exceptDistinct(DataFrame other) — EXCEPT
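To make the by-position / by-name and keep / deduplicate distinctions concrete, here is a minimal pure-Java model of the union variants. It is only a sketch: `UnionSemanticsSketch` and its methods are hypothetical names, rows are plain lists rather than Arrow batches, and it assumes both sides carry the same column names (possibly in a different order).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Toy model of the union row semantics; not the DataFusion or binding API. */
final class UnionSemanticsSketch {
    private UnionSemanticsSketch() {}

    /** By-position, keeps duplicates: right rows are appended unchanged. */
    static List<List<Object>> unionByPosition(List<List<Object>> left,
                                              List<List<Object>> right) {
        List<List<Object>> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    /** By-name, keeps duplicates: right rows are reordered to the left column order. */
    static List<List<Object>> unionByName(List<String> leftCols, List<List<Object>> left,
                                          List<String> rightCols, List<List<Object>> right) {
        List<List<Object>> out = new ArrayList<>(left);
        for (List<Object> row : right) {
            List<Object> reordered = new ArrayList<>();
            for (String col : leftCols) {
                int idx = rightCols.indexOf(col);
                if (idx < 0) {
                    // Assumption of this sketch: both sides share the same column names.
                    throw new IllegalArgumentException("column missing on right side: " + col);
                }
                reordered.add(row.get(idx));
            }
            out.add(reordered);
        }
        return out;
    }

    /** The *Distinct variants additionally drop duplicate rows (first occurrence wins). */
    static List<List<Object>> distinct(List<List<Object>> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }
}
```

With left columns (a, b) and right columns (b, a), `unionByPosition` stacks the right rows as they are, while `unionByName` swaps their values into (a, b) order first. Tests for the real methods could use fixtures of this shape.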
Lifecycle question worth deciding up front: do these consume the
right-hand DataFrame? DataFusion's Rust API takes dataframe: DataFrame
(owned), so the Java side will need to either consume other's native
handle (and forbid further use, like collect) or clone the underlying
LogicalPlan on the native side. Suggest cloning — simpler caller
contract, and LogicalPlan clone is cheap.
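The two lifecycle options can be compared with a toy model. This is a sketch under loud assumptions: `FrameSketch` is a hypothetical class, a `String` stands in for the native plan handle, and no JNI is involved. It only illustrates what the caller would observe under each contract.

```java
/** Toy model of the two ownership contracts for the right-hand DataFrame. */
final class FrameSketch {
    private final String plan; // stands in for a native LogicalPlan handle (hypothetical)
    private boolean consumed;  // set once the handle has been handed over

    FrameSketch(String plan) {
        this.plan = plan;
    }

    private String handle() {
        if (consumed) {
            throw new IllegalStateException("DataFrame already consumed");
        }
        return plan;
    }

    /** Option 1: consume `other`; any later use of it throws. */
    FrameSketch unionConsuming(FrameSketch other) {
        String leftPlan = handle();
        String rightPlan = other.handle();
        other.consumed = true;
        return new FrameSketch("Union(" + leftPlan + ", " + rightPlan + ")");
    }

    /** Option 2 (the one suggested above): copy the plan; `other` stays usable. */
    FrameSketch unionCloning(FrameSketch other) {
        String rightCopy = new String(other.handle()); // models a native-side plan clone
        return new FrameSketch("Union(" + handle() + ", " + rightCopy + ")");
    }

    /** Any operation that needs a live handle, standing in for real DataFrame methods. */
    String describe() {
        return handle();
    }
}
```

Under the cloning contract, `b.describe()` still works after `a.unionCloning(b)`. Under the consuming contract, the same call throws `IllegalStateException`, which callers would have to know about and guard against.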
Tests in DataFrameTransformationsTest covering each variant against
small fixtures.
Describe alternatives you've considered
UNION / INTERSECT / EXCEPT via SQL. Works but requires
registering both sides as tables.
Additional context
All eight could share one JNI entry point per operation kind (union,
intersect, except), with boolean flags for by-name and distinct. Could
plausibly land as one PR.
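One way to picture the shared entry points is a small dispatch table from (operation kind, by-name, distinct) to the Rust method the native side would call. The sketch below is an assumption about how the binding might be organised, not its actual design: `SetOpDispatchSketch`, `Kind`, and `rustMethod` are hypothetical names, and it treats by-name as applying to union only, matching the eight methods listed above.

```java
import java.util.Locale;

/** Hypothetical mapping from entry-point flags to DataFusion's Rust method names. */
final class SetOpDispatchSketch {
    enum Kind { UNION, INTERSECT, EXCEPT }

    private SetOpDispatchSketch() {}

    /** Returns the name of the Rust DataFrame method for the given kind and flags. */
    static String rustMethod(Kind kind, boolean byName, boolean distinct) {
        if (byName && kind != Kind.UNION) {
            // Only union has *_by_name variants among the eight methods.
            throw new IllegalArgumentException("by-name applies to union only");
        }
        String name = kind.name().toLowerCase(Locale.ROOT);
        if (byName) {
            name += "_by_name";
        }
        if (distinct) {
            name += "_distinct";
        }
        return name;
    }
}
```

For example, `rustMethod(Kind.UNION, true, true)` yields `union_by_name_distinct`, and the valid flag combinations across the three kinds add up to the eight methods.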