feat(dataframe): expose set operations (union, intersect, except) #43

@andygrove

Is your feature request related to a problem or challenge?

DataFusion's DataFrame API offers eight set-operation methods — union,
intersect, except, and their *_by_name / *_distinct variants — and
none of them are reachable from Java today.

Describe the solution you'd like

Expose the following on DataFrame, each taking another DataFrame:

  • union(DataFrame other): by-position, keeps duplicates
  • unionDistinct(DataFrame other): by-position, deduplicated
  • unionByName(DataFrame other): by-name, keeps duplicates
  • unionByNameDistinct(DataFrame other): by-name, deduplicated
  • intersect(DataFrame other): INTERSECT ALL
  • intersectDistinct(DataFrame other): INTERSECT
  • except(DataFrame other): EXCEPT ALL
  • exceptDistinct(DataFrame other): EXCEPT

Lifecycle question worth deciding up front: do these consume the
right-hand DataFrame? DataFusion's Rust API takes dataframe: DataFrame
(owned), so the Java side will need to either consume other's native
handle (and forbid further use, like collect) or clone the underlying
LogicalPlan on the native side. Suggest cloning — simpler caller
contract, and LogicalPlan clone is cheap.
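
To make the two lifecycle options concrete, here is a minimal sketch of the caller-visible difference. Everything in it is hypothetical: `FrameSketch` stands in for the Java DataFrame, a `String` stands in for the native LogicalPlan handle, and only the right-hand side is treated as consumed, since that is the question raised above.

```java
/** Hypothetical stand-in for a DataFrame that wraps a native plan handle. */
final class FrameSketch {
    // Stands in for the native LogicalPlan pointer; null means "consumed".
    private String plan;

    FrameSketch(String plan) {
        this.plan = plan;
    }

    /** Returns the plan, or throws if this frame's handle was taken. */
    String plan() {
        if (plan == null) {
            throw new IllegalStateException("DataFrame already consumed");
        }
        return plan;
    }

    /** Option A: take ownership of other's handle; other is unusable afterwards. */
    FrameSketch unionConsuming(FrameSketch other) {
        String left = plan();
        String right = other.plan();
        other.plan = null; // further calls on other now throw
        return new FrameSketch("Union(" + left + ", " + right + ")");
    }

    /** Option B: copy other's plan; other stays usable. */
    FrameSketch unionCloning(FrameSketch other) {
        return new FrameSketch("Union(" + plan() + ", " + other.plan() + ")");
    }
}
```

With option B, `b` can be reused after `a.unionCloning(b)`; with option A, any later call on `b` fails, which the caller has to know about.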

Tests in DataFrameTransformationsTest covering each variant against
small fixtures.

Describe alternatives you've considered

UNION / INTERSECT / EXCEPT via SQL. Works but requires
registering both sides as tables.

Additional context

All eight could share one JNI entry point per operation kind (union,
intersect, except), with boolean flags selecting the by-name (union
only) and distinct variants. Could plausibly land as one PR.
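
A rough sketch of that shape, to show how little Java-side code the eight methods would need. The names `SetOpBackend` and `DataFrameSketch` are made up for illustration, handles are plain `long` values, and the interface stands in for what would be `native` methods in a real binding.

```java
/** Hypothetical stand-in for the native side: one entry point per operation kind. */
interface SetOpBackend {
    long union(long left, long right, boolean byName, boolean distinct);
    long intersect(long left, long right, boolean distinct);
    long except(long left, long right, boolean distinct);
}

/** Sketch of the proposed surface: eight methods funnel into three entry points. */
final class DataFrameSketch {
    private final SetOpBackend backend;
    private final long handle;

    DataFrameSketch(SetOpBackend backend, long handle) {
        this.backend = backend;
        this.handle = handle;
    }

    long handle() {
        return handle;
    }

    private DataFrameSketch wrap(long newHandle) {
        return new DataFrameSketch(backend, newHandle);
    }

    // Union: flags are (byName, distinct).
    DataFrameSketch union(DataFrameSketch o) {
        return wrap(backend.union(handle, o.handle, false, false));
    }
    DataFrameSketch unionDistinct(DataFrameSketch o) {
        return wrap(backend.union(handle, o.handle, false, true));
    }
    DataFrameSketch unionByName(DataFrameSketch o) {
        return wrap(backend.union(handle, o.handle, true, false));
    }
    DataFrameSketch unionByNameDistinct(DataFrameSketch o) {
        return wrap(backend.union(handle, o.handle, true, true));
    }

    // Intersect and except: a single distinct flag.
    DataFrameSketch intersect(DataFrameSketch o) {
        return wrap(backend.intersect(handle, o.handle, false));
    }
    DataFrameSketch intersectDistinct(DataFrameSketch o) {
        return wrap(backend.intersect(handle, o.handle, true));
    }
    DataFrameSketch except(DataFrameSketch o) {
        return wrap(backend.except(handle, o.handle, false));
    }
    DataFrameSketch exceptDistinct(DataFrameSketch o) {
        return wrap(backend.except(handle, o.handle, true));
    }
}
```

Keeping the flags on the native boundary rather than adding eight entry points also makes it easy to unit test the Java side against a recording backend.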

Labels: enhancement
