feat(dataframe): expose set operations (union, intersect, except)#67

Open
LantaoJin wants to merge 1 commit into apache:main from LantaoJin:feat/dataframe-set-operations

@LantaoJin
Contributor

Which issue does this PR close?

Rationale for this change

DataFusion's DataFrame exposes eight set-operation methods -- the union/intersect/except family with *_distinct and *_by_name variants -- and none have been reachable from Java. Without these, callers fall back to UNION/INTERSECT/EXCEPT via SQL, which loses lazy DataFrame composition and forces both sides to be registered as tables. This PR exposes all eight additively.

What changes are included in this PR?

Eight new methods on DataFrame, each taking another DataFrame:

  • union(other) -- SQL UNION ALL (positional, keeps duplicates)
  • unionDistinct(other) -- SQL UNION (positional, deduplicated)
  • unionByName(other) -- by column name, keeps duplicates; missing columns become NULL
  • unionByNameDistinct(other) -- by column name, deduplicated; missing columns become NULL
  • intersect(other) -- SQL INTERSECT ALL
  • intersectDistinct(other) -- SQL INTERSECT
  • except(other) -- SQL EXCEPT ALL
  • exceptDistinct(other) -- SQL EXCEPT
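
The union variants listed above can be modelled on plain in-memory rows. The following is a minimal sketch of the described semantics only, not the binding's implementation: the class `UnionSemanticsSketch` and its helper methods are hypothetical names, rows are represented as Java collections rather than DataFrames, and row order is kept only to make the output readable (a real query engine may return rows in any order).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the union family; not the binding's code.
public class UnionSemanticsSketch {

    // Models union(other): SQL UNION ALL, duplicates are kept.
    public static <T> List<T> unionAll(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    // Models unionDistinct(other): SQL UNION, duplicates are removed.
    public static <T> List<T> unionDistinct(List<T> left, List<T> right) {
        return new ArrayList<>(new LinkedHashSet<>(unionAll(left, right)));
    }

    // Models unionByName(other): rows are aligned by column name and a
    // column missing on one side is filled with null. Columns are derived
    // from the rows here, which is a simplification of a real schema.
    public static List<Map<String, Object>> unionByName(
            List<Map<String, Object>> left, List<Map<String, Object>> right) {
        List<Map<String, Object>> all = unionAll(left, right);
        Set<String> columns = new LinkedHashSet<>();
        for (Map<String, Object> row : all) {
            columns.addAll(row.keySet());
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : all) {
            Map<String, Object> aligned = new LinkedHashMap<>();
            for (String column : columns) {
                aligned.put(column, row.get(column)); // null when absent
            }
            out.add(aligned);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> a = List.of(1, 2, 2);
        List<Integer> b = List.of(2, 3);
        System.out.println(unionAll(a, b));      // prints [1, 2, 2, 2, 3]
        System.out.println(unionDistinct(a, b)); // prints [1, 2, 3]

        Map<String, Object> leftRow = new LinkedHashMap<>();
        leftRow.put("id", 1);
        leftRow.put("name", "a");
        Map<String, Object> rightRow = new LinkedHashMap<>();
        rightRow.put("id", 2);
        // prints [{id=1, name=a}, {id=2, name=null}]
        System.out.println(unionByName(List.of(leftRow), List.of(rightRow)));
    }
}
```

unionByNameDistinct(other) would correspond to applying the same deduplication step to the output of `unionByName` in this model.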

Are these changes tested?

Yes -- 12 new tests in DataFrameTransformationsTest.

Are there any user-facing changes?

Yes -- purely additive. Eight new methods on DataFrame. No API removals, no deprecations, no behaviour change for existing callers. No Cargo feature changes; binary size is unchanged.

Add eight DataFrame methods mirroring DataFusion's set-op family:
union/unionDistinct/unionByName/unionByNameDistinct and
intersect/intersectDistinct/except/exceptDistinct. Method names map
1:1 to the upstream Rust API; each Javadoc spells out SQL semantics
because the *_distinct convention inverts Spark's *All convention.

None of the eight methods consume their arguments. The native handler
clones both DataFrames (LogicalPlan is Arc-backed, clone is cheap),
matching the established non-destructive contract used by every other
transform method.
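
The non-consuming contract can be pictured with an immutable plan node. This is a rough Java analogy under stated assumptions, not the native handler: `NonConsumingSketch` and `Plan` are hypothetical, and sharing a Java reference stands in for cloning an Arc-backed LogicalPlan.

```java
import java.util.List;

// Hypothetical analogy for the non-destructive contract; not the binding's code.
public class NonConsumingSketch {

    // Immutable plan node: combining two plans builds a new node that
    // references its inputs and leaves both of them untouched.
    public static final class Plan {
        public final String op;
        public final List<Plan> inputs;

        private Plan(String op, List<Plan> inputs) {
            this.op = op;
            this.inputs = inputs;
        }

        public static Plan scan(String table) {
            return new Plan("scan(" + table + ")", List.of());
        }

        public Plan union(Plan other) {
            return new Plan("union", List.of(this, other));
        }

        public Plan intersect(Plan other) {
            return new Plan("intersect", List.of(this, other));
        }
    }

    public static void main(String[] args) {
        Plan a = Plan.scan("a");
        Plan b = Plan.scan("b");
        Plan u = a.union(b);
        // Both inputs remain usable after the first call.
        Plan i = a.intersect(b);
        System.out.println(u.inputs.get(0) == a); // prints true
        System.out.println(i.inputs.get(1) == b); // prints true
    }
}
```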

DataFusion implements INTERSECT ALL and EXCEPT ALL as left-semi /
left-anti joins on equality rather than standard SQL bag operators.
The intersect/except Javadocs flag this divergence with a worked
example so callers porting from PostgreSQL or Spark know what to
expect.
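
To make the divergence concrete, here is a small in-memory comparison. It is a sketch that takes the semi-join / anti-join description above at face value and contrasts it with standard SQL bag semantics; the class `IntersectExceptSketch` and its methods are hypothetical, and NULL handling is not modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model contrasting join-based and bag-based semantics.
public class IntersectExceptSketch {

    // Left-semi join on equality: keep every left row that has a match.
    public static <T> List<T> semiJoinIntersect(List<T> left, List<T> right) {
        Set<T> keys = new HashSet<>(right);
        List<T> out = new ArrayList<>();
        for (T row : left) {
            if (keys.contains(row)) {
                out.add(row);
            }
        }
        return out;
    }

    // Left-anti join on equality: keep every left row that has no match.
    public static <T> List<T> antiJoinExcept(List<T> left, List<T> right) {
        Set<T> keys = new HashSet<>(right);
        List<T> out = new ArrayList<>();
        for (T row : left) {
            if (!keys.contains(row)) {
                out.add(row);
            }
        }
        return out;
    }

    // Standard bag INTERSECT ALL: each value appears min(m, n) times.
    public static <T> List<T> bagIntersectAll(List<T> left, List<T> right) {
        Map<T, Integer> remaining = counts(right);
        List<T> out = new ArrayList<>();
        for (T row : left) {
            int n = remaining.getOrDefault(row, 0);
            if (n > 0) {
                out.add(row);
                remaining.put(row, n - 1);
            }
        }
        return out;
    }

    // Standard bag EXCEPT ALL: each value appears max(m - n, 0) times.
    public static <T> List<T> bagExceptAll(List<T> left, List<T> right) {
        Map<T, Integer> remaining = counts(right);
        List<T> out = new ArrayList<>();
        for (T row : left) {
            int n = remaining.getOrDefault(row, 0);
            if (n > 0) {
                remaining.put(row, n - 1); // cancelled by one right row
            } else {
                out.add(row);
            }
        }
        return out;
    }

    private static <T> Map<T, Integer> counts(List<T> rows) {
        Map<T, Integer> counts = new HashMap<>();
        for (T row : rows) {
            counts.merge(row, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> left = List.of(1, 1, 2);
        List<Integer> right = List.of(1);
        System.out.println(semiJoinIntersect(left, right)); // prints [1, 1]
        System.out.println(bagIntersectAll(left, right));   // prints [1]
        System.out.println(antiJoinExcept(left, right));    // prints [2]
        System.out.println(bagExceptAll(left, right));      // prints [1, 2]
    }
}
```

With left rows (1, 1, 2) and right rows (1), the join-based model keeps both copies of 1 on intersect and drops both on except, while the bag model pairs off exactly one copy in each case.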