feat(dataframe): expose set operations (union, intersect, except) #67
Open
LantaoJin wants to merge 1 commit into
Add eight DataFrame methods mirroring DataFusion's set-op family: union/unionDistinct/unionByName/unionByNameDistinct and intersect/intersectDistinct/except/exceptDistinct. Method names map 1:1 to the upstream Rust API; each Javadoc spells out SQL semantics because the *_distinct convention inverts Spark's *All convention.

None of the eight methods consume their arguments. The native handler clones both DataFrames (LogicalPlan is Arc-backed, so the clone is cheap), matching the established non-destructive contract used by every other transform method.

DataFusion implements INTERSECT ALL and EXCEPT ALL as left-semi / left-anti joins on equality rather than as standard SQL bag operators. The intersect/except Javadocs flag this divergence with a worked example so callers porting from PostgreSQL or Spark know what to expect.
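To make the divergence concrete, here is a minimal plain-Java sketch that contrasts standard SQL bag semantics with a literal reading of the left-semi / left-anti join description above. It does not call DataFusion or the Java binding, and it is not the worked example from the Javadocs; the class name `SetOpSemantics` and its method names are hypothetical, and single-column integer rows stand in for real rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model class: illustrates semantics only, not the binding's API.
public class SetOpSemantics {

    // Standard SQL INTERSECT ALL: each value appears min(leftCount, rightCount) times.
    static List<Integer> bagIntersectAll(List<Integer> left, List<Integer> right) {
        List<Integer> remaining = new ArrayList<>(right);
        List<Integer> out = new ArrayList<>();
        for (Integer v : left) {
            // remove(Object) consumes one matching occurrence from the right side
            if (remaining.remove((Object) v)) {
                out.add(v);
            }
        }
        return out;
    }

    // Left-semi join on equality: keep every left row that has at least one match.
    static List<Integer> semiJoin(List<Integer> left, List<Integer> right) {
        List<Integer> out = new ArrayList<>();
        for (Integer v : left) {
            if (right.contains(v)) {
                out.add(v);
            }
        }
        return out;
    }

    // Standard SQL EXCEPT ALL: each value appears max(leftCount - rightCount, 0) times.
    static List<Integer> bagExceptAll(List<Integer> left, List<Integer> right) {
        List<Integer> remaining = new ArrayList<>(right);
        List<Integer> out = new ArrayList<>();
        for (Integer v : left) {
            if (!remaining.remove((Object) v)) {
                out.add(v);
            }
        }
        return out;
    }

    // Left-anti join on equality: keep every left row that has no match at all.
    static List<Integer> antiJoin(List<Integer> left, List<Integer> right) {
        List<Integer> out = new ArrayList<>();
        for (Integer v : left) {
            if (!right.contains(v)) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = List.of(1, 1, 2);
        List<Integer> right = List.of(1, 3);
        System.out.println(bagIntersectAll(left, right)); // [1]
        System.out.println(semiJoin(left, right));        // [1, 1]
        System.out.println(bagExceptAll(left, right));    // [1, 2]
        System.out.println(antiJoin(left, right));        // [2]
    }
}
```

In this model the two readings differ only when the left side carries duplicates: the semi-join keeps both copies of `1`, and the anti-join drops both, whereas the bag operators account for one occurrence at a time.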
Which issue does this PR close?
Rationale for this change
DataFusion's `DataFrame` exposes eight set-operation methods -- the union/intersect/except family with `*_distinct` and `*_by_name` variants -- and none have been reachable from Java. Without these, callers fall back to `UNION`/`INTERSECT`/`EXCEPT` via SQL, which loses lazy DataFrame composition and forces both sides to be registered as tables. This PR exposes all eight additively.

What changes are included in this PR?
Eight new methods on `DataFrame`, each taking another `DataFrame`:
- `union(other)` -- SQL `UNION ALL` (positional, keeps duplicates)
- `unionDistinct(other)` -- SQL `UNION` (positional, deduplicated)
- `unionByName(other)` -- by column name, keeps duplicates; missing columns become NULL
- `unionByNameDistinct(other)` -- by column name, deduplicated; missing columns become NULL
- `intersect(other)` -- SQL `INTERSECT ALL`
- `intersectDistinct(other)` -- SQL `INTERSECT`
- `except(other)` -- SQL `EXCEPT ALL`
- `exceptDistinct(other)` -- SQL `EXCEPT`

Are these changes tested?
Yes -- 12 new tests in `DataFrameTransformationsTest`.

Are there any user-facing changes?
Yes -- purely additive. Eight new methods on `DataFrame`. No API removals, no deprecations, no behaviour change for existing callers. No Cargo feature changes; binary size is unchanged.
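
As an illustration of how the positional and by-name union variants differ, the following plain-Java sketch models rows as lists and columns as lists of names. It is a model of the described semantics only and does not use the binding; `UnionModel`, `row`, `realign`, and the other names are hypothetical, and the output column order for the by-name case (left columns first, then right-only columns) is an assumption rather than something this PR states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical model class: illustrates semantics only, not the binding's API.
public class UnionModel {

    // Convenience constructor for a row; null entries are allowed.
    static List<Object> row(Object... values) {
        return new ArrayList<>(Arrays.asList(values));
    }

    // Positional union that keeps duplicates: right rows are appended unchanged,
    // so column i on the right lands under column i on the left.
    static List<List<Object>> union(List<List<Object>> left, List<List<Object>> right) {
        List<List<Object>> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    // Deduplicate rows while preserving first-seen order.
    static List<List<Object>> distinct(List<List<Object>> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    // Assumed output schema for the by-name union: left columns, then right-only columns.
    static List<String> mergedColumns(List<String> leftCols, List<String> rightCols) {
        LinkedHashSet<String> cols = new LinkedHashSet<>(leftCols);
        cols.addAll(rightCols);
        return new ArrayList<>(cols);
    }

    // Re-order each row to the target schema; columns the row lacks become null.
    static List<List<Object>> realign(List<String> cols, List<List<Object>> rows, List<String> target) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> r : rows) {
            List<Object> aligned = new ArrayList<>();
            for (String c : target) {
                int i = cols.indexOf(c);
                aligned.add(i < 0 ? null : r.get(i));
            }
            out.add(aligned);
        }
        return out;
    }

    // By-name union that keeps duplicates.
    static List<List<Object>> unionByName(List<String> leftCols, List<List<Object>> leftRows,
                                          List<String> rightCols, List<List<Object>> rightRows) {
        List<String> target = mergedColumns(leftCols, rightCols);
        List<List<Object>> out = realign(leftCols, leftRows, target);
        out.addAll(realign(rightCols, rightRows, target));
        return out;
    }
}
```

With left columns `id, name` holding the row `(1, "a")` and right columns `id, score` holding `(2, 9)`, `unionByName` in this model yields `(1, "a", null)` and `(2, null, 9)`. Wrapping either union in `distinct` gives the corresponding deduplicated variant.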