perf: Convert inner joins to semi joins when equivalent#22652
Conversation
|
run benchmarks tpcds |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (39eb398) to 2e7b8e1 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
| /// The columns that are "live" at a plan node, i.e., which of its output | ||
| /// columns are referenced by an ancestor node. Represented as a set of column | ||
| /// indices, relative to the node's schema. | ||
| type LiveColumns = HashSet<usize>; |
There was a problem hiding this comment.
- This is similar but not identical to the
RequiredIndicesdata structure used byOptimizeProjections.RequiredIndicescares about insertion order but we don't, so it seemed cleaner to use a different data structure here. - It would be better to use a bitmap than a
HashSet. We could do so by adding a dependency on a third-party bitmap implementation (e.g., https://github.com/petgraph/fixedbitset, which one of our indirect dependencies already pulls in). But the performance impact should be small, so I'm not sure it's worth adding the dep.
There was a problem hiding this comment.
You could potentially use arrow BooleanBufferBuilder as a mutable bitset: https://docs.rs/arrow/latest/arrow/array/struct.BooleanBufferBuilder.html
Not sure if that is actually faster or not
There was a problem hiding this comment.
BooleanBufferBuilder would be awkward for what we need (i.e., variably-sized mutable bitmap, iterating over just the set bits, etc.)
Now that LiveColumns is a proper type, we can always switch this later. I'm inclined to leave as-is for now, I don't think the performance difference will be large.
|
run benchmarks |
|
run benchmark tpch10 |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
@neilconway FYI quite some plans changed now 🚀 Can you take a look at the PR |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (6b1d38a) to 7843ab3 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (6b1d38a) to 7843ab3 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (6b1d38a) to 7843ab3 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmarks |
|
run benchmark tpch10 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: tpch10 File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmark tpch10 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-fd-convert-to-semi-join (227b2ac) to 04ef3c7 (merge-base) diff using: tpch10 File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
|
For anyone following along, @neilconway and @Dandandan made Semi-join faster in So now we expect this rewrite to be better perofrming 🚀 |
|
The latest benchmark numbers look better overall. Looking at the TPC-H numbers, at least locally w/ SF10, it seems that Q22 improves because switching to a semi-join results in emitting different stats for the join output, and those stats result in avoiding some poor planner choices we made for operators above the semi-join. So while semi-joins are a bit easier to compute stats for than inner joins, the win there is a bit coincidental. Q13 and Q22 are noise (no plan change from this rewrite). I looked at TPC-DS Q14, it's interesting: locally, the rewrite regresses performance by ~15% with >= 2 partitions but improves performance by ~10% with a single partition. The reason seems to be that using a semi-join results in less flexibility for subsequent partition decision making, so the overall plan performs worse because it needs a repartition that the inner join plan didn't. I think we're okay to merge this as-is, but there's a bunch of room for further improvements:
|
|
Given we are cranking on the join implementations this release, let's merge this in. |
|
Thanks again @neilconway and @Dandandan |
|
Looks like the merge queue has caught a logical conflict https://github.com/apache/datafusion/actions/runs/27299136079/job/80639801196 |
|
@alamb Updated, CI should be fixed. |
## Which issue does this PR close? - Closes apache#22594 ## Rationale for this change This PR extends the `EliminateJoin` rewrite pass to replace inner joins with semi joins in some cases. An inner join `L ⋈ R` can be rewritten to a left semi join `L ⋉ R` if two conditions hold: 1. None of R's columns are referenced above the join 2. (a) each L row matches at most one R row, OR (b) the consumers of the join result are insensitive to duplicates (And symmetrically with right semi joins.) ## What changes are included in this PR? * Add `for_each_referenced_index` helper that is used by both `EliminateJoin` and `EliminateProjections` * Introduce `LiveColumns` type to track the "live" (referenced by parent) columns of a plan node * Add inner -> semi join rewrite to `EliminateJoin` * Add unit and SLT tests for rewrite behavior * Update SLT test fixtures for plan changes ## Are these changes tested? Yes; new tests added. ## Are there any user-facing changes? Some plan changes but no behavioral changes. --------- Co-authored-by: Daniël Heres <danielheres@gmail.com>
Which issue does this PR close?
Rationale for this change
This PR extends the
EliminateJoinrewrite pass to replace inner joins with semi joins in some cases. An inner joinL ⋈ Rcan be rewritten to a left semi joinL ⋉ Rif two conditions hold:(And symmetrically with right semi joins.)
What changes are included in this PR?
for_each_referenced_indexhelper that is used by bothEliminateJoinandEliminateProjectionsLiveColumnstype to track the "live" (referenced by parent) columns of a plan nodeEliminateJoinAre these changes tested?
Yes; new tests added.
Are there any user-facing changes?
Some plan changes but no behavioral changes.