[core] Avoid cross-file blob and vector compaction for data evolution#7938
Open
leaves12138 wants to merge 3 commits into
b0abd46 to 6d20e27
JingsongLi (Contributor) reviewed on May 23, 2026 and left a comment:
Left comments:
- Vector store files have the same bug. Lines 374-383 still collect all vector store files from all data files in the group and compact them together, regardless of whether triggerNormalFile is true. The same cross-file compaction problem applies to vector store files. The fix should be applied symmetrically.
- Test is a negative-only assertion. The new test testCompactPlannerDoesNotCompactBlobFilesAcrossDataFiles asserts tasks.isEmpty(), but it would be stronger to also verify that when compactMinFileNum=2 (matching the 2 data files), the blob files DO get compacted together. This proves both the "yes-compact" and "no-compact" paths work. The existing testCompactPlannerWithBlobFiles partially covers this, but the boundary is subtle.
- Edge case: single data file with multiple blob files per field. When triggerNormalFile == false, the per-data-file blob compaction loop calls blobFileGroupsToCompact() for each data file individually. If a single data file has, say, 3 small blob files for the same field (from prior partial compactions or writes), this correctly compacts them. Good.
- Minor: The else branch iterates all dataFiles and plans blob compaction per file. If dataFiles has, say, 5 files but only 2 have blob files, this incurs 5 iterations, but getOrDefault(..., emptyList()) returns empty for the others and blobFileGroupsToCompact([]) returns empty. Harmless but slightly wasteful. Not worth fixing.
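The two planning paths discussed in these comments can be sketched roughly as follows. This is a simplified illustration, not the code in the PR: the class `DedicatedFilePlanSketch`, the `plan` method, the use of plain strings for file names, and the single `minFileNum` threshold (standing in for compactMinFileNum) are all assumptions made for the example. The same routine is meant to apply to blob files and vector store files alike, which is the symmetry the first comment asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedicatedFilePlanSketch {

    /**
     * Hypothetical planner: returns groups of dedicated (blob or vector store)
     * files to compact together. Names and signature are illustrative only.
     */
    static List<List<String>> plan(
            List<String> dataFiles,
            Map<String, List<String>> dedicatedByDataFile,
            boolean triggerNormalFile,
            int minFileNum) {
        List<List<String>> tasks = new ArrayList<>();
        if (triggerNormalFile) {
            // The regular data files are compacted into one row-id range,
            // so their dedicated files may be merged across data files.
            List<String> all = new ArrayList<>();
            for (String dataFile : dataFiles) {
                all.addAll(dedicatedByDataFile.getOrDefault(dataFile, Collections.emptyList()));
            }
            if (all.size() >= minFileNum) {
                tasks.add(all);
            }
        } else {
            // The regular data files stay as they are, so dedicated files
            // are only merged within the range of a single data file.
            for (String dataFile : dataFiles) {
                List<String> own =
                        dedicatedByDataFile.getOrDefault(dataFile, Collections.emptyList());
                if (own.size() >= minFileNum) {
                    tasks.add(new ArrayList<>(own));
                }
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> blobs = new HashMap<>();
        blobs.put("data-1", Arrays.asList("blob-1"));
        blobs.put("data-2", Arrays.asList("blob-2"));
        List<String> dataFiles = Arrays.asList("data-1", "data-2");

        // "no-compact" path: one blob file per data file, nothing to merge.
        System.out.println(plan(dataFiles, blobs, false, 2).size());
        // "yes-compact" path: data files are compacted too, blobs merge together.
        System.out.println(plan(dataFiles, blobs, true, 2).size());
    }
}
```

Checking both calls in `main` mirrors the suggestion to assert the "yes-compact" path alongside the "no-compact" one.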
Purpose
This PR prevents standalone Data Evolution dedicated-file compaction from combining blob or vector-store files that belong to different regular data-file row-id ranges.
Root Cause
The compact planner grouped dedicated files from a data compaction group before planning dedicated compact tasks. If blob or vector-store files were compacted across multiple regular data-file ranges without compacting those regular data files into the same row-id range, the compacted dedicated file could overlap several remaining data files.
Conflict detection groups files by overlapping row-id range and filters blob files from the error message, so the failure surfaced as multiple regular data files with different row-id ranges conflicting during COMPACT.
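The overlap described above can be illustrated with a small sketch. It is not Paimon code: the `Range` class, the inclusive row-id bounds, and the concrete ranges 0-99 and 100-199 are assumptions chosen for the example. It shows that a blob file covering one data file's range overlaps exactly one data file, while the result of merging two blob files from different ranges overlaps both remaining data files.

```java
import java.util.Arrays;
import java.util.List;

public class RowIdRangeOverlapSketch {

    /** Hypothetical inclusive row-id range; not Paimon's actual class. */
    static final class Range {
        final long first;
        final long last;

        Range(long first, long last) {
            this.first = first;
            this.last = last;
        }

        boolean overlaps(Range other) {
            return first <= other.last && other.first <= last;
        }

        /** Range covered by a file produced by merging two files. */
        static Range union(Range a, Range b) {
            return new Range(Math.min(a.first, b.first), Math.max(a.last, b.last));
        }
    }

    /** Counts how many regular data files a dedicated file overlaps. */
    static int countOverlapping(Range dedicated, List<Range> dataFiles) {
        int count = 0;
        for (Range dataFile : dataFiles) {
            if (dedicated.overlaps(dataFile)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two regular data files that are NOT compacted together.
        List<Range> dataFiles = Arrays.asList(new Range(0, 99), new Range(100, 199));

        Range blobA = new Range(0, 99);
        Range blobB = new Range(100, 199);
        Range mergedBlob = Range.union(blobA, blobB);

        System.out.println(countOverlapping(blobA, dataFiles));      // prints 1
        System.out.println(countOverlapping(mergedBlob, dataFiles)); // prints 2
    }
}
```

Once the merged file overlaps both data files, grouping by overlapping row-id range would place the two data files in one group even though their own ranges differ, which matches the conflict reported during COMPACT.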
Changes
Tests
JAVA_HOME=/opt/zulu8.68.0.21-ca-jdk8.0.362-macosx_aarch64 mvn -pl paimon-core spotless:apply
JAVA_HOME=/opt/zulu8.68.0.21-ca-jdk8.0.362-macosx_aarch64 mvn -pl paimon-core -Dtest=DataEvolutionCompactCoordinatorTest test