[core] Avoid cross-file blob and vector compaction for data evolution#7938

Open
leaves12138 wants to merge 3 commits into apache:master from leaves12138:codex/fix-de-blob-compact-range

@leaves12138 leaves12138 commented May 22, 2026

Purpose

This PR prevents standalone Data Evolution dedicated-file compaction from combining blob or vector-store files that belong to different regular data-file row-id ranges.

Root Cause

The compact planner pooled all dedicated files of a data compaction group before planning the dedicated compact tasks. When blob or vector-store files were compacted across several regular data-file ranges while those regular data files were not themselves compacted into a single row-id range, the compacted dedicated file could overlap several of the remaining data files.

Conflict detection groups files by overlapping row-id range and filters blob files out of the error message, so the failure surfaced as a conflict between multiple regular data files with different row-id ranges during COMPACT.
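
To make the overlap concrete, here is a minimal sketch of grouping files by overlapping row-id range. It is not Paimon code: `RangedFile`, `RowIdRangeSketch`, and `groupByOverlap` are hypothetical names, and row-id ranges are modeled as inclusive `[firstRowId, lastRowId]` intervals. It only illustrates why a blob file merged across two data files pulls both data files into one group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of a file covering an inclusive row-id range; not a Paimon class. */
class RangedFile {
    final String name;
    final boolean blob;
    final long firstRowId;
    final long lastRowId;

    RangedFile(String name, boolean blob, long firstRowId, long lastRowId) {
        this.name = name;
        this.blob = blob;
        this.firstRowId = firstRowId;
        this.lastRowId = lastRowId;
    }
}

class RowIdRangeSketch {

    /** Groups files whose row-id ranges overlap transitively: sort by start, then sweep. */
    static List<List<RangedFile>> groupByOverlap(List<RangedFile> files) {
        List<RangedFile> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((RangedFile f) -> f.firstRowId));

        List<List<RangedFile>> groups = new ArrayList<>();
        List<RangedFile> current = new ArrayList<>();
        long currentEnd = Long.MIN_VALUE;
        for (RangedFile f : sorted) {
            // A file starting after the running end cannot overlap the current group.
            if (!current.isEmpty() && f.firstRowId > currentEnd) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(f);
            currentEnd = current.size() == 1 ? f.lastRowId : Math.max(currentEnd, f.lastRowId);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    /** Counts regular (non-blob) files in a group, i.e. what remains once blobs are filtered out. */
    static int dataFileCount(List<RangedFile> group) {
        int count = 0;
        for (RangedFile f : group) {
            if (!f.blob) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        RangedFile d1 = new RangedFile("data-1", false, 0, 99);
        RangedFile d2 = new RangedFile("data-2", false, 100, 199);

        // One blob file per data file: each data file stays in its own group.
        List<List<RangedFile>> separate = groupByOverlap(Arrays.asList(
                d1, d2,
                new RangedFile("blob-1", true, 0, 99),
                new RangedFile("blob-2", true, 100, 199)));

        // One blob file merged across both data files: both data files share a group.
        List<List<RangedFile>> merged = groupByOverlap(Arrays.asList(
                d1, d2, new RangedFile("blob-merged", true, 0, 199)));

        System.out.println("separate blobs: " + separate.size() + " groups");
        System.out.println("merged blob: " + merged.size() + " group with "
                + dataFileCount(merged.get(0)) + " data files");
    }
}
```

In the merged case the only group contains two regular data files with different row-id ranges, which matches the shape of the reported conflict once the blob file is filtered out of the message.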

Changes

  • Keep cross-data-file blob/vector-store compaction only when the corresponding regular data files are compacted in the same task.
  • Plan blob/vector-store compaction per containing data file when no regular data-file compaction is triggered.
  • Update planner tests for both the no-compact and compact-together paths.
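
The two planning paths listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the planner in this PR: `DedicatedCompactPlanSketch`, `plan`, and `minFileNum` are hypothetical, files are represented by plain names, the map from data file to its dedicated files is assumed to be built beforehand, and the per-field grouping of blob files is left out. Only the `triggerNormalFile` flag is taken from the review discussion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class DedicatedCompactPlanSketch {

    /**
     * Returns groups of dedicated (blob or vector-store) file names to compact together.
     *
     * @param dataFiles regular data files in the compaction group
     * @param dedicatedByDataFile data file name to the dedicated files it contains
     * @param triggerNormalFile whether the regular data files are compacted in the same task
     * @param minFileNum hypothetical threshold: minimum files needed to plan a task
     */
    static List<List<String>> plan(
            List<String> dataFiles,
            Map<String, List<String>> dedicatedByDataFile,
            boolean triggerNormalFile,
            int minFileNum) {
        List<List<String>> tasks = new ArrayList<>();
        if (triggerNormalFile) {
            // The data files end up in one row-id range, so their dedicated
            // files may be merged across data files.
            List<String> all = new ArrayList<>();
            for (String dataFile : dataFiles) {
                all.addAll(dedicatedByDataFile.getOrDefault(
                        dataFile, Collections.<String>emptyList()));
            }
            if (all.size() >= minFileNum) {
                tasks.add(all);
            }
        } else {
            // The data files keep their own row-id ranges, so dedicated files
            // are only merged within the data file that contains them.
            for (String dataFile : dataFiles) {
                List<String> own = dedicatedByDataFile.getOrDefault(
                        dataFile, Collections.<String>emptyList());
                if (own.size() >= minFileNum) {
                    tasks.add(new ArrayList<>(own));
                }
            }
        }
        return tasks;
    }
}
```

With two data files that each hold one blob file and `minFileNum = 2`, this sketch plans no task when `triggerNormalFile` is false and a single two-file task when it is true. A single data file holding three blob files is still compacted on its own in the non-triggered path.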

Tests

  • JAVA_HOME=/opt/zulu8.68.0.21-ca-jdk8.0.362-macosx_aarch64 mvn -pl paimon-core spotless:apply
  • JAVA_HOME=/opt/zulu8.68.0.21-ca-jdk8.0.362-macosx_aarch64 mvn -pl paimon-core -Dtest=DataEvolutionCompactCoordinatorTest test

@leaves12138 leaves12138 marked this pull request as ready for review May 22, 2026 18:00
@leaves12138 leaves12138 force-pushed the codex/fix-de-blob-compact-range branch from b0abd46 to 6d20e27 May 22, 2026 18:09

@JingsongLi JingsongLi left a comment

Left comments:

  1. Vector store files have the same bug. Lines 374-383 still collect the vector store files from every data file in the group and compact them together, regardless of whether triggerNormalFile is true, so the same cross-file compaction problem applies to them. The fix should be applied symmetrically.
  2. The test only asserts the negative case. The new test testCompactPlannerDoesNotCompactBlobFilesAcrossDataFiles asserts tasks.isEmpty(). It would be stronger to also verify that with compactMinFileNum=2 (matching the 2 data files) the blob files DO get compacted together, which proves that both the "yes-compact" and "no-compact" paths work. The existing testCompactPlannerWithBlobFiles partially covers this, but the boundary is subtle.
  3. Edge case: a single data file with multiple blob files per field. When triggerNormalFile == false, the per-data-file blob compaction loop calls blobFileGroupsToCompact() for each data file individually. If a single data file has, say, 3 small blob files for the same field (from prior partial compactions or writes), the loop correctly compacts them. Good.
  4. Minor: the else branch iterates over all dataFiles and plans blob compaction per file. If dataFiles has, say, 5 files but only 2 have blob files, the loop still runs 5 iterations; getOrDefault(..., emptyList()) returns an empty list for the other 3, and blobFileGroupsToCompact([]) returns empty. This is harmless but slightly wasteful, and not worth fixing.

@leaves12138 leaves12138 changed the title from "[core] Avoid cross-file blob compaction for data evolution" to "[core] Avoid cross-file blob and vector compaction for data evolution" May 23, 2026
