Skip to content

[python] Implement table repair operation for pypaimon#7940

Closed
JunRuiLee wants to merge 1 commit into
apache:masterfrom
JunRuiLee:pypaimon-repair
Closed

[python] Implement table repair operation for pypaimon#7940
JunRuiLee wants to merge 1 commit into
apache:masterfrom
JunRuiLee:pypaimon-repair

Conversation

@JunRuiLee
Copy link
Copy Markdown
Contributor

@JunRuiLee JunRuiLee commented May 23, 2026

Purpose

Add metadata repair capability to pypaimon's filesystem catalog. The repair operation verifies the consistency chain (LATEST → snapshot → manifest list → manifest → data files) and can optionally fix a dangling LATEST file by rewriting it to the newest fully-valid snapshot.

Key behaviors:

  • Branch-qualified identifiers (db.table$branch_name) repair the correct branch
  • check_data_files=True verifies data file existence using proper partition path construction, respects custom partition.default-name, and skips DELETE entries
  • Fix mode only selects a snapshot whose entire manifest chain (and optionally data files) passes validation

Tests

21 unit tests in pypaimon/tests/repair_test.py

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this comprehensive repair implementation!

Architecture feedback:

  1. 1488 lines in a single PR is substantial. Consider whether this could be split into:

    • Part 1: RepairReport + TableRepair verification logic (read-only)
    • Part 2: Fix mode (the dry_run=False path)
    • Part 3: CLI integration
  2. check_data_files parameter: The PR description mentions it "respects custom partition.default-name", which is good. But data file verification in large tables could be extremely slow. Consider:

    • Adding a progress callback or logging interval
    • Documenting the expected time complexity (proportional to total data files, not table size)
  3. Repair scope: The repair_catalog method iterates all databases and tables. For large catalogs, this could be a very long-running operation. Consider whether there should be a parallelism option or at minimum a way to resume from failure.

  4. Error handling in fix mode: When dry_run=False, if we crash midway through the fix (e.g., after deleting some snapshot files but before rewriting LATEST), what's the recovery story? The repair should be idempotent — running it again should produce a valid state.

  5. Test coverage: 21 unit tests is good. Are there tests for the case where repair is interrupted mid-fix?

Minor: The repair_database and repair_catalog APIs return List[RepairReport] but the base class docstring only mentions that. Make sure there's consistency in the return type annotations.

@JunRuiLee
Copy link
Copy Markdown
Contributor Author

Thanks for the detailed review! I've split this PR into 3 parts as suggested:

  1. [python] Add table repair verification logic for pypaimon (Part 1/3) #7943 — Read-only verification logic (TableRepair.verify())
  2. [python] Add table repair fix mode and catalog integration (Part 2/3) #7944 — Fix mode + catalog integration (depends on Part 1)
  3. [python] Add CLI command for table repair (Part 3/3) #7945 — CLI command (depends on Part 2)

Also addressed the other feedback points:

  • Progress logging: Added logging.info every 1000 data files when check_data_files=True, and documented time complexity as O(total_data_files)
  • Resume-from-failure: Added per-table error isolation in repair_database — individual table failures are logged and skipped, so re-running after a crash continues from where it left off
  • Idempotency: The only fix operation (_fix_latest_file) performs a single atomic write. Re-running after interruption converges to valid state. Added docstring explaining the guarantee.
  • Test for interrupted mid-fix: Added test_repair_is_idempotent — runs repair twice, verifies second run is a no-op
  • Return type annotations: Added consistent -> RepairReport / -> List[RepairReport] annotations to catalog methods

Please merge in order: Part 1 → Part 2 → Part 3. Closing this PR in favor of the split.

@JunRuiLee JunRuiLee closed this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants