[python] Implement table repair operation for pypaimon #7940
JingsongLi
left a comment
Thanks for this comprehensive repair implementation!
Architecture feedback:
- 1488 lines in a single PR is substantial. Consider whether this could be split into:
  - Part 1: `RepairReport` + `TableRepair` verification logic (read-only)
  - Part 2: Fix mode (the `dry_run=False` path)
  - Part 3: CLI integration
- `check_data_files` parameter: The PR description mentions it "respects custom `partition.default-name`", which is good. But data file verification in large tables could be extremely slow. Consider:
  - Adding a progress callback or logging interval
  - Documenting the expected time complexity (proportional to total data files, not table size)
- Repair scope: The `repair_catalog` method iterates all databases and tables. For large catalogs, this could be a very long-running operation. Consider whether there should be a parallelism option or at minimum a way to resume from failure.
- Error handling in fix mode: When `dry_run=False`, if we crash midway through the fix (e.g., after deleting some snapshot files but before rewriting LATEST), what's the recovery story? The repair should be idempotent: running it again should produce a valid state.
- Test coverage: 21 unit tests is good. Are there tests for the case where repair is interrupted mid-fix?
Minor: `repair_database` and `repair_catalog` return `List[RepairReport]`, but the base class states this only in its docstring. Make sure the return type annotations are consistent.
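To make the idempotency point concrete, here is a minimal sketch, not the PR's code. It assumes a local filesystem where `os.replace` is atomic, and that LATEST holds the snapshot id as plain text. The helper name `write_latest_atomically` and the `snapshot_dir` parameter are hypothetical.

```python
import os
import tempfile


def write_latest_atomically(snapshot_dir: str, snapshot_id: int) -> None:
    """Swap LATEST in a single rename so a crash leaves the old or the new file.

    Hypothetical helper: assumes LATEST lives in snapshot_dir and stores the
    snapshot id as text.
    """
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=snapshot_dir, prefix=".LATEST.")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(str(snapshot_id))
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, os.path.join(snapshot_dir, "LATEST"))
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Running the helper again with the same id gives the same result, so a retry after a crash is safe. Rewriting LATEST before any cleanup of invalid snapshot files would also mean an interrupted run never leaves LATEST pointing at a file that was just deleted. Object stores may not offer an atomic rename, so this sketch does not carry over to them unchanged.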
Thanks for the detailed review! I've split this PR into 3 parts as suggested:
Also addressed the other feedback points:
Please merge in order: Part 1 → Part 2 → Part 3. Closing this PR in favor of the split.
Purpose
Add metadata repair capability to pypaimon's filesystem catalog. The repair operation verifies the consistency chain (LATEST → snapshot → manifest list → manifest → data files) and can optionally fix a dangling LATEST file by rewriting it to the newest fully-valid snapshot.
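The chain check and the LATEST fix described above can be sketched roughly as follows. This is a toy in-memory model, not pypaimon's implementation: `TableState`, `Snapshot`, `snapshot_is_valid`, `newest_valid_snapshot`, and `repair_latest` are all hypothetical names, and real manifest lists and manifests are files on disk rather than dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Snapshot:
    """Hypothetical, simplified snapshot: only the manifest list it points at."""
    snapshot_id: int
    manifest_list: str


@dataclass
class TableState:
    """Toy stand-in for a table directory (an assumption, not pypaimon's model)."""
    latest: Optional[int]                 # snapshot id stored in LATEST
    snapshots: Dict[int, Snapshot]        # snapshot id -> snapshot
    manifest_lists: Dict[str, List[str]]  # manifest list -> manifest names
    manifests: Dict[str, List[str]]       # manifest -> data file names
    data_files: Set[str] = field(default_factory=set)


def snapshot_is_valid(state: TableState, snapshot_id: int,
                      check_data_files: bool = False) -> bool:
    """Walk snapshot -> manifest list -> manifest -> (optionally) data files."""
    snap = state.snapshots.get(snapshot_id)
    if snap is None:
        return False
    manifest_names = state.manifest_lists.get(snap.manifest_list)
    if manifest_names is None:
        return False
    for name in manifest_names:
        files = state.manifests.get(name)
        if files is None:
            return False
        if check_data_files and any(f not in state.data_files for f in files):
            return False
    return True


def newest_valid_snapshot(state: TableState,
                          check_data_files: bool = False) -> Optional[int]:
    """Return the highest snapshot id whose whole chain is intact, if any."""
    for snapshot_id in sorted(state.snapshots, reverse=True):
        if snapshot_is_valid(state, snapshot_id, check_data_files):
            return snapshot_id
    return None


def repair_latest(state: TableState, dry_run: bool = True,
                  check_data_files: bool = False) -> Optional[int]:
    """Report the id LATEST should hold; rewrite it only when dry_run is False."""
    if state.latest is not None and snapshot_is_valid(
            state, state.latest, check_data_files):
        return state.latest
    target = newest_valid_snapshot(state, check_data_files)
    if not dry_run and target is not None:
        state.latest = target
    return target
```

With `dry_run=True` the function only reports the target, which mirrors the read-only verification mode. Calling it again after a fix returns the same id without further changes.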
Key behaviors:
- [...] `db.table$branch_name`) repair the correct branch
- `check_data_files=True` verifies data file existence using proper partition path construction, respects custom `partition.default-name`, and skips DELETE entries
Tests
21 unit tests in `pypaimon/tests/repair_test.py`