[python] Implement partition expiration for pypaimon#7918
Conversation
a74bf8c to
30cba0e
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
Review: [python] Implement partition expiration for pypaimon
Good work porting the partition expiration logic to pypaimon. The architecture is clean and the strategy pattern mirrors the Java implementation well. A few issues I noticed:
Bug: Dead code in _parse_with_formatter (partition_time_extractor.py)
The except block in _parse_with_formatter calls datetime.strptime(timestamp_string, formatter) again — the exact same call that just raised ValueError. This will always re-raise. The Java version uses a different formatter in the fallback (LocalDate.parse vs LocalDateTime.parse). To match the Java behavior, the fallback should attempt date-only parsing with a separate format.
Design: Java formatter conversion (_JAVA_TO_PYTHON_PATTERNS) is fragile
SSS/SS->%f: Python's%fexpects 6 digits (microseconds), while Java'sSSSis 3-digit millis. This is a subtle semantic mismatch.- Ordering risk: Sequential
str.replace()can produce incorrect results for overlapping tokens.
Consider a regex-based tokenizer or documenting the constraints.
Correctness: PartitionUpdateTimeExpireStrategy with last_file_creation_time=0
When last_file_creation_time is 0, the partition will always be considered expired. Consider adding a guard or log warning.
Scope: Unrelated methods in file_store_table.py
The statistics() and analyze() methods appear unrelated to partition expiration and should be in their own PR.
Minor issues
_parse_durationtype hint saysstrbut tests passNone— addOptional[str].- No
expire-batch-sizesupport (Java has this to prevent OOM on large drops). - Verify
truncate_partitionshas same semantics as Java'scommit.dropPartitions(...). - Missing
.donepartition handling for metastore-managed tables.
Tests
Thorough coverage. Suggest adding a test for last_file_creation_time=0 to document the "always expired" behavior.
Overall solid work. The main actionable fix is the _parse_with_formatter bug.
Add automatic partition expiration support with two strategies: - values-time: parse partition field values as timestamps - update-time: use last file creation time from manifests New module pypaimon/partition/ with: - PartitionTimeExtractor: extracts datetime from partition values - PartitionExpireStrategy: abstract base + two implementations - PartitionExpire: orchestration class reading manifests and dropping partitions Also adds: - Table-level API: table.expire_partitions() - CLI command: paimon table expire-partitions - Partition expiration options in CoreOptions - Unit tests (28 tests)
- Fix dead code in _parse_with_formatter: fallback now strips time directives and retries as date-only, matching Java's LocalDate.parse - Replace fragile str.replace() with regex tokenizer for Java→Python format conversion to avoid overlapping token issues - Skip partitions with last_file_creation_time=0 in update-time strategy to prevent false expiration of unknown-state partitions - Remove unrelated statistics()/analyze() methods from file_store_table - Fix _parse_duration type hint to Optional[str] - Add tests for date-only fallback and zero-creation-time guard
30cba0e to
75ae6e8
Compare
|
Thanks @JingsongLi for the review, I've addressed all comments. PTAL~ |
Purpose
Add pypaimon support for expiring partitions based on table options and CLI overrides, matching the Java partition expiration behavior for
values-timeandupdate-timestrategies.This includes:
FileStoreTable.DateTimeFormatter-style timestamp formatter support.table expire-partitionsCLI command.CoreOptions.partition.default-name.partition.expiration-max-num=100.Tests
Added
paimon-python/pypaimon/tests/partition_expire_test.pycovering:yyyyMMdd.values-time.update-time.