
[python] Implement partition expiration for pypaimon#7918

Open
JunRuiLee wants to merge 2 commits into apache:master from JunRuiLee:pypaimon-expire-partitions


@JunRuiLee
Contributor

Purpose

Add pypaimon support for expiring partitions based on table options and CLI overrides, matching the Java partition expiration behavior for values-time and update-time strategies.

This includes:

  • Partition expiration orchestration for FileStoreTable.
  • Partition entry discovery from manifest metadata.
  • Time extraction from partition values.
  • Java DateTimeFormatter-style timestamp formatter support.
  • table expire-partitions CLI command.
  • Partition expiration options in Python CoreOptions.
  • Null partition handling via partition.default-name.
  • Default partition.expiration-max-num=100.
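
To make the values-time flow concrete, here is a minimal sketch of how expired partitions could be selected. It is not the PR's code: the function name `select_expired`, the single partition field `dt`, the `strptime` format, and the decision to skip null partitions are all assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumed value of partition.default-name; treated here as "no usable time".
DEFAULT_PARTITION_NAME = "__DEFAULT_PARTITION__"


def select_expired(
    partitions: List[Dict[str, str]],
    now: datetime,
    expiration_time: timedelta,
    field: str = "dt",              # hypothetical partition field
    fmt: str = "%Y-%m-%d",          # hypothetical, already-converted format
    default_name: str = DEFAULT_PARTITION_NAME,
    max_num: int = 100,             # mirrors partition.expiration-max-num
) -> List[Dict[str, str]]:
    """Return partitions whose value-derived time is older than the cutoff."""
    cutoff = now - expiration_time
    expired: List[Dict[str, str]] = []
    for spec in partitions:
        value = spec.get(field)
        if value is None or value == default_name:
            continue  # null partition: nothing to compare (assumption)
        try:
            partition_time = datetime.strptime(value, fmt)
        except ValueError:
            continue  # unparseable values are skipped, never expired
        if partition_time < cutoff:
            expired.append(spec)
        if len(expired) >= max_num:
            break  # cap the number of partitions expired in one run
    return expired
```

The cap is checked after each append, so a single run never returns more than `max_num` partitions even when many more are eligible.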

Tests

Added paimon-python/pypaimon/tests/partition_expire_test.py covering:

  • Extracting partition time from simple date and timestamp values.
  • Extracting time with timestamp patterns across multiple partition fields.
  • Java formatter conversion cases like yyyyMMdd.
  • Selecting expired partitions with values-time.
  • Selecting expired partitions with update-time.
  • Skipping unparseable partition values.
  • Respecting expiration check interval.
  • Respecting max expire count.
  • Empty partition lists.
  • Unconfigured expiration.
  • Non-partitioned tables.
  • Duration parsing for days, hours, minutes, seconds, milliseconds, and combined durations.
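
As an illustration of the duration cases listed above, the following sketch parses strings such as `7 d` or `1 h 30 min`. The accepted unit spellings and the name `parse_duration` are assumptions; the PR's `_parse_duration` may accept a different grammar.

```python
import re
from datetime import timedelta
from typing import Optional

# Assumed unit spellings; the real implementation may support more aliases.
_UNITS = {
    "d": "days",
    "h": "hours",
    "min": "minutes",
    "s": "seconds",
    "ms": "milliseconds",
}
# One "<number> <unit>" token; longer unit names are listed before "s".
_TOKEN = re.compile(r"\s*(\d+)\s*(ms|min|d|h|s)\s*")


def parse_duration(text: Optional[str]) -> Optional[timedelta]:
    """Parse one or more '<number> <unit>' tokens into a timedelta."""
    if text is None or not text.strip():
        return None  # unconfigured expiration
    s = text.strip().lower()
    pos, total = 0, timedelta()
    while pos < len(s):
        m = _TOKEN.match(s, pos)
        if m is None:
            raise ValueError(f"Cannot parse duration: {text!r}")
        total += timedelta(**{_UNITS[m.group(2)]: int(m.group(1))})
        pos = m.end()
    return total
```

Matching token by token from the current position, rather than scanning with `finditer`, means any leftover text raises an error instead of being silently ignored.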

@JunRuiLee JunRuiLee force-pushed the pypaimon-expire-partitions branch from a74bf8c to 30cba0e Compare May 20, 2026 11:02
Contributor

@JingsongLi JingsongLi left a comment


Review: [python] Implement partition expiration for pypaimon

Good work porting the partition expiration logic to pypaimon. The architecture is clean and the strategy pattern mirrors the Java implementation well. A few issues I noticed:


Bug: Dead code in _parse_with_formatter (partition_time_extractor.py)

The except block in _parse_with_formatter calls datetime.strptime(timestamp_string, formatter) again, which is the exact same call that just raised ValueError, so the fallback can never succeed and the error always propagates. The Java version uses a different parser in the fallback (LocalDate.parse vs LocalDateTime.parse). To match the Java behavior, the fallback should attempt date-only parsing with a separate format.
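
A minimal sketch of the suggested fallback, assuming the date-only format is supplied separately. The function name `parse_with_fallback` and the `date_formatter` parameter are hypothetical and not taken from the PR.

```python
from datetime import datetime


def parse_with_fallback(
    timestamp_string: str,
    formatter: str,
    date_formatter: str = "%Y-%m-%d",  # hypothetical date-only format
) -> datetime:
    """Try the full timestamp format first, then a date-only format."""
    try:
        return datetime.strptime(timestamp_string, formatter)
    except ValueError:
        # Retry with a *different* format; a date-only value parses to
        # midnight, similar to LocalDate.parse(...).atStartOfDay() in Java.
        return datetime.strptime(timestamp_string, date_formatter)
```

If both attempts fail, the second `ValueError` propagates to the caller, which can then treat the partition as unparseable.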


Design: Java formatter conversion (_JAVA_TO_PYTHON_PATTERNS) is fragile

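  1. SSS/SS -> %f: Python's %f means microseconds (strptime accepts one to six digits and zero-pads on the right, while strftime always emits six), whereas Java's SSS is 3-digit millis. This is a subtle semantic mismatch.
  2. Ordering risk: Sequential str.replace() can produce incorrect results for overlapping tokens. For example, replacing yy before yyyy turns yyyy into %y%y.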

Consider a regex-based tokenizer or documenting the constraints.
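
One possible shape for such a tokenizer is sketched below. The token table is a small assumed subset rather than the PR's `_JAVA_TO_PYTHON_PATTERNS`, and Java quoted literals such as `'T'` are not handled.

```python
import re

# Assumed subset of Java DateTimeFormatter tokens. "%" is included so that a
# literal percent sign in the input is escaped for strptime/strftime.
_JAVA_TO_PYTHON = {
    "yyyy": "%Y",
    "yy": "%y",
    "MM": "%m",
    "dd": "%d",
    "HH": "%H",
    "mm": "%M",
    "ss": "%S",
    "SSS": "%f",
    "%": "%%",
}
# Longest tokens first, so "yyyy" wins over "yy" at the same position.
_TOKEN_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(_JAVA_TO_PYTHON, key=len, reverse=True))
)


def java_to_python_format(pattern: str) -> str:
    """Convert a Java-style pattern in a single left-to-right pass."""
    return _TOKEN_RE.sub(lambda m: _JAVA_TO_PYTHON[m.group(0)], pattern)
```

Because the input is scanned once and replacements are never rescanned, text emitted for one token cannot be rewritten by a later rule, which is the failure mode of chained `str.replace()` calls.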


Correctness: PartitionUpdateTimeExpireStrategy with last_file_creation_time=0

When last_file_creation_time is 0, the partition will always be considered expired, because epoch 0 is older than any expiration cutoff. Consider adding a guard or logging a warning.
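
The guard could look roughly like the sketch below. `PartitionEntry` and `select_expired_by_update_time` are hypothetical stand-ins, and the timestamps are assumed to be epoch milliseconds.

```python
from typing import Dict, List, NamedTuple


class PartitionEntry(NamedTuple):
    """Hypothetical stand-in for a partition entry read from manifests."""
    spec: Dict[str, str]
    last_file_creation_time: int  # epoch millis (assumed); 0 means unknown


def select_expired_by_update_time(
    entries: List[PartitionEntry],
    now_millis: int,
    expiration_millis: int,
) -> List[PartitionEntry]:
    """Select partitions whose newest file is older than the cutoff."""
    cutoff = now_millis - expiration_millis
    expired = []
    for entry in entries:
        if entry.last_file_creation_time <= 0:
            continue  # unknown creation time: do not treat as expired
        if entry.last_file_creation_time < cutoff:
            expired.append(entry)
    return expired
```

Skipping rather than expiring is the conservative choice here, since a missing timestamp says nothing about how old the data actually is.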


Scope: Unrelated methods in file_store_table.py

The statistics() and analyze() methods appear unrelated to partition expiration and should be in their own PR.


Minor issues

  1. _parse_duration type hint says str but tests pass None — add Optional[str].
  2. No expire-batch-size support (Java has this to prevent OOM on large drops).
  3. Verify truncate_partitions has same semantics as Java's commit.dropPartitions(...).
  4. Missing .done partition handling for metastore-managed tables.
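
For the batch-size point, a simple chunking helper would be enough to bound the size of each drop. This is only a sketch: `batched` is a hypothetical helper, and how each batch is actually dropped is left to the caller.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each yielded batch could then be passed to one drop call, so no single commit has to carry every expired partition at once.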

Tests

Thorough coverage. Suggest adding a test for last_file_creation_time=0 to document the "always expired" behavior.

Overall solid work. The main actionable fix is the _parse_with_formatter bug.

JunRuiLee added 2 commits May 24, 2026 01:31
Add automatic partition expiration support with two strategies:
- values-time: parse partition field values as timestamps
- update-time: use last file creation time from manifests

New module pypaimon/partition/ with:
- PartitionTimeExtractor: extracts datetime from partition values
- PartitionExpireStrategy: abstract base + two implementations
- PartitionExpire: orchestration class reading manifests and dropping partitions

Also adds:
- Table-level API: table.expire_partitions()
- CLI command: paimon table expire-partitions
- Partition expiration options in CoreOptions
- Unit tests (28 tests)
- Fix dead code in _parse_with_formatter: fallback now strips time
  directives and retries as date-only, matching Java's LocalDate.parse
- Replace fragile str.replace() with regex tokenizer for Java→Python
  format conversion to avoid overlapping token issues
- Skip partitions with last_file_creation_time=0 in update-time strategy
  to prevent false expiration of unknown-state partitions
- Remove unrelated statistics()/analyze() methods from file_store_table
- Fix _parse_duration type hint to Optional[str]
- Add tests for date-only fallback and zero-creation-time guard
@JunRuiLee JunRuiLee force-pushed the pypaimon-expire-partitions branch from 30cba0e to 75ae6e8 Compare May 23, 2026 17:37
@JunRuiLee
Contributor Author

Thanks @JingsongLi for the review, I've addressed all comments. PTAL~
