[parquet] Add map shredding for hot keys#7877

Open
Aitozi wants to merge 1 commit into apache:master from Aitozi:mwj-map-shredding

Conversation

Aitozi commented May 17, 2026

Purpose

Add Parquet map shredding support for MAP<STRING, T> columns.

This allows selected map columns to extract hot keys into independent physical Parquet columns while preserving the original logical map schema for readers. The feature is controlled by map.shredding.* options, aligned with the existing variant.shredding.* naming style. It also adds a focused round-trip test and a storage benchmark to validate the storage benefit.

Tests

  • mvn -pl paimon-api,paimon-format -Pfast-build -DskipTests compile
  • mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false -Dtest=ParquetFormatReadWriteTest#testMapShreddingRoundTrip,MapShreddingStorageBenchmark test
  • git diff --check

Physical Layout

This change does not introduce a new Parquet logical type and does not modify the standard Parquet MAP encoding. A shredded map is still written with the regular Parquet map group as the residual map. Hot keys are promoted into additional sibling sidecar columns in the parent Parquet group.

For example, a logical field:

headers MAP<STRING, STRING>

is normally written as:

message paimon_schema {
  optional group headers (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
}

With map shredding enabled, if user-agent and host are selected as hot keys, the physical Parquet schema becomes:

message paimon_schema {
  optional group headers (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }

  optional binary dynamic_column_headers_value_0 (STRING);
  optional binary dynamic_column_headers_value_1 (STRING);
}

The footer metadata records the mapping from sidecar columns to map keys:

parquet.meta.dynamic.column.map.keys.of.headers = user-agent,host

During writing, entries for promoted hot keys are omitted from the residual map when their values are non-null, and their values are written into the corresponding sidecar columns. During reading, Paimon reads both the residual map and the sidecar columns, then reconstructs the original logical MAP<STRING, T> value.
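
The write and read rules above can be sketched with plain Java collections. This is a minimal model, not Paimon's writer or reader: the class `MapShreddingSketch`, its `ShreddedRow` holder, and the method names are hypothetical, and values are restricted to strings. It only shows that promoting non-null hot-key values and merging them back yields the same logical map (entry order is not preserved in this sketch).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only: plain collections stand in for the Parquet writer and reader. */
final class MapShreddingSketch {

    /** One shredded row: a sidecar slot per hot key plus the residual map. */
    static final class ShreddedRow {
        final List<String> sidecarValues;   // slot i belongs to hotKeys.get(i); null means "not promoted"
        final Map<String, String> residual; // entries that stay in the regular map encoding

        ShreddedRow(List<String> sidecarValues, Map<String, String> residual) {
            this.sidecarValues = sidecarValues;
            this.residual = residual;
        }
    }

    /** Write path: move non-null hot-key values into sidecar slots, keep everything else. */
    static ShreddedRow shred(Map<String, String> row, List<String> hotKeys) {
        Map<String, String> residual = new LinkedHashMap<>(row);
        List<String> sidecarValues = new ArrayList<>();
        for (String hotKey : hotKeys) {
            String value = row.get(hotKey);
            sidecarValues.add(value);
            if (value != null) {
                // Only non-null values are promoted; a hot key mapped to null stays in the residual map.
                residual.remove(hotKey);
            }
        }
        return new ShreddedRow(sidecarValues, residual);
    }

    /** Read path: merge the sidecar values back into the residual map. */
    static Map<String, String> reconstruct(ShreddedRow shredded, List<String> hotKeys) {
        Map<String, String> result = new LinkedHashMap<>(shredded.residual);
        for (int i = 0; i < hotKeys.size(); i++) {
            String value = shredded.sidecarValues.get(i);
            if (value != null) {
                result.put(hotKeys.get(i), value);
            }
        }
        return result;
    }
}
```

In this model a hot key whose value is null is indistinguishable in the sidecar slot from an absent key, which is why such entries are left in the residual map.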

For nested maps, the same rule applies within the containing group. For example, for payload.headers, sidecar columns are added as siblings of the headers map inside the payload group, and the footer metadata uses the full logical path:

parquet.meta.dynamic.column.map.keys.of.payload.headers = user-agent,host

#7876

Aitozi force-pushed the mwj-map-shredding branch from 61967d4 to 5a5b5a5 on May 17, 2026 04:36
Aitozi force-pushed the mwj-map-shredding branch from 5a5b5a5 to 5f397f8 on May 17, 2026 04:38

Aitozi commented May 17, 2026

Benchmark command:

mvn -s ~/.m2/apache-community.xml -pl paimon-format -am -Pfast-build \
  -DfailIfNoTests=false -Dtest=MapShreddingStorageBenchmark test

Benchmark file: [MapShreddingStorageBenchmark.java]

Common Setup

  • Schema: id INT, headers MAP<STRING, STRING>
  • Rows: 100,000
  • Hot keys: 32
  • Value length: 16
  • Compression: snappy
  • Compared layouts:
    • regular: normal Parquet map encoding
    • mapShredding: promotes 32 hot keys from headers into sidecar columns
  • Map shredding options:
    • map.shredding.columns=headers
    • map.shredding.maxKeys=32
    • map.shredding.maxInferBufferRow=10000
    • map.shredding.maxInferBufferMemory=64 mb

Results

| Scenario | Regular | Map Shredding | Saved | Saving |
| --- | --- | --- | --- | --- |
| Columnar value storage | 708,012 bytes | 431,637 bytes | 276,375 bytes | 39.04% |
| Long hot key storage | 40,845,943 bytes | 16,365,106 bytes | 24,480,837 bytes | 59.93% |
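
The Saved and Saving figures are consistent with Saved = Regular - Map Shredding and Saving = Saved / Regular. A small sketch of that arithmetic follows; the class and method names are hypothetical and are not part of the benchmark.

```java
import java.util.Locale;

/** Recomputes the derived columns of the results from the two measured file sizes. */
final class SavingSketch {

    /** Bytes saved by the shredded layout relative to the regular layout. */
    static long savedBytes(long regularBytes, long shreddedBytes) {
        return regularBytes - shreddedBytes;
    }

    /** Saving as a percentage of the regular file size, rounded to two decimals. */
    static String savingPercent(long regularBytes, long shreddedBytes) {
        double percent = 100.0 * savedBytes(regularBytes, shreddedBytes) / regularBytes;
        return String.format(Locale.ROOT, "%.2f%%", percent);
    }
}
```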

Scenario Details

  • Columnar value storage: key names are short, values follow a repeated pattern with valueRunLength=128 and valueCardinality=4, dictionary encoding enabled. This measures whether promoted hot-key values benefit from columnar and dictionary encoding.
  • Long hot key storage: hot key names include 128 bytes of padding, dictionary encoding disabled. This measures the benefit of avoiding repeated long map-key strings in every row.

Conclusion: in this synthetic storage benchmark, map shredding reduces file size in both cases. The biggest gain appears when hot map keys are long and repeated across many rows, saving about 59.93%.

JingsongLi commented

This looks very suitable to be solved using Variant, why not?


Aitozi commented May 18, 2026

> This looks very suitable to be solved using Variant, why not?

Hi @JingsongLi, here are the two reasons we are considering adding shredding for the map type:

  1. The map type is widely used internally, and migrating to variant would require changing a large number of downstream functions.
  2. Regarding storage savings, the variant type does not offer a significant advantage. Based on our benchmark results, variant reduces storage usage only minimally compared with dynamic column shredding of the map.

JingsongLi commented

> Hi @JingsongLi, here are the two reasons we are considering adding shredding for the map type:
>
>   1. The map type is widely used internally, and migrating to variant would require changing a large number of downstream functions.
>   2. Regarding storage savings, the variant type does not offer a significant advantage. Based on our benchmark results, variant reduces storage usage only minimally compared with dynamic column shredding of the map.

Can you share some benchmarks? Specifically, the storage difference between map and variant that you mentioned.

JingsongLi left a comment

Review: [parquet] Add map shredding for hot keys

Overall this is a well-structured feature that extends the Parquet format layer to promote frequently-occurring map keys into independent columnar sidecar columns. The physical layout is clean, the metadata round-trip through footer key-value pairs is self-contained, and the tests cover multiple value types (STRING, INT, VARIANT) plus nested maps. A few observations follow.


Correctness Concerns

1. Comma in hot key names breaks metadata parsing

Hot keys are serialized into footer metadata as a comma-separated string (String.join(",", keys) in toFooterMetadata), and deserialized using the same parseColumnPaths that splits on commas. If a map key legitimately contains a comma (e.g. "Content-Type, charset"), it will be incorrectly split during reads. Consider using a delimiter that is less likely to appear in map keys or adding proper escaping (e.g. backslash-escape or JSON array encoding).
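
As one possible shape for the escaping option, here is a minimal sketch that backslash-escapes commas and backslashes before joining. `HotKeyCodec` and its methods are hypothetical and are not part of the PR; the sketch also assumes the key list does not consist of a single empty string, because that case encodes to the same text as an empty list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical codec: comma-joined keys with backslash escaping for ',' and '\'. */
final class HotKeyCodec {

    static String encode(List<String> keys) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.size(); i++) {
            if (i > 0) {
                sb.append(','); // unescaped comma separates keys
            }
            for (char c : keys.get(i).toCharArray()) {
                if (c == '\\' || c == ',') {
                    sb.append('\\'); // escape characters that would otherwise be ambiguous
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }

    static List<String> decode(String encoded) {
        List<String> keys = new ArrayList<>();
        if (encoded.isEmpty()) {
            return keys; // assumption: empty text means "no hot keys"
        }
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '\\' && i + 1 < encoded.length()) {
                current.append(encoded.charAt(++i)); // take the escaped character literally
            } else if (c == ',') {
                keys.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        keys.add(current.toString());
        return keys;
    }
}
```

Keys that contain neither a comma nor a backslash encode to the same text as a plain `String.join(",", keys)`, so the footer value for `user-agent,host` would be unchanged under this scheme.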

2. MapSidecarWriter.valueIndex mutable state coupling

In ParquetRowDataWriter, MapSidecarWriter.findValueIndex() stores the found index in a mutable field valueIndex, which is later consumed by write(). The correctness depends on RowFieldWriter.shouldWrite() always being called immediately before write() for the same row. If this invariant is ever violated (e.g. a future refactor), the wrong value or a stale index would be written silently. A safer pattern would be to pass the found index explicitly, or recompute it in write().
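
A minimal sketch of the "pass the index explicitly" pattern, using a hypothetical `SidecarWriteSketch` class with string arrays in place of Paimon's row and writer types. The lookup result lives in a local variable, so no call ordering between separate methods is required.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sidecar writer for one hot key; keeps no per-row state between calls. */
final class SidecarWriteSketch {
    private final String hotKey;
    final List<String> written = new ArrayList<>(); // stands in for the sidecar column

    SidecarWriteSketch(String hotKey) {
        this.hotKey = hotKey;
    }

    /** Returns the position of the hot key if it has a non-null value, otherwise -1. */
    int findValueIndex(String[] keys, String[] values) {
        for (int i = 0; i < keys.length; i++) {
            if (hotKey.equals(keys[i]) && values[i] != null) {
                return i;
            }
        }
        return -1;
    }

    /** The index is a parameter, so a stale value from an earlier row cannot be reused. */
    void write(String[] values, int valueIndex) {
        written.add(values[valueIndex]);
    }

    void writeRow(String[] keys, String[] values) {
        int valueIndex = findValueIndex(keys, values);
        if (valueIndex >= 0) {
            write(values, valueIndex);
        } else {
            written.add(null); // nothing promoted for this row
        }
    }
}
```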

3. InternalRowToSizeVisitor allocated per-row in the extractor

In MapShreddingKeyExtractor.add():

BiFunction<DataGetters, Integer, Integer> valueSizer =
        column.valueType().accept(new InternalRowToSizeVisitor());

This creates a new visitor object on every call to add() for every column. Since the visitor is stateless and depends only on column.valueType() (which is fixed), it should be pre-computed once per column during construction and cached in ResolvedMapShreddingColumn or in the extractor itself. For 10,000 buffered rows × N columns, this adds up to a significant amount of unnecessary allocation.
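
The caching idea can be sketched as follows. All names here (`CachedSizerSketch`, `Column`, `createSizer`) are hypothetical stand-ins; the real code would keep using `InternalRowToSizeVisitor`, and the size estimates below are placeholders.

```java
import java.util.function.ToIntFunction;

/** Hypothetical model: the per-column sizer is built once in the constructor and reused per row. */
final class CachedSizerSketch {
    static int sizerCreations = 0; // counts how often a sizer is built, for illustration only

    /** Stand-in for valueType.accept(new InternalRowToSizeVisitor()). */
    static ToIntFunction<Object> createSizer(String valueType) {
        sizerCreations++;
        if ("STRING".equals(valueType)) {
            return value -> ((String) value).length(); // placeholder estimate: character count
        }
        return value -> 4; // placeholder estimate for fixed-width types
    }

    static final class Column {
        final String name;
        final ToIntFunction<Object> valueSizer; // cached once per column

        Column(String name, String valueType) {
            this.name = name;
            this.valueSizer = createSizer(valueType);
        }
    }

    /** Per-row path: no allocation, only a call through the cached sizer. */
    static int sizeOf(Column column, Object value) {
        return value == null ? 0 : column.valueSizer.applyAsInt(value);
    }
}
```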


Design Observations

4. Hot key selection metric: total byte size vs. frequency

The extractor ranks keys by total accumulated value size (sizes.merge(key.toString(), valueSize, Long::sum)). This means a rare key with a single 10 KB value can outrank a key that appears in 99% of rows but has small values. The storage benefit of shredding comes primarily from eliminating repeated key strings and improving columnar encoding. A frequency-based metric, or a hybrid that also counts the repeated key bytes (e.g. frequency * key_size + total_value_size, since frequency * average_value_size on its own equals the current total), might give better compression in practice. Worth documenting the trade-off in the config description at minimum.
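
To make the difference concrete, the sketch below ranks the same statistics two ways: by total value bytes and by frequency. The class `HotKeyRankingSketch` and its helpers are hypothetical and only model the selection step with string maps.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Hypothetical model of hot-key selection under different ranking metrics. */
final class HotKeyRankingSketch {

    static final class KeyStats {
        long count;           // number of rows containing the key
        long totalValueBytes; // accumulated UTF-8 size of the key's values
    }

    static Map<String, KeyStats> collect(List<Map<String, String>> rows) {
        Map<String, KeyStats> stats = new HashMap<>();
        for (Map<String, String> row : rows) {
            for (Map.Entry<String, String> entry : row.entrySet()) {
                KeyStats s = stats.computeIfAbsent(entry.getKey(), k -> new KeyStats());
                s.count++;
                if (entry.getValue() != null) {
                    s.totalValueBytes += entry.getValue().getBytes(StandardCharsets.UTF_8).length;
                }
            }
        }
        return stats;
    }

    /** Returns up to maxKeys keys, highest score first; ties are broken by key name. */
    static List<String> topBy(Map<String, KeyStats> stats, ToLongFunction<KeyStats> score, int maxKeys) {
        List<String> keys = new ArrayList<>(stats.keySet());
        keys.sort((a, b) -> {
            int byScore = Long.compare(score.applyAsLong(stats.get(b)), score.applyAsLong(stats.get(a)));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(keys.subList(0, Math.min(maxKeys, keys.size())));
    }
}
```

With 100 rows where `host` appears in 99 rows with a short value and `blob` appears once with a 10 KB value, `topBy(stats, s -> s.totalValueBytes, 1)` selects `blob`, while `topBy(stats, s -> s.count, 1)` selects `host`.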

5. Sidecar field ID collision risk

The sidecar column field IDs are computed as:

SpecialFields.getMapValueFieldId(field.id(), 1) + 1024 + i

The magic 1024 offset is not validated against the existing field ID space. For schemas with many fields or deeply nested types that already use high IDs, this could silently collide. Consider deriving IDs from a dedicated namespace or adding a collision check during schema conversion.
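
A collision check during schema conversion could look roughly like this. The sketch takes the result of `SpecialFields.getMapValueFieldId(field.id(), 1)` as a plain `int` parameter rather than calling it; `SidecarFieldIdSketch`, `assignSidecarIds`, and the `usedFieldIds` set are hypothetical.

```java
import java.util.Set;

/** Hypothetical helper: assigns sidecar field IDs and fails fast on any collision. */
final class SidecarFieldIdSketch {
    static final int SIDECAR_ID_OFFSET = 1024; // the offset used in the PR's expression

    /**
     * Assigns mapValueFieldId + 1024 + i to sidecar i and records each ID in usedFieldIds.
     * Throws if an ID is already taken or the addition overflows.
     */
    static int[] assignSidecarIds(int mapValueFieldId, int sidecarCount, Set<Integer> usedFieldIds) {
        int[] ids = new int[sidecarCount];
        for (int i = 0; i < sidecarCount; i++) {
            int id = Math.addExact(Math.addExact(mapValueFieldId, SIDECAR_ID_OFFSET), i);
            if (!usedFieldIds.add(id)) {
                throw new IllegalStateException("Sidecar field ID " + id + " collides with an existing field ID");
            }
            ids[i] = id;
        }
        return ids;
    }
}
```

Recording the assigned IDs in the same set also catches collisions between the sidecars of two different maps.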

6. reachTargetSize during the buffering phase

In MapShreddingFormatWriter.reachTargetSize(), when the delegate is null (still buffering), it returns stream.getPos() >= targetSize. But during buffering no data has been written to the stream yet (position is 0), so this always returns false. This is likely fine in practice since the buffer phase is bounded by maxInferBufferRow/maxInferBufferMemory, but the semantics are misleading. A comment explaining this would help future readers.

7. Schema cache bypass for shredded files

When dynamicMapKeys is non-empty, getOrCreateRequestedSchema skips the requestedSchemaCache and recomputes the schema every time. Since each file can have different hot keys, this is correct, but it could become a performance concern if many small files are read. Consider keying the cache on (fileSchema, dynamicMapKeys) if profiling shows this matters.
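
If profiling does show a cost, a composite cache key is one way to do it. The sketch below is a model only: schemas are represented as strings, `RequestedSchemaCacheSketch` and `CacheKey` are hypothetical, and the derived value is a placeholder for the real requested schema.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical cache keyed on (fileSchema, dynamicMapKeys) instead of fileSchema alone. */
final class RequestedSchemaCacheSketch {

    static final class CacheKey {
        final String fileSchema;
        final Map<String, List<String>> dynamicMapKeys;

        CacheKey(String fileSchema, Map<String, List<String>> dynamicMapKeys) {
            this.fileSchema = fileSchema;
            // Shallow copy so later changes to the caller's map do not alter the key.
            this.dynamicMapKeys = Collections.unmodifiableMap(new TreeMap<>(dynamicMapKeys));
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CacheKey)) {
                return false;
            }
            CacheKey other = (CacheKey) o;
            return fileSchema.equals(other.fileSchema) && dynamicMapKeys.equals(other.dynamicMapKeys);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fileSchema, dynamicMapKeys);
        }
    }

    private final Map<CacheKey, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger computations = new AtomicInteger(); // counts cache misses, for illustration

    String getOrCreateRequestedSchema(String fileSchema, Map<String, List<String>> dynamicMapKeys) {
        return cache.computeIfAbsent(new CacheKey(fileSchema, dynamicMapKeys), key -> {
            computations.incrementAndGet();
            return key.fileSchema + " + sidecars " + key.dynamicMapKeys; // placeholder derivation
        });
    }
}
```

Files that share both the schema and the hot-key list then reuse one entry, while files with different hot keys still get their own.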


Minor / Style

  • MapShreddingStorageBenchmark duplicates the option key strings as constants (MAP_SHREDDING_COLUMNS, etc.) instead of referencing CoreOptions.MAP_SHREDDING_COLUMNS.key(). Using the canonical keys avoids silent drift.
  • The hasResidualEntry method in MapWriter iterates all entries to check whether any should not be skipped. In the common case where the map is mostly composed of hot keys (which are skipped), this results in O(n) per row to determine if the repeated group should be opened at all. This is functionally correct but could be noted as a potential optimization point for large maps.
  • The option naming uses camelCase (maxInferBufferRow, maxInferBufferMemory) while the existing variant shredding options and most other Paimon options use lowercase, hyphenated names (max-infer-buffer-row). Aligning with the project convention would be more consistent.

Good work overall. The feature is well-isolated, the test coverage is thorough (especially testing Variant values inside shredded maps), and the benchmark demonstrates clear storage savings.
