GH-3530: Bypass Hadoop codec abstraction to optimize compression performance by iemejia · Pull Request #3570 · apache/parquet-java

iemejia · 2026-05-17T22:39:00Z

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Bypass the Hadoop CompressionCodec abstraction for all six supported codecs, eliminating per-page codec-pool lookups, stream-wrapper allocation, and unnecessary buffer copies in both CodecFactory and DirectCodecFactory.

Codec	Before	After
Snappy	Hadoop `SnappyCodec` stream wrappers	xerial `Snappy.compress`/`uncompress` direct calls
LZ4_RAW	Hadoop codec abstraction	airlift `Lz4Compressor`/`Lz4Decompressor` direct calls
ZSTD	Streaming `ZstdOutputStreamNoFinalizer`/`ZstdInputStreamNoFinalizer`	Reusable `ZstdCompressCtx`/`ZstdDecompressCtx` single-call APIs
GZIP	Hadoop `GzipCodec` with codec-pool overhead	JDK `GZIPOutputStream`/`GZIPInputStream` direct
LZO	GPL `com.hadoop.compression.lzo.LzoCodec`	aircompressor `LzoHadoopStreams` (Apache 2.0, wire-compatible)
Brotli	Abandoned `brotli-codec` (jbrotli, 2016, x86-only)	`brotli4j` 1.23.0 (10 platforms incl. aarch64, reflection-loaded)

Notable side effects:

LZO: Removes GPL dependency; uses Apache 2.0 aircompressor. Wire-compatible framing.
Brotli: Enables aarch64 support (Linux, macOS, Windows). Removes the Maven profile guards and test skips that excluded Brotli on aarch64.
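The "reflection-loaded" approach mentioned for Brotli can be illustrated with a minimal sketch. This is not the PR's code: `OptionalCodecLookup` and `findStatic` are hypothetical names, and the helper only shows the general pattern of resolving a static method at runtime so that the library stays an optional dependency.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Optional;

/** Hypothetical helper (not from the PR): resolves a static method on an optional dependency. */
final class OptionalCodecLookup {
  private OptionalCodecLookup() {}

  /**
   * Returns the public static method if the class is on the classpath and the
   * method exists, otherwise an empty Optional instead of throwing.
   */
  static Optional<Method> findStatic(String className, String methodName, Class<?>... paramTypes) {
    try {
      Class<?> type = Class.forName(className);
      Method method = type.getMethod(methodName, paramTypes);
      // Instance methods are rejected: callers expect to invoke with a null receiver.
      return Modifier.isStatic(method.getModifiers()) ? Optional.of(method) : Optional.empty();
    } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
      // Missing jar, missing method, or a failed native/static initializer:
      // treat the codec as unavailable.
      return Optional.empty();
    }
  }
}
```

Assuming brotli4j exposes its encoder as `com.aayushatharva.brotli4j.encoder.Encoder` with a static `compress(byte[])` method, a caller could try `findStatic("com.aayushatharva.brotli4j.encoder.Encoder", "compress", byte[].class)` and report the codec as unsupported when the result is empty.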

JMH benchmarks: CompressionBenchmark, CpuReadBenchmark, CpuWriteBenchmark, FileReadBenchmark, FileWriteBenchmark, ConcurrentReadWriteBenchmark.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64.

End-to-end file write (100K rows, SingleShotTime, ms/op, lower is better; each cell shows before -> after):

Codec	V1 dict=true (speedup)	V2 dict=true	V2 Speedup
SNAPPY	50.6 -> 40.9 (1.24x)	69.7 -> 38.7	1.80x
ZSTD	52.3 -> 43.6 (1.20x)	70.7 -> 40.6	1.74x
LZ4_RAW	49.6 -> 41.3 (1.20x)	70.2 -> 39.0	1.80x
GZIP	149.9 -> 119.3 (1.26x)	123.4 -> 67.6	1.83x
BROTLI	55.4 -> 46.8 (1.18x)	72.8 -> 41.8	1.74x
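The speedup figures in this table are the ratio of the "before" time to the "after" time. A small sketch of that arithmetic, using a hypothetical `Speedup` helper that is not part of the PR or of JMH:

```java
import java.util.Locale;

/** Hypothetical helper: derives the speedup factor from two ms/op measurements. */
final class Speedup {
  private Speedup() {}

  /** For a lower-is-better metric, speedup = before / after. */
  static double ratio(double beforeMs, double afterMs) {
    return beforeMs / afterMs;
  }

  /** Formats the ratio with two decimals, e.g. "1.24x". */
  static String format(double beforeMs, double afterMs) {
    return String.format(Locale.ROOT, "%.2fx", ratio(beforeMs, afterMs));
  }
}
```

For the SNAPPY row, `Speedup.format(50.6, 40.9)` gives `1.24x` for V1 and `Speedup.format(69.7, 38.7)` gives `1.80x` for V2, matching the table.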

End-to-end file read (ms/op, lower is better):

Codec	V1 Speedup	V2 Speedup
SNAPPY	1.50x	1.61x
ZSTD	1.49x	1.60x
LZ4_RAW	1.23x	1.57x
GZIP	1.47x	1.49x
BROTLI	1.83x	1.91x

Raw codec throughput (DirectCodecFactory): Snappy/ZSTD/LZ4/GZIP unchanged (this path already had native access). Brotli decompression improved 2.3-2.7x, because brotli4j is substantially faster than jbrotli.

V2 shows consistently larger speedups than V1 because V2 encoding produces more, smaller pages, meaning more codec invocations per file where the per-invocation Hadoop overhead accumulates.

…n performance

Bypass Hadoop CompressionCodec for Snappy (xerial JNI), LZ4_RAW (airlift), ZSTD (zstd-jni), GZIP (JDK), LZO (airlift), and BROTLI (brotli4j) in both CodecFactory and DirectCodecFactory, eliminating per-page codec pool lookups, stream wrapper allocation, and unnecessary buffer copies.

SNAPPY: direct byte-array JNI calls via Snappy.compress/uncompress, avoiding the Hadoop stream abstraction and intermediate direct ByteBuffer copies.

ZSTD: replace streaming ZstdOutputStreamNoFinalizer/ZstdInputStreamNoFinalizer with reusable ZstdCompressCtx/ZstdDecompressCtx single-call APIs. Supports compression level and multi-threaded workers configuration.

LZ4_RAW: direct airlift Lz4Compressor/Lz4Decompressor with reusable direct ByteBuffers, bypassing Hadoop's NonBlockedCompressor overhead.

GZIP: bypass Hadoop's GzipCodec and codec-pool/stream-wrapper overhead with direct JDK GZIPOutputStream/GZIPInputStream. Compression level is read from the existing "zlib.compress.level" Hadoop configuration key.

LZO: use aircompressor's LzoHadoopStreams directly, bypassing the GPL-licensed com.hadoop.compression.lzo.LzoCodec. Wire-compatible with Hadoop's LzoCodec.

BROTLI: migrate from jbrotli (unmaintained) to brotli4j via reflection, using single-call Encoder.compress/Decoder.decompress byte-array APIs.

End-to-end interop tests (TestCompressionInterop) validate that files written with the old Hadoop CompressionCodec path are readable by the new direct path and vice versa, for all 6 codecs including multi-row-group scenarios.
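As an illustration of the GZIP path described above, here is a minimal sketch of compressing and decompressing a page with JDK streams only, with no Hadoop codec pool involved. It is not the PR's implementation: `DirectGzip` is a hypothetical class, and the `level` parameter is assumed to be an integer zlib level (0-9, or -1 for the default) that the caller has already resolved from configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper (not from the PR): GZIP round trip using only java.util.zip. */
final class DirectGzip {
  private DirectGzip() {}

  /** Compresses one page at the given zlib level (0-9, or -1 for the default). */
  static byte[] compress(byte[] page, int level) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream(Math.max(32, page.length / 2));
    // GZIPOutputStream has no level parameter, so the protected Deflater field
    // `def` is adjusted from an anonymous subclass before any data is written.
    try (GZIPOutputStream gzip = new GZIPOutputStream(sink) {
      {
        def.setLevel(level);
      }
    }) {
      gzip.write(page);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink.toByteArray();
  }

  /** Decompresses a complete GZIP member back into the original page bytes. */
  static byte[] decompress(byte[] compressed) {
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      return gzip.readAllBytes();
    } catch (IOException e) {
      // Corrupt or truncated input surfaces here.
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the output is a standard GZIP stream, any GZIP-compliant reader can decode it. The sketch allocates fresh streams per call; a production path would likely reuse buffers to avoid per-page allocation.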

This was referenced May 17, 2026

Optimize compression hot paths: bypass Hadoop codec abstraction for Snappy, LZ4_RAW, ZSTD, and GZIP #3555

Closed

GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar #3512

Closed

Apache Parquet Java Performance Improvements #3530

Open

iemejia force-pushed the parquet-perf-v2-par6-compression branch from a413752 to fe836a6 Compare May 18, 2026 18:09

iemejia force-pushed the parquet-perf-v2-par6-compression branch from fe836a6 to 0571b9e Compare May 21, 2026 07:06

iemejia wants to merge 1 commit into apache:master from iemejia:parquet-perf-v2-par6-compression
