GH-3530: Bypass Hadoop codec abstraction to optimize compression performance #3570

Open
iemejia wants to merge 1 commit into apache:master from iemejia:parquet-perf-v2-par6-compression

@iemejia iemejia commented May 17, 2026

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Bypass the Hadoop CompressionCodec abstraction for all six supported codecs, eliminating per-page codec-pool lookups, stream-wrapper allocation, and unnecessary buffer copies in both CodecFactory and DirectCodecFactory.

| Codec | Before | After |
| --- | --- | --- |
| Snappy | Hadoop SnappyCodec stream wrappers | xerial Snappy.compress/uncompress direct calls |
| LZ4_RAW | Hadoop codec abstraction | airlift Lz4Compressor/Lz4Decompressor direct |
| ZSTD | Streaming ZstdOutputStreamNoFinalizer/ZstdInputStreamNoFinalizer | Reusable ZstdCompressCtx/ZstdDecompressCtx single-call APIs |
| GZIP | Hadoop GzipCodec with codec-pool overhead | JDK GZIPOutputStream/GZIPInputStream direct |
| LZO | GPL com.hadoop.compression.lzo.LzoCodec | aircompressor LzoHadoopStreams (Apache 2.0, wire-compatible) |
| Brotli | Abandoned brotli-codec (jbrotli, 2016, x86-only) | brotli4j 1.23.0 (10 platforms incl. aarch64, reflection-loaded) |

Notable side effects:

  • LZO: Removes GPL dependency; uses Apache 2.0 aircompressor. Wire-compatible framing.
  • Brotli: Enables aarch64 support (Linux, macOS, Windows). Removes non-aarch64 Maven profile guards and test skips.

JMH benchmarks: CompressionBenchmark, CpuReadBenchmark, CpuWriteBenchmark, FileReadBenchmark, FileWriteBenchmark, ConcurrentReadWriteBenchmark.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64.

End-to-end file write (100K rows, SingleShotTime, ms/op, lower is better):

| Codec | V1 dict=true (before -> after, speedup) | V2 dict=true (before -> after) | V2 speedup |
| --- | --- | --- | --- |
| SNAPPY | 50.6 -> 40.9 (1.24x) | 69.7 -> 38.7 | 1.80x |
| ZSTD | 52.3 -> 43.6 (1.20x) | 70.7 -> 40.6 | 1.74x |
| LZ4_RAW | 49.6 -> 41.3 (1.20x) | 70.2 -> 39.0 | 1.80x |
| GZIP | 149.9 -> 119.3 (1.26x) | 123.4 -> 67.6 | 1.83x |
| BROTLI | 55.4 -> 46.8 (1.18x) | 72.8 -> 41.8 | 1.74x |
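
For readers checking the numbers: the speedup figures in the write table are consistent with a plain ratio of the "before" timing to the "after" timing. The sketch below recomputes a few of them; the class and method names are hypothetical and not part of the PR.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: recomputes speedup factors
// from the before/after timings (ms/op) reported in the table.
final class SpeedupCalc {

    /** Speedup factor = time before / time after; both must be positive. */
    static double speedup(double beforeMs, double afterMs) {
        if (beforeMs <= 0 || afterMs <= 0) {
            throw new IllegalArgumentException("timings must be positive");
        }
        return beforeMs / afterMs;
    }

    /** Formats the factor with two decimals, matching the table's style. */
    static String format(double beforeMs, double afterMs) {
        return String.format(Locale.ROOT, "%.2fx", speedup(beforeMs, afterMs));
    }

    public static void main(String[] args) {
        System.out.println("SNAPPY V1: " + format(50.6, 40.9));
        System.out.println("SNAPPY V2: " + format(69.7, 38.7));
        System.out.println("GZIP   V2: " + format(123.4, 67.6));
    }
}
```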

End-to-end file read (ms/op lower is better):

| Codec | V1 speedup | V2 speedup |
| --- | --- | --- |
| SNAPPY | 1.50x | 1.61x |
| ZSTD | 1.49x | 1.60x |
| LZ4_RAW | 1.23x | 1.57x |
| GZIP | 1.47x | 1.49x |
| BROTLI | 1.83x | 1.91x |

Raw codec throughput (DirectCodecFactory): Snappy, ZSTD, LZ4_RAW, and GZIP are unchanged because they already had native access. Brotli decompression improved 2.3-2.7x (brotli4j is substantially faster than jbrotli).

V2 shows consistently larger speedups than V1 because V2 encoding produces more, smaller pages, meaning more codec invocations per file where the per-invocation Hadoop overhead accumulates.

…n performance

Bypass Hadoop CompressionCodec for Snappy (xerial JNI), LZ4_RAW (airlift),
ZSTD (zstd-jni), GZIP (JDK), LZO (airlift), and BROTLI (brotli4j) in both
CodecFactory and DirectCodecFactory, eliminating per-page codec pool lookups,
stream wrapper allocation, and unnecessary buffer copies.

SNAPPY: direct byte-array JNI calls via Snappy.compress/uncompress, avoiding
the Hadoop stream abstraction and intermediate direct ByteBuffer copies.

ZSTD: replace streaming ZstdOutputStreamNoFinalizer/ZstdInputStreamNoFinalizer
with reusable ZstdCompressCtx/ZstdDecompressCtx single-call APIs. Supports
compression level and multi-threaded workers configuration.

LZ4_RAW: direct airlift Lz4Compressor/Lz4Decompressor with reusable direct
ByteBuffers, bypassing Hadoop's NonBlockedCompressor overhead.

GZIP: bypass Hadoop's GzipCodec and codec-pool/stream-wrapper overhead with
direct JDK GZIPOutputStream/GZIPInputStream. Compression level is read from
the existing "zlib.compress.level" Hadoop configuration key.
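
A minimal sketch of what a direct JDK GZIP path can look like, using only java.util.zip. This is an illustration under assumptions, not the code in this change: the class name is hypothetical, the level is passed in as a plain int, and how the PR maps the "zlib.compress.level" configuration value to a level is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration, not the PR's implementation.
final class DirectGzipSketch {

    /** Compresses one page-sized buffer at the given Deflater level (0-9, or -1 for default). */
    static byte[] compress(byte[] input, int level) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(Math.max(32, input.length / 2));
        // GZIPOutputStream has no constructor taking a level, so the protected
        // Deflater field `def` is configured from an anonymous subclass.
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink) {
            {
                def.setLevel(level);
            }
        }) {
            gzip.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decompresses a complete GZIP member back into a byte array. */
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the output is a standard GZIP stream, any other GZIP reader should be able to decode it, which is the property the interop tests rely on.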

LZO: use aircompressor's LzoHadoopStreams directly, bypassing the GPL-licensed
com.hadoop.compression.lzo.LzoCodec. Wire-compatible with Hadoop's LzoCodec.

BROTLI: migrate from jbrotli (unmaintained) to brotli4j via reflection,
using single-call Encoder.compress/Decoder.decompress byte-array APIs.
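
To illustrate the reflection-loading idea in general terms: an optional codec can be looked up by class name at runtime so that the build has no hard dependency on it. The helper below is a hypothetical sketch, not the PR's code. The brotli4j class name in main is an assumption based on that library's public package layout, and brotli4j may also need its native library loaded before the first call.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Optional;

// Hypothetical illustration of reflective loading, not the PR's implementation.
final class ReflectiveCodecLoader {

    /** Finds a public static method taking a single byte[], or empty if the class or method is absent. */
    static Optional<Method> findStaticByteArrayMethod(String className, String methodName) {
        try {
            Class<?> type = Class.forName(className);
            Method method = type.getMethod(methodName, byte[].class);
            return Modifier.isStatic(method.getModifiers()) ? Optional.of(method) : Optional.empty();
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            // Optional dependency missing or unusable on this platform.
            return Optional.empty();
        }
    }

    /** Invokes a static single-argument method found by the lookup above. */
    static Object invoke(Method method, byte[] input) {
        try {
            return method.invoke(null, (Object) input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("codec call failed: " + method, e);
        }
    }

    public static void main(String[] args) {
        // Assumed class name; adjust if the library layout differs.
        Optional<Method> compress =
                findStaticByteArrayMethod("com.aayushatharva.brotli4j.encoder.Encoder", "compress");
        System.out.println(compress.isPresent() ? "brotli4j found" : "brotli4j not on classpath");
    }
}
```

Caching the resolved Method once per factory, rather than per page, keeps the reflective lookup off the hot path.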

End-to-end interop tests (TestCompressionInterop) validate that files written
with the old Hadoop CompressionCodec path are readable by the new direct path
and vice versa, for all 6 codecs including multi-row-group scenarios.
@iemejia iemejia force-pushed the parquet-perf-v2-par6-compression branch from fe836a6 to 0571b9e Compare May 21, 2026 07:06