feat(write): add DataFrame.writeParquet with ParquetWriteOptions by andygrove · Pull Request #27 · apache/datafusion-java

andygrove · 2026-05-13T14:34:44Z

Which issue does this PR close?

Closes #.

Rationale for this change

DataFusion's DataFrame::write_parquet is the natural sink for transformed data, but today the Java bindings have no way to write results back to Parquet — collect() into JVM-side Arrow batches is the only option. This blocks ETL-shaped workloads where read → transform → write needs to stay in native code.

What changes are included in this PR?

- New org.apache.datafusion.ParquetWriteOptions (fluent setters for compression and singleFileOutput), shaped to mirror the existing ParquetReadOptions.
- DataFrame.writeParquet(String) and DataFrame.writeParquet(String, ParquetWriteOptions) overloads. Both retain the DataFrame (clone on the Rust side), matching the existing count() / show() pattern; the receiver stays usable and must still be closed.
- One new JNI function Java_org_apache_datafusion_DataFrame_writeParquetWithOptions in native/src/lib.rs. Compression strings (e.g. "zstd(3)", "snappy", "uncompressed") are passed verbatim to DataFusion; invalid values surface as RuntimeException at write time.
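
To illustrate the API shape described above, here is a minimal, self-contained sketch. It does not use the real bindings: ParquetWriteOptionsSketch and DataFrameSketch are hypothetical stand-ins, the getter names and default values are assumptions (the PR text does not state them), and the native call is replaced by an in-memory log. It only shows the fluent setters, the two overloads sharing one code path, and the "receiver stays usable until closed" contract.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

/** Demo entry point; declared first so single-file source launch picks it up. */
final class WriteParquetSketch {
    public static void main(String[] args) {
        try (DataFrameSketch df = new DataFrameSketch()) {
            // No-options overload.
            df.writeParquet("out/lineitem");
            // Options overload; the receiver is still usable after the first write.
            df.writeParquet("out/lineitem.parquet",
                    new ParquetWriteOptionsSketch().compression("zstd(3)").singleFileOutput(true));
            df.calls().forEach(System.out::println);
        } // close() is still required after writing
    }
}

/** Hypothetical stand-in for ParquetWriteOptions; not the class added by the PR. */
final class ParquetWriteOptionsSketch {
    // Assumed defaults: null means "no codec requested"; false means multi-file output.
    private String compression;
    private boolean singleFileOutput;

    ParquetWriteOptionsSketch compression(String codec) {
        // Stored verbatim; per the PR, validation only happens natively at write time.
        this.compression = Objects.requireNonNull(codec, "codec");
        return this;
    }

    ParquetWriteOptionsSketch singleFileOutput(boolean single) {
        this.singleFileOutput = single;
        return this;
    }

    String compression() { return compression; }

    boolean singleFileOutput() { return singleFileOutput; }
}

/** Hypothetical stand-in for a native-backed DataFrame handle. */
final class DataFrameSketch implements AutoCloseable {
    private final List<String> calls = new ArrayList<>();
    private boolean closed;

    void writeParquet(String path) {
        writeParquet(path, new ParquetWriteOptionsSketch());
    }

    void writeParquet(String path, ParquetWriteOptionsSketch options) {
        if (closed) {
            throw new IllegalStateException("DataFrame already closed");
        }
        // The real method would cross JNI here; this sketch records the arguments instead.
        calls.add(path + " [compression=" + options.compression()
                + ", singleFileOutput=" + options.singleFileOutput() + "]");
    }

    List<String> calls() { return Collections.unmodifiableList(calls); }

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }
}
```

Routing the no-options overload through the options overload is one plausible reading of "one new JNI function"; the actual implementation may differ.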

Deliberately not included this round:

- An overwriteMode knob / Java InsertOp enum: DataFusion 53.1.0's write_parquet only implements Append (Overwrite and Replace raise "not implemented"). Re-adding the knob once upstream support lands is a non-breaking addition.
- partition_by, sort_by, and row-group / page-size / dictionary / statistics / bloom-filter knobs: out of scope for the first cut.
- CSV / JSON / Avro write outputs and any shared WriteOptions base.

Are these changes tested?

Yes. Two test files added:

- ParquetWriteOptionsTest: pure-Java unit tests for defaults and fluent setter behavior.
- DataFrameWriteParquetTest: four integration tests guarded by Assumptions.assumeTrue(Files.exists(lineitem)) so they skip cleanly when tpch-data/ is absent:
  - Round-trip row count via the no-options overload (multi-file output).
  - singleFileOutput(true) produces a regular file at the supplied path.
  - compression("zstd(3)") writes and reads back with row count preserved.
  - DataFrame remains usable after writeParquet (pins the retain semantics).
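
The single-file versus multi-file checks listed above come down to inspecting what exists at the output path. The sketch below shows one way such a check could look using only java.nio.file. It is not taken from the PR's test code: ParquetOutputLayout, its method names, and the part-file names are hypothetical, and it assumes multi-file output is a directory containing .parquet part files. The demo creates empty placeholder files rather than writing Parquet.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical helper for classifying what a write left at the output path. */
final class ParquetOutputLayout {
    enum Kind { MISSING, SINGLE_FILE, DIRECTORY }

    static Kind classify(Path output) {
        if (Files.isRegularFile(output)) {
            return Kind.SINGLE_FILE;
        }
        if (Files.isDirectory(output)) {
            return Kind.DIRECTORY;
        }
        return Kind.MISSING;
    }

    /** Counts direct children of dir whose file name ends in ".parquet". */
    static long countParquetParts(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> p.getFileName().toString().endsWith(".parquet"))
                    .count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("layout-sketch");
        // Simulate both layouts with empty placeholder files; nothing here is real Parquet.
        Path single = Files.createFile(tmp.resolve("single.parquet"));
        Path multi = Files.createDirectory(tmp.resolve("multi"));
        Files.createFile(multi.resolve("part-0.parquet"));
        Files.createFile(multi.resolve("part-1.parquet"));

        System.out.println(classify(single));
        System.out.println(classify(multi) + " with " + countParquetParts(multi) + " part files");
        System.out.println(classify(tmp.resolve("absent")));
    }
}
```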

make test shows 42 tests, 0 failures. cargo clippy --all-targets --workspace -- -D warnings is clean.

Are there any user-facing changes?

Yes — adds new public API (DataFrame.writeParquet overloads + ParquetWriteOptions). No removals or breaking changes.

andygrove added 3 commits May 13, 2026 08:22

feat(write): add ParquetWriteOptions

aefdd9f

feat(write): add DataFrame.writeParquet with native binding

d94d946

test(write): cover single-file, compression, retain semantics

5577bb9

andygrove merged commit 423ff8a into apache:main May 13, 2026
1 check passed

andygrove deleted the feat/write-parquet branch May 13, 2026 15:19

This was referenced May 13, 2026

docs: remove project-status checklist #34

Merged

feat: add DataFrame.writeCsv with CsvWriteOptions #38

Closed

feat: add DataFrame.writeJson with JsonWriteOptions #39

Closed

This was referenced May 15, 2026

feat(dataframe): add writeCsv with CsvWriteOptions #53

Merged

feat(dataframe): add writeJson with JsonWriteOptions #61

Merged
