feat(write): add DataFrame.writeParquet with ParquetWriteOptions #27

Merged: andygrove merged 3 commits into apache:main from andygrove:feat/write-parquet on May 13, 2026

@andygrove
Member

Which issue does this PR close?

  • Closes #.

Rationale for this change

DataFusion's DataFrame::write_parquet is the natural sink for transformed data, but today the Java bindings cannot write results back to Parquet; the only way to get results out is to collect() them into JVM-side Arrow batches. This blocks ETL-shaped workloads where read → transform → write needs to stay in native code.

What changes are included in this PR?

  • New org.apache.datafusion.ParquetWriteOptions (fluent setters for compression and singleFileOutput), shaped to mirror the existing ParquetReadOptions.
  • DataFrame.writeParquet(String) and DataFrame.writeParquet(String, ParquetWriteOptions) overloads. Both retain the DataFrame (clone on the Rust side), matching the existing count() / show() pattern; the receiver stays usable and must still be closed.
  • One new JNI function Java_org_apache_datafusion_DataFrame_writeParquetWithOptions in native/src/lib.rs. Compression strings (e.g. "zstd(3)", "snappy", "uncompressed") are passed verbatim to DataFusion; invalid values surface as RuntimeException at write time.
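
To make the fluent-setter shape above concrete, here is a minimal sketch in plain Java. `ParquetWriteOptionsSketch` is a hypothetical stand-in, not the class added by this PR. The field types, the "null means unset" convention, and the getter names are assumptions for illustration, since the description does not state the real defaults or accessors.

```java
import java.util.Objects;

/**
 * Hypothetical stand-in for the fluent options shape described in this PR.
 * It is NOT org.apache.datafusion.ParquetWriteOptions.
 */
final class ParquetWriteOptionsSketch {
    // Assumption: null means "not set, defer to the engine default".
    // The PR description does not state the real default values.
    private String compression;
    private Boolean singleFileOutput;

    /** Stores the compression string verbatim (e.g. "zstd(3)"); no validation here. */
    ParquetWriteOptionsSketch compression(String codec) {
        this.compression = Objects.requireNonNull(codec, "codec");
        return this; // returning this is what allows chained calls
    }

    /** Requests a single output file instead of multi-file output. */
    ParquetWriteOptionsSketch singleFileOutput(boolean single) {
        this.singleFileOutput = single;
        return this;
    }

    // Hypothetical accessors, used only to inspect the sketch.
    String compression() {
        return compression;
    }

    Boolean singleFileOutput() {
        return singleFileOutput;
    }
}
```

With the real bindings, and assuming a no-argument constructor, the two overloads would presumably be called as `df.writeParquet(path)` or `df.writeParquet(path, new ParquetWriteOptions().compression("zstd(3)").singleFileOutput(true))`. Because the compression string is passed through verbatim, a typo in it would only surface as a RuntimeException at write time, not when the setter is called.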

Deliberately not included this round:

  • An overwriteMode knob / Java InsertOp enum — DataFusion 53.1.0's write_parquet only implements Append (Overwrite and Replace raise "not implemented"). Re-adding the knob once upstream support lands is a non-breaking addition.
  • partition_by, sort_by, row-group / page-size / dictionary / statistics / bloom-filter knobs — out of scope for the first cut.
  • CSV / JSON / Avro write outputs and any shared WriteOptions base.

Are these changes tested?

Yes. Two test files added:

  • ParquetWriteOptionsTest — pure-Java unit tests for defaults and fluent setter behavior.
  • DataFrameWriteParquetTest — four integration tests guarded by Assumptions.assumeTrue(Files.exists(lineitem)) so they skip cleanly when tpch-data/ is absent:
    • Round-trip row count via the no-options overload (multi-file output).
    • singleFileOutput(true) produces a regular file at the supplied path.
    • compression("zstd(3)") writes and reads back with row count preserved.
    • DataFrame remains usable after writeParquet (pins the retain semantics).
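
The output-shape checks in the list above (a regular file for singleFileOutput(true), several files otherwise) can be sketched with java.nio alone. This is an illustrative helper, not code from the PR's test files. `ParquetOutputShape` and its method names are hypothetical, and it assumes that multi-file output is a directory whose part files end in ".parquet".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical helper for asserting the on-disk shape of a Parquet write. */
final class ParquetOutputShape {
    private ParquetOutputShape() {}

    /** True when the path is one regular file, the shape expected for single-file output. */
    static boolean isSingleFile(Path output) {
        return Files.isRegularFile(output);
    }

    /**
     * Counts regular files ending in ".parquet" directly inside a directory.
     * Assumption: part files carry that extension. Returns 0 for a non-directory.
     */
    static long countPartFiles(Path outputDir) throws IOException {
        if (!Files.isDirectory(outputDir)) {
            return 0;
        }
        // Files.list holds an open directory handle, so close it via try-with-resources.
        try (Stream<Path> entries = Files.list(outputDir)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".parquet"))
                .count();
        }
    }
}
```

A test in this style would call `isSingleFile` on the supplied path after a singleFileOutput(true) write, and check `countPartFiles` is at least 1 after a write through the no-options overload.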

make test shows 42 tests, 0 failures. cargo clippy --all-targets --workspace -- -D warnings is clean.

Are there any user-facing changes?

Yes — adds new public API (DataFrame.writeParquet overloads + ParquetWriteOptions). No removals or breaking changes.

@andygrove andygrove merged commit 423ff8a into apache:main May 13, 2026
1 check passed
@andygrove andygrove deleted the feat/write-parquet branch May 13, 2026 15:19