Is your feature request related to a problem or challenge?
DataFrame introspection is currently limited to count(), show(),
and show(int). Users who want to inspect the schema, see the query
plan, or materialize an intermediate result have no Java entry point.
Describe the solution you'd like
DataFrame.schema(): returns an Arrow Schema. Reuses the IPC
round-trip already established for SessionContext.tableSchema
(the tableSchemaIpc pattern in SessionContext.java). Non-consuming;
the DataFrame remains usable.
DataFrame.explain(boolean verbose, boolean analyze): wraps
DataFusion's DataFrame::explain and returns a DataFrame whose rows
are the plan-explanation strings. The caller then calls show() or
collect() on the result. Matches DataFusion's own semantics.
DataFrame.cache(): materializes the plan into an in-memory
table and returns a new DataFrame. Async on the Rust side; the call
blocks on the Tokio runtime, the same pattern as collect. The caller
closes the returned DataFrame.
DataFrame.describe(): async on the Rust side; returns a new
DataFrame with summary stats (count, mean, stddev, min, max) per
numeric column. Same blocking and ownership pattern as cache.
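To make the proposed surface concrete, here is a minimal, dependency-free sketch of the four methods and the intended calling pattern. Everything in it is illustrative: DataFrameIntrospection stands in for the real DataFrame class, SchemaSketch stands in for Arrow's Schema, and FakeDataFrame is a toy in-memory implementation with no native calls. It is not the binding's actual code.

```java
import java.util.List;

// Hypothetical stand-in for Arrow's Schema, so the sketch compiles
// without the Arrow dependency.
final class SchemaSketch {
    private final List<String> fieldNames;
    SchemaSketch(List<String> fieldNames) { this.fieldNames = List.copyOf(fieldNames); }
    List<String> fieldNames() { return fieldNames; }
}

// Hypothetical interface mirroring the four proposed methods.
interface DataFrameIntrospection extends AutoCloseable {
    SchemaSketch schema();                                   // non-consuming
    DataFrameIntrospection explain(boolean verbose, boolean analyze);
    DataFrameIntrospection cache();                          // caller closes the result
    DataFrameIntrospection describe();                       // caller closes the result
    @Override
    void close();                                            // narrowed: no checked exception
}

// Toy implementation (assumption): tracks only a schema and an open/closed flag.
final class FakeDataFrame implements DataFrameIntrospection {
    private final SchemaSketch schema;
    private boolean closed = false;

    FakeDataFrame(SchemaSketch schema) { this.schema = schema; }

    @Override public SchemaSketch schema() { ensureOpen(); return schema; }

    @Override public DataFrameIntrospection explain(boolean verbose, boolean analyze) {
        ensureOpen();
        // DataFusion's explain output uses the columns plan_type and plan.
        return new FakeDataFrame(new SchemaSketch(List.of("plan_type", "plan")));
    }

    @Override public DataFrameIntrospection cache() {
        ensureOpen();
        return new FakeDataFrame(schema); // new handle, same schema
    }

    @Override public DataFrameIntrospection describe() {
        ensureOpen();
        // The shape of the real summary table is not modeled here.
        return new FakeDataFrame(schema);
    }

    @Override public void close() { closed = true; }

    boolean isClosed() { return closed; }

    private void ensureOpen() {
        if (closed) throw new IllegalStateException("DataFrame is closed");
    }
}

final class IntrospectionDemo {
    public static void main(String[] args) {
        // Each returned DataFrame is its own resource and is closed independently.
        try (FakeDataFrame df = new FakeDataFrame(new SchemaSketch(List.of("id", "price")));
             DataFrameIntrospection plan = df.explain(false, false);
             DataFrameIntrospection cached = df.cache()) {
            System.out.println(df.schema().fieldNames());
            System.out.println(plan.schema().fieldNames());
            System.out.println(cached.schema().fieldNames());
        }
    }
}
```

With try-with-resources, the source DataFrame and every DataFrame derived from it are closed in reverse order of creation, which fits the ownership rule that the caller closes whatever it receives.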
Describe alternatives you've considered
For schema: ctx.tableSchema(name) works only for registered
tables. A user who built a DataFrame via sql("SELECT …") or chained
transformations has no schema accessor.
For explain: ctx.sql("EXPLAIN <query>") works, but only against a
SQL string; it does not help with a DataFrame built through chained
transformations.
Additional context
schema and explain are non-consuming. cache and describe return
new DataFrames that the caller owns and closes. The schema/explain
pair is the most requested and could land first as a smaller PR;
cache and describe are independent and can follow.