feat(dataframe): expose introspection methods (schema, explain, cache, describe) #45

@andygrove

Is your feature request related to a problem or challenge?

DataFrame introspection is currently limited to count(), show(),
and show(int). Users who want to inspect the schema, view the query
plan, or materialize an intermediate result have no Java entry point.

Describe the solution you'd like

  • DataFrame.schema() — return an Arrow Schema. Reuse the IPC
    round-trip already established for SessionContext.tableSchema
    (tableSchemaIpc pattern in SessionContext.java). Non-consuming;
    the DataFrame remains usable.
  • DataFrame.explain(boolean verbose, boolean analyze) — wraps
    DataFusion::DataFrame::explain, returning a DataFrame whose rows
    are the plan-explanation strings. Caller calls show() / collect()
    on the result. Matches DataFusion's own semantics.
  • DataFrame.cache() — materializes the plan into an in-memory
    table and returns a new DataFrame. Async on the Rust side; blocks on
    the Tokio runtime, same pattern as collect. The caller closes the
    returned DataFrame.
  • DataFrame.describe() — async, returns a DataFrame with summary
    stats (count, mean, stddev, min, max) per numeric column. Same
    pattern as cache.
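
The four methods above could surface in Java roughly as follows. This is only a sketch of the ownership rules, not the binding itself: `IntrospectableFrame`, `StubFrame`, and `IntrospectionSketch` are hypothetical names, `List<String>` stands in for the Arrow `Schema` so the example compiles without Arrow on the classpath, and the in-memory stub replaces the native calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical surface for the four proposed methods; not the final API.
interface IntrospectableFrame extends AutoCloseable {
    // Stand-in for an Arrow Schema so the sketch compiles without Arrow.
    List<String> schema();

    IntrospectableFrame explain(boolean verbose, boolean analyze);

    IntrospectableFrame cache();

    IntrospectableFrame describe();

    @Override
    void close(); // narrowed so callers need no checked-exception handling
}

// In-memory stub standing in for the native-backed DataFrame (assumption).
final class StubFrame implements IntrospectableFrame {
    private final List<String> fields;
    private boolean closed;

    StubFrame(List<String> fields) {
        this.fields = List.copyOf(fields);
    }

    boolean isClosed() {
        return closed;
    }

    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("DataFrame is closed");
        }
    }

    @Override
    public List<String> schema() {
        ensureOpen();
        return fields; // non-consuming: the frame stays usable
    }

    @Override
    public IntrospectableFrame explain(boolean verbose, boolean analyze) {
        ensureOpen();
        // DataFusion's explain output has plan_type and plan columns.
        return new StubFrame(List.of("plan_type: Utf8", "plan: Utf8"));
    }

    @Override
    public IntrospectableFrame cache() {
        ensureOpen();
        // A cached frame keeps the source schema.
        return new StubFrame(fields);
    }

    @Override
    public IntrospectableFrame describe() {
        ensureOpen();
        // Placeholder shape only: a label column followed by the source fields.
        List<String> out = new ArrayList<>();
        out.add("describe: Utf8");
        out.addAll(fields);
        return new StubFrame(out);
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class IntrospectionSketch {
    public static void main(String[] args) {
        try (StubFrame df = new StubFrame(List.of("id: Int64", "name: Utf8"))) {
            System.out.println(df.schema());
            // Each derived frame is owned and closed by the caller.
            try (IntrospectableFrame plan = df.explain(false, false);
                 IntrospectableFrame cached = df.cache();
                 IntrospectableFrame stats = df.describe()) {
                System.out.println(plan.schema());
                System.out.println(cached.schema());
                System.out.println(stats.schema());
            }
            // The source is still open after the derived frames are closed.
            System.out.println(df.schema());
        }
    }
}
```

Extending `AutoCloseable` lets each returned frame sit in a try-with-resources block, so closing a derived frame never touches the source.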

Describe alternatives you've considered

For schema: ctx.tableSchema(name) works only for registered
tables. A user who built a DataFrame via sql("SELECT …") or chained
transformations has no schema accessor.

For explain: ctx.sql("EXPLAIN <query>") works, but only against a
SQL string; a DataFrame built through chained transformations cannot
be explained this way.
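
To make the limitation of the SQL workaround concrete, here is a minimal sketch of what a caller has to do today to mirror the proposed verbose and analyze flags. `ExplainSql` is a hypothetical helper, not part of the binding; the keyword order follows DataFusion's `EXPLAIN [ANALYZE] [VERBOSE]` SQL syntax. The result would be passed to ctx.sql(...), which is why the approach needs the original query text.

```java
// Hypothetical helper: builds the SQL text a caller would pass to ctx.sql(...).
final class ExplainSql {
    private ExplainSql() {}

    // Prefixes the query with EXPLAIN and the optional ANALYZE / VERBOSE keywords.
    static String of(String query, boolean verbose, boolean analyze) {
        StringBuilder sb = new StringBuilder("EXPLAIN ");
        if (analyze) {
            sb.append("ANALYZE ");
        }
        if (verbose) {
            sb.append("VERBOSE ");
        }
        return sb.append(query).toString();
    }

    public static void main(String[] args) {
        System.out.println(of("SELECT a FROM t", false, false));
        System.out.println(of("SELECT a FROM t", true, true));
    }
}
```

A DataFrame assembled from chained transformations has no SQL string to feed into such a helper, which is the gap DataFrame.explain would close.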

Additional context

All four leave the source DataFrame usable; cache and describe
additionally return new DataFrames that the caller owns and closes.
The schema/explain pair is the most requested and could land first as
a smaller PR; cache and describe are independent and can follow.

Labels: enhancement (New feature or request)
