bug(udf): scalar UDFs over nested Arrow types cannot be registered #58

@LantaoJin

Describe the bug

ScalarFunction.argTypes() returns List<ArrowType> and returnType() returns ArrowType (core/src/main/java/org/apache/datafusion/ScalarFunction.java:47, :50). In Arrow Java, ArrowType identifies only the type kind. For primitives such as Int32 or Float64 that is self-describing, but for nested types (List, Struct, Map, FixedSizeList) the element / member / key / value types live on the parent Field's children list, not inside ArrowType itself. ArrowType.List, for example, is a marker class with no fields at all.

That mismatch means a Java UDF author has no way to declare a typed nested signature. The closest they can write is:

public List<ArrowType> argTypes() {
  return List.of(new ArrowType.List());  // says "list" -- cannot say "of Int32"
}
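For contrast, Arrow Java can describe a fully typed list, but only at the Field level. The sketch below assumes arrow-vector is on the classpath. The class name TypedListField and the child name "item" are illustrative choices (the latter mirrors the Rust example later in this issue) and are not part of DataFusion-Java.

```java
import java.util.List;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

// Illustrative helper, not part of DataFusion-Java.
final class TypedListField {
    /** Builds a nullable list<int32> field; the element type is carried by the child Field. */
    static Field listOfInt32(String name) {
        // The child Field is where "of Int32" is recorded.
        Field item = new Field("item", FieldType.nullable(new ArrowType.Int(32, true)), null);
        // ArrowType.List only says "list"; the children list supplies the element type.
        return new Field(name, FieldType.nullable(new ArrowType.List()), List.of(item));
    }
}
```

A signature expressed as List<ArrowType> has no slot for that children list, which is why the element type cannot be declared through the current interface.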

When this is passed to SessionContext.registerUdf(ScalarUdf) the registration path at core/src/main/java/org/apache/datafusion/SessionContext.java:385-389 constructs the signature schema as:

fields.add(new Field("return", FieldType.nullable(returnType), null));
for (int i = 0; i < argTypes.size(); i++) {
  fields.add(new Field("arg" + i, FieldType.nullable(argTypes.get(i)), null));
}

The null children list is the bug: Arrow's IPC writer rejects the malformed List field during serializeSchemaIpc(...) before the schema ever crosses JNI. The user sees a low-level IllegalArgumentException: Lists have one child Field. Found: none.
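The failure mode can be illustrated without Arrow on the classpath. The following is a minimal sketch using simplified stand-ins: NestedSignatureSketch, Kind, FieldSketch and validate are hypothetical names, and the check only mirrors the error message quoted above. It is not Arrow's actual implementation.

```java
import java.util.List;

// Simplified stand-ins for Arrow's type/field split; these are NOT the real Arrow classes.
final class NestedSignatureSketch {
    /** Type kind only: like ArrowType.List, LIST says nothing about its element type. */
    enum Kind { INT32, LIST }

    /** A field pairs a kind with child fields; nested structure lives only here. */
    static final class FieldSketch {
        final String name;
        final Kind kind;
        final List<FieldSketch> children;

        FieldSketch(String name, Kind kind, List<FieldSketch> children) {
            this.name = name;
            this.kind = kind;
            // A null children list is treated as "no children".
            this.children = children == null ? List.of() : List.copyOf(children);
        }
    }

    /** Hypothetical check mirroring the error in this issue: a list needs exactly one child. */
    static void validate(FieldSketch field) {
        if (field.kind == Kind.LIST && field.children.size() != 1) {
            String found = field.children.isEmpty() ? "none" : String.valueOf(field.children.size());
            throw new IllegalArgumentException("Lists have one child Field. Found: " + found);
        }
        for (FieldSketch child : field.children) {
            validate(child);
        }
    }

    public static void main(String[] args) {
        // What a kind-only signature produces: a list field with no children.
        FieldSketch kindOnly = new FieldSketch("arg0", Kind.LIST, null);
        try {
            validate(kindOnly);
        } catch (IllegalArgumentException e) {
            System.out.println("kind-only list rejected: " + e.getMessage());
        }

        // With a child field attached, the same list passes validation.
        FieldSketch item = new FieldSketch("item", Kind.INT32, null);
        validate(new FieldSketch("arg0", Kind.LIST, List.of(item)));
        System.out.println("list<int32> accepted");
    }
}
```

The point of the sketch is that no value of the kind alone can satisfy the check. The child has to come from the caller, so the signature type itself needs to carry it.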

This blocks the entire family of nested-type UDFs that exist as built-ins in DataFusion's datafusion-functions-nested crate (array_length, cardinality, array_has, array_position, flatten, map_keys, map_values, arrays_zip, ...). Anyone porting Spark UDFs over ArrayType / StructType / MapType columns to DataFusion-Java hits this on the first attempt.

The Rust API does not have this problem: DataType::List(Arc<Field>) carries the child field inline, so Signature::exact(vec![DataType::List(Arc::new(Field::new("item", DataType::Int32, true)))], ...) round-trips with full structure.

To Reproduce

static final class ListLength implements ScalarFunction {
  public String name() { return "java_list_length"; }
  public List<ArrowType> argTypes() { return List.of(new ArrowType.List()); }
  public ArrowType returnType() { return new ArrowType.Int(32, true); }
  public Volatility volatility() { return Volatility.IMMUTABLE; }
  public FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args, int rowCount) {
    /* ... */
  }
}

new SessionContext().registerUdf(new ScalarUdf(new ListLength()));
// throws:
//   IllegalArgumentException: Lists have one child Field. Found: none
//   at SessionContext.serializeSchemaIpc(SessionContext.java:398)
//   at SessionContext.registerUdf(SessionContext.java:391)

Expected behavior

A UDF whose argument or return type is a nested Arrow type registers successfully and is callable from SQL with full element-type information preserved end-to-end (Java → JNI → Rust Signature::exact).

Additional context

No response

Labels: bug
