fix(udf): declare scalar UDF signatures with Field, not ArrowType#59
Open
LantaoJin wants to merge 2 commits into
Open
fix(udf): declare scalar UDF signatures with Field, not ArrowType#59LantaoJin wants to merge 2 commits into
LantaoJin wants to merge 2 commits into
Conversation
Change ScalarFunction.argTypes() / returnType() (List<ArrowType> / ArrowType) to argFields() / returnField() (List<Field> / Field). SessionContext.registerUdf forwards the Fields straight through. JavaScalarUdf stores the full return FieldRef and overrides ScalarUDFImpl::return_field_from_args, so declared nullability and metadata round-trip into the result schema. ArrowType is a leaf marker in Java Arrow: ArrowType.List has no fields, and child element / member / key / value types live on the parent Field's children list. The previous registration code reconstructed the schema with `new Field(..., FieldType.nullable( type), null)`, dropping nested-type metadata; the previous Rust impl only stored a DataType, so the default return_field_from_args wrapped results in a fresh always-nullable Field. Both are fixed by storing and forwarding the user's Fields verbatim.
…w-types # Conflicts: # core/src/main/java/org/apache/datafusion/ScalarFunction.java # examples/src/main/java/org/apache/datafusion/examples/AddOneExample.java # native/src/udf.rs
Contributor
Author
|
@andygrove resolved conflicts from #64 , can you review this? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
ScalarFunction.argTypes()returnedList<ArrowType>andreturnType()returnedArrowType. In Java Arrow,ArrowTypeis a leaf marker for the type kind: it is self-describing for primitives likeInt32orFloat64, but for nested types (List,Struct,Map,FixedSizeList) the element / member / key / value types live on the parentField'schildrenlist, not insideArrowType.ArrowType.Listis literally a no-field marker class.A Java UDF author therefore had no way to declare a typed nested signature. Trying
argTypes() = List.of(new ArrowType.List())blew up at registration time:This blocked the entire family of nested-type UDFs that exist as built-ins in DataFusion's
datafusion-functions-nestedcrate (array_length,cardinality,array_has,array_position,flatten,map_keys,map_values,arrows_zip, ...). Anyone porting Spark UDFs overArrayType/StructType/MapTypecolumns to DataFusion-Java hit this on the first attempt.The Rust API does not have this problem because
DataType::List(Arc<Field>)carries the child field inline. Switching the Java interface fromArrowTypetoFieldis the structural mirror:Fieldis the only type that can carry children, so it's the type the interface has always needed to use.What changes are included in this PR?
See commit log
Are these changes tested?
Yes -- 5 new tests over and above the existing 12.
Are there any user-facing changes?
Yes -- a source-breaking signature change to the public
ScalarFunctioninterface. Existing primitive UDFs become slightly more verbose:Nested-type UDFs that were previously impossible to declare can now be expressed by attaching child fields. For example, a UDF over
List<Int32>returning the list length:Same shape works for
Struct<a: Int32, b: Int32>(children are the member fields) andMap<Utf8, Int32>(children describe the key/value entries struct).The repo is pre-release, which makes this the right time to tighten the interface before downstream callers accumulate.