
fix(udf): pass batch row count to ScalarFunction.evaluate#57

Closed
LantaoJin wants to merge 1 commit into apache:main from LantaoJin:fix/udf-nullary-row-count

@LantaoJin
Contributor

Which issue does this PR close?

Rationale for this change

ScalarFunction.evaluate(BufferAllocator, List<FieldVector>) (introduced in #46) is the contract every Java-implemented scalar UDF must satisfy. It must return a FieldVector whose getValueCount() matches the batch row count DataFusion is driving through the operator tree.

For UDFs with at least one argument, the body can read args.get(0).getValueCount() to learn how many rows to produce. For nullary UDFs -- zero arguments, e.g. analogs of random(), pi(), now() -- args is the empty list, and the body has no other channel to learn the row count.

The native side already knows the value: ScalarFunctionArgs::number_rows is read at native/src/udf.rs:100 and used at :106 to materialise scalar argument columns. The Java bridge (JniBridge.invokeScalarUdf) receives it but uses it only after the call, to validate the returned vector's length. It is never passed to impl.evaluate(...).

The result: any nullary UDF that DataFusion does not constant-fold (anything declared Volatility.VOLATILE, or STABLE calls in plans the optimizer cannot fold) trips the post-hoc row-count validation as soon as it runs over a batch with more than one row.
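The failure mode can be reproduced in a dependency-free sketch. Everything below is a hypothetical stand-in, not the project's code: `double[]` models Arrow's `FieldVector`, the `BufferAllocator` parameter is omitted, and `OldScalarFunction` / `OldBridgeSketch` are illustrative names for the pre-change contract and the post-hoc length check.

```java
import java.util.List;

// Hypothetical stand-in for the pre-change contract (assumption: double[]
// replaces FieldVector and the allocator parameter is dropped).
interface OldScalarFunction {
    double[] evaluate(List<double[]> args);
}

final class OldBridgeSketch {
    // Mimics the post-hoc validation: the row count is known here,
    // but it is only checked after the UDF has already produced its output.
    static double[] invoke(OldScalarFunction impl, List<double[]> args, int expectedRowCount) {
        double[] out = impl.evaluate(args);
        if (out.length != expectedRowCount) {
            throw new IllegalStateException(
                    "UDF returned " + out.length + " rows, expected " + expectedRowCount);
        }
        return out;
    }

    // Unary UDF: learns the row count from its first argument.
    static final OldScalarFunction ADD_ONE = args -> {
        double[] in = args.get(0);
        double[] out = new double[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] + 1.0;
        }
        return out;
    };

    // Nullary UDF: args is empty, so the body can only guess a length (here, 1).
    static final OldScalarFunction PI = args -> new double[] {Math.PI};
}
```

In this sketch `ADD_ONE` passes validation for any batch size, while `PI` passes only when the batch happens to contain exactly one row.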

What changes are included in this PR?

  • ScalarFunction.evaluate(BufferAllocator allocator, List<FieldVector> args, int rowCount): adds a third parameter carrying the per-batch row count. This is a source-breaking signature change to a public interface. The repo is pre-release, and only five existing implementations needed an unused-parameter update (four test UDFs in ScalarUdfTest, one in examples/AddOneExample).
  • JniBridge.invokeScalarUdf (core/src/main/java/org/apache/datafusion/internal/JniBridge.java) now forwards the existing expectedRowCount parameter into impl.evaluate(...). Post-call validation against the same value is unchanged.
  • No native-side change. The value was already on the wire.
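A minimal sketch of the forwarding described in the list above, under the same simplifying assumptions: `double[]` stands in for `FieldVector`, the allocator is omitted, and `RowCountScalarFunction` and `ForwardingBridgeSketch` are hypothetical names rather than the actual `ScalarFunction` and `JniBridge` classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the post-change interface (assumption: double[]
// models FieldVector; the BufferAllocator parameter is omitted).
interface RowCountScalarFunction {
    double[] evaluate(List<double[]> args, int rowCount);
}

final class ForwardingBridgeSketch {
    // Forwards the batch row count into the UDF, then applies the same
    // post-call length check against that value.
    static double[] invokeScalarUdf(
            RowCountScalarFunction impl, List<double[]> args, int expectedRowCount) {
        double[] out = impl.evaluate(args, expectedRowCount);
        if (out.length != expectedRowCount) {
            throw new IllegalStateException(
                    "UDF returned " + out.length + " rows, expected " + expectedRowCount);
        }
        return out;
    }

    // Nullary UDF: args is empty, so rowCount is its only sizing information.
    static final RowCountScalarFunction PI = (args, rowCount) -> {
        double[] out = new double[rowCount];
        Arrays.fill(out, Math.PI);
        return out;
    };
}
```

With the row count forwarded, the nullary body can size its output for any batch, and the unchanged validation acts as a safety net rather than a guaranteed failure.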

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, a source-breaking signature change to ScalarFunction.evaluate. Implementations of the interface need to add an int rowCount parameter to their evaluate override. Bodies that ignore the new parameter are otherwise unchanged.

Before:

public FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args) {
  // ...
}

After:

public FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args, int rowCount) {
  // ...
}
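To illustrate that an existing body can stay untouched, here is a sketch of a migration adapter. The helper `ignoringRowCount` is hypothetical and not part of this PR, and the types are simplified stand-ins (`double[]` for `FieldVector`, no allocator parameter).

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the three-parameter interface
// (assumption: double[] models FieldVector; allocator omitted).
interface ThreeArgScalarFunction {
    double[] evaluate(List<double[]> args, int rowCount);
}

final class MigrationSketch {
    // Hypothetical helper: wraps a pre-change body and ignores rowCount,
    // which is safe only for UDFs that take at least one argument.
    static ThreeArgScalarFunction ignoringRowCount(
            Function<List<double[]>, double[]> legacyBody) {
        return (args, rowCount) -> legacyBody.apply(args);
    }

    // A pre-change body: sizes its output from the first argument.
    static double[] addOne(List<double[]> args) {
        double[] in = args.get(0);
        double[] out = new double[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] + 1.0;
        }
        return out;
    }
}
```

For a UDF with at least one argument, the forwarded row count and the length of the first argument agree, so ignoring the new parameter does not change the result.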

Add an int rowCount parameter to ScalarFunction.evaluate. JniBridge
already receives the value from the native side as expectedRowCount
for post-call validation; now it is also forwarded into evaluate.

For UDFs with at least one argument the value matches what the body
could read from args.get(0).getValueCount(). For nullary UDFs (args
is empty), this is the only channel that communicates the batch row
count, making it possible to implement Volatility.VOLATILE nullary
functions like random() / now().
@LantaoJin
Contributor Author

Superseded by upstream #64; nullary row count is now available via ScalarFunctionArgs.rowCount().

@LantaoJin LantaoJin closed this May 19, 2026

Development

Successfully merging this pull request may close these issues.

bug(udf): nullary scalar UDFs cannot determine batch row count