Add support for STDDEV_POP and STDDEV_SAMP in VarianceFn #38871

Merged
damccorm merged 5 commits into apache:master from damccorm:feature/sql-variance
Jun 15, 2026

Conversation

@damccorm

@damccorm damccorm commented Jun 9, 2026

Contributor

Adds support for standard deviation aggregation functions (STDDEV_POP and STDDEV_SAMP) to Beam SQL, improving compatibility with Spark SQL.

In standard SQL, standard deviation is mathematically the square root of variance. Calcite typically translates STDDEV_SAMP(x) into SQRT(VAR_SAMP(x)).
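For reference, the relationship between the two standard deviation variants and their variances can be written out in a few lines. This is a minimal sketch of the textbook definitions only; the class and method names are made up for illustration and are not taken from Beam or Calcite.

```java
import java.util.Arrays;

/** Illustrative only: textbook definitions, not Beam's or Calcite's code. */
final class StddevDefinitionSketch {
  private StddevDefinitionSketch() {}

  /** Sum of squared deviations from the mean. */
  static double sumSquaredDeviations(double[] xs) {
    double mean = Arrays.stream(xs).average().orElse(Double.NaN);
    return Arrays.stream(xs).map(x -> (x - mean) * (x - mean)).sum();
  }

  /** STDDEV_POP(x) = SQRT(VAR_POP(x)): divide by n. */
  static double stddevPop(double[] xs) {
    return Math.sqrt(sumSquaredDeviations(xs) / xs.length);
  }

  /** STDDEV_SAMP(x) = SQRT(VAR_SAMP(x)): divide by n - 1, so n >= 2 is assumed. */
  static double stddevSamp(double[] xs) {
    return Math.sqrt(sumSquaredDeviations(xs) / (xs.length - 1));
  }
}
```

For the inputs 2, 4, 4, 4, 5, 5, 7, 9 the population variance is 4, so `stddevPop` returns 2.0, while `stddevSamp` returns the square root of 32/7.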

However, Beam's physical execution layer (specifically the enumerable bridge) cannot easily translate a SQRT scalar function layered on top of a windowed aggregation. Trying to plan this query results in translation failures because the planner cannot bridge the nested execution.

Instead of relying on Calcite to layer SQRT on top of variance, this implements standard deviation directly within the physical aggregation layer:

  1. Extended VarianceFn (the combiner used for VAR_POP and VAR_SAMP) to support standard deviation via a constructor flag (isStddev).
  2. In VarianceFn.extractOutput, if isStddev is true, we compute the double-precision square root of the variance using Math.sqrt() (casting to double to match Spark and numpy precision exactly) and return it as a BigDecimal.
  3. Registered STDDEV_POP and STDDEV_SAMP in BeamBuiltinAggregations to route to these new VarianceFn factories.
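The output step in item 2 can be sketched roughly as follows. This is a simplified stand-in, not the merged code: the real VarianceFn works on a VarianceAccumulator and a decimalConverter, whereas the class and method signature below are hypothetical and take the already-computed variance directly.

```java
import java.math.BigDecimal;

/** Hypothetical stand-in for the output step; not Beam's actual VarianceFn. */
final class StddevOutputSketch {
  private StddevOutputSketch() {}

  /**
   * Returns the variance unchanged, or its square root when isStddev is set.
   * A null variance is passed through as null.
   */
  static BigDecimal extractOutput(BigDecimal variance, boolean isStddev) {
    if (variance == null || !isStddev) {
      return variance;
    }
    // Double-precision square root, then back to BigDecimal.
    double sqrt = Math.sqrt(variance.doubleValue());
    return BigDecimal.valueOf(sqrt);
  }
}
```

With the flag off, the same function keeps serving VAR_POP and VAR_SAMP, which is why a single constructor flag is enough to add both standard deviation variants.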

@damccorm

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for STDDEV_POP and STDDEV_SAMP aggregations in Beam SQL by extending the existing VarianceFn combiner to compute standard deviation end-to-end. The review feedback highlights a potential NullPointerException when computing the square root on a null variance result, and suggests marking the new isStddev field as final to ensure proper visibility and thread safety during serialization.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@damccorm

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds support for standard deviation aggregation functions (STDDEV_POP and STDDEV_SAMP) in Beam SQL by extending the existing VarianceFn combiner and computing the square root of the variance. The review feedback identifies a critical issue where numerical instability or overflow could cause Math.sqrt to return NaN or Infinity, leading to a NumberFormatException when converting back to BigDecimal. Clamping negative values to 0.0 and handling non-finite values safely is recommended.
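One way the clamping and non-finite handling could look is sketched below. The class and method names are hypothetical, and throwing ArithmeticException is only one possible policy for the overflow case. The guard matters because BigDecimal.valueOf(double) throws NumberFormatException for NaN and infinite inputs.

```java
import java.math.BigDecimal;

/** Hypothetical helper illustrating the review feedback; not the merged code. */
final class SafeSqrtSketch {
  private SafeSqrtSketch() {}

  static BigDecimal safeSqrt(BigDecimal variance) {
    double v = variance.doubleValue();
    // Tiny negative values can come from rounding in the variance
    // computation; clamp them so Math.sqrt does not return NaN.
    if (v < 0.0) {
      v = 0.0;
    }
    double sqrt = Math.sqrt(v);
    // doubleValue() yields Infinity when the BigDecimal is too large for a
    // double, and BigDecimal.valueOf would then throw NumberFormatException.
    if (Double.isNaN(sqrt) || Double.isInfinite(sqrt)) {
      throw new ArithmeticException("Standard deviation is not finite");
    }
    return BigDecimal.valueOf(sqrt);
  }
}
```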

@damccorm

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds support for STDDEV_POP and STDDEV_SAMP aggregations in Beam SQL by extending the existing VarianceFn combiner. Feedback focuses on addressing potential artificial overflow or underflow when converting large or small BigDecimal values to double for square root calculations, adding overloads to maintain API consistency, and expanding unit tests to cover extreme variance values.

@damccorm damccorm marked this pull request as ready for review June 12, 2026 18:11
@damccorm

Contributor Author

R: @Abacn

@github-actions

Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment "assign set of reviewers".

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces native support for standard deviation aggregation functions (STDDEV_POP and STDDEV_SAMP) in Beam SQL. By implementing these functions directly within the existing VarianceFn combiner, the changes resolve planning failures that occurred when the Calcite planner attempted to layer SQRT functions over windowed variance aggregations. This approach ensures compatibility with Spark SQL and improves the robustness of the physical execution layer.

Highlights

  • Direct Standard Deviation Implementation: Implemented STDDEV_POP and STDDEV_SAMP directly within the VarianceFn combiner to bypass limitations in the Beam SQL physical execution layer regarding nested SQRT calls.
  • API Expansion: Extended VarianceFn to support a new isStddev flag and added factory methods for standard deviation calculations.
  • Comprehensive Testing: Added new integration tests in BeamSqlDslAggregationVarianceTest and unit tests in VarianceFnTest to verify correct behavior for both population and sample standard deviation.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds support for the STDDEV_POP and STDDEV_SAMP aggregations by extending the existing VarianceFn combiner to compute standard deviation end-to-end. This design bypasses limitations in Beam's enumerable bridge, which cannot translate a SQRT call layered on top of a windowed variance. Feedback on the changes points out a potential issue with intermediate overflow or underflow when converting the BigDecimal variance to a double before taking the square root, and provides a suggested scaling approach to handle extremely large or small values safely.
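The bot's concrete suggestion is not reproduced in this thread, so the following is only one way such scaling might work, with hypothetical names throughout. The idea is to shift the decimal point by an even power of ten so the value fits comfortably in a double, take the square root, and then shift back by half that power.

```java
import java.math.BigDecimal;

/** Hypothetical scaling sketch; not the suggestion or the merged code. */
final class ScaledSqrtSketch {
  private ScaledSqrtSketch() {}

  static BigDecimal scaledSqrt(BigDecimal value) {
    if (value.signum() <= 0) {
      return BigDecimal.ZERO;
    }
    // Decimal exponent of the leading digit: value is about d.ddd * 10^exp.
    int exp = value.precision() - value.scale() - 1;
    // Round down to an even shift so sqrt(10^shift) is exactly 10^(shift/2).
    int shift = exp - Math.floorMod(exp, 2);
    // After shifting, the mantissa lies in [1, 100) and converts safely.
    double mantissa = value.movePointLeft(shift).doubleValue();
    return BigDecimal.valueOf(Math.sqrt(mantissa)).movePointRight(shift / 2);
  }
}
```

With this approach a variance such as 1e400, whose doubleValue() is Infinity, still produces a finite BigDecimal result equal to 1e200, and 4e-400 produces 2e-200 instead of underflowing to zero.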

  public T extractOutput(VarianceAccumulator accumulator) {
-   return decimalConverter.apply(getVariance(accumulator));
+   BigDecimal result = getVariance(accumulator);
+   if (result != null && isStddev) {

Contributor

Another side note: as also said in the description, stddev is simply the sqrt of variance. One line should do the work, and much of the code here is defensive edge-case handling.

Contributor Author

I agree, I think these are reasonable cases to protect against, though.

}
double sqrtVal = Math.sqrt(doubleVal);
if (Double.isInfinite(sqrtVal)) {
  throw new ArithmeticException("Standard deviation overflow: result is infinity");

@Abacn Abacn Jun 12, 2026

Contributor

Probably just return Double.POSITIVE_INFINITY to keep the same behavior as variance.

Contributor Author

Yeah, good call. Done

@damccorm damccorm force-pushed the feature/sql-variance branch from 81a7d9d to 8bd2196 Compare June 15, 2026 15:37
@damccorm damccorm merged commit bde545f into apache:master Jun 15, 2026
30 of 31 checks passed
@damccorm damccorm deleted the feature/sql-variance branch June 15, 2026 19:46

2 participants