Add support for STDDEV_POP and STDDEV_SAMP in VarianceFn#38871
Conversation
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for STDDEV_POP and STDDEV_SAMP aggregations in Beam SQL by extending the existing VarianceFn combiner to compute standard deviation end-to-end. The review feedback highlights a potential NullPointerException when computing the square root on a null variance result, and suggests marking the new isStddev field as final to ensure proper visibility and thread safety during serialization.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds support for standard deviation aggregation functions (STDDEV_POP and STDDEV_SAMP) in Beam SQL by extending the existing VarianceFn combiner and computing the square root of the variance. The review feedback identifies a critical issue where numerical instability or overflow could cause Math.sqrt to return NaN or Infinity, leading to a NumberFormatException when converting back to BigDecimal. Clamping negative values to 0.0 and handling non-finite values safely is recommended.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds support for STDDEV_POP and STDDEV_SAMP aggregations in Beam SQL by extending the existing VarianceFn combiner. Feedback focuses on addressing potential artificial overflow or underflow when converting large or small BigDecimal values to double for square root calculations, adding overloads to maintain API consistency, and expanding unit tests to cover extreme variance values.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
R: @Abacn |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces native support for standard deviation aggregation functions (STDDEV_POP and STDDEV_SAMP) in Beam SQL. By implementing these functions directly within the existing VarianceFn combiner, the changes resolve planning failures that occurred when the Calcite planner attempted to layer SQRT functions over windowed variance aggregations. This approach ensures compatibility with Spark SQL and improves the robustness of the physical execution layer. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request adds support for the STDDEV_POP and STDDEV_SAMP aggregations by extending the existing VarianceFn combiner to compute standard deviation end-to-end. This design bypasses limitations in Beam's enumerable bridge, which cannot translate a SQRT call layered on top of a windowed variance. Feedback on the changes points out a potential issue with intermediate overflow or underflow when converting the BigDecimal variance to a double before taking the square root, and provides a suggested scaling approach to handle extremely large or small values safely.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| public T extractOutput(VarianceAccumulator accumulator) { | ||
| return decimalConverter.apply(getVariance(accumulator)); | ||
| BigDecimal result = getVariance(accumulator); | ||
| if (result != null && isStddev) { |
There was a problem hiding this comment.
Another side note, as also said in the desciption, stddev is simple sqrt of variance. One line should do the work, and much of the code here are defensive edge case handling.
There was a problem hiding this comment.
I agree, I think these are reasonable cases to protect against, though.
| } | ||
| double sqrtVal = Math.sqrt(doubleVal); | ||
| if (Double.isInfinite(sqrtVal)) { | ||
| throw new ArithmeticException("Standard deviation overflow: result is infinity"); |
There was a problem hiding this comment.
Probably just return Double.Infinite to keep the same behavior as variance
There was a problem hiding this comment.
Yeah, good call. Done
…ow instead of throwing exception
81a7d9d to
8bd2196
Compare
Adds support for standard deviation aggregation functions (
STDDEV_POPandSTDDEV_SAMP) to Beam SQL, improving compatibility with Spark SQL.In standard SQL, standard deviation is mathematically the square root of variance. Calcite typically translates
STDDEV_SAMP(x)intoSQRT(VAR_SAMP(x)).However, Beam's physical execution layer (specifically the enumerable bridge) cannot easily translate a
SQRTscalar function layered on top of a windowed aggregation. Trying to plan this query results in translation failures because the planner cannot bridge the nested execution.Instead of relying on Calcite to layer
SQRTon top of variance, this implements standard deviation directly within the physical aggregation layer:VarianceFn(the combiner used forVAR_POPandVAR_SAMP) to support standard deviation via a constructor flag (isStddev).VarianceFn.extractOutput, ifisStddevis true, we compute the double-precision square root of the variance usingMath.sqrt()(casting todoubleto match Spark and numpy precision exactly) and return it as aBigDecimal.STDDEV_POPandSTDDEV_SAMPinBeamBuiltinAggregationsto route to these newVarianceFnfactories.