Fix NativeAOT GC hole issue#129598
Conversation
There is a GC hole when: * an exception is rethrown from a funclet * the exception escapes that funclet * a finally is executed for this secondary exception * GC runs while the call chain of this finally is being executed * A reference in non-volatile register is pushed in a prolog of one of the functions in the finally call chain * the nonvolatile register holds a live reference up somewhere up in the call chain of the parent of the catch handler that catches the secondary exception * the nonvolatile register is not pushed anywhere between the parent of the catch and the frame where the nonvolatile register holds a live GC reference In this case, if GC relocates that reference, it is updated in the stack frame of the finally call chain, but not in the location referenced by the REGDISPLAY in the ExInfo of the secondary exception. So when we resume after catch, the stale reference is placed in the nonvolatile register and then it bubbles up the call chain until it reaches the frame where the register is supposed to hold live GC reference. The fix is to save the nonvolatile registers after returning from a finally funclet back to the location referenced by the REGDISPLAY passed to the RhpCallFinallyFunclet. Close dotnet#129010
|
@jakobbotsch thank you so much for reproducing it with time travel debugging, it would be hard to reason about it without that! |
|
Tagging subscribers to this area: @agocke, @dotnet/ilc-contrib |
There was a problem hiding this comment.
Pull request overview
This PR updates the NativeAOT AMD64 exception-handling stubs so that after executing a finally funclet, the current values of preserved (non-volatile) registers are written back to the locations described by the passed-in REGDISPLAY (and, on Windows x64, the preserved XMM register values are written back into REGDISPLAY). This keeps the REGDISPLAY state consistent if a GC occurs during the finally call chain and relocates references that were temporarily spilled.
Changes:
- Add preserved-register write-back in the System V AMD64
RhpCallFinallyFunclet2path. - Add preserved-register + XMM6–XMM15 write-back in the Windows AMD64
RhpCallFinallyFunclet2path.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/coreclr/nativeaot/Runtime/amd64/ExceptionHandling.S | Writes back AMD64 SysV preserved integer registers to the homes referenced by REGDISPLAY after the finally funclet returns. |
| src/coreclr/nativeaot/Runtime/amd64/ExceptionHandling.asm | Writes back Windows x64 preserved integer registers (and XMM6–XMM15) to REGDISPLAY state after the finally funclet returns. |
|
Can we add a regression test for this? Could you please change "Funclets are not required to preserve non-volatile registers." in https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/clr-abi.md to "Funclets are not required to preserve non-volatile registers that are saved by main method body." Thank you both! |
|
@jkotas, I've added a regression test and updated the ABI doc, can you please take a look again? |
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
/azp run runtime-nativeaot-outerloop |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/ba-g flaky networking test, passed on automatic rerun |
|
@MichalStrehovsky @jakobbotsch Could you please approve this? My approval is not good enough since I have merged doc nit from copilot. |
|
Backport candidate? |
We don't currently have any way to help raise to users that they may want to patch and redeploy their NAOT apps, right? They just need to monitor the monthly servicing release notes and make a judgement call on if it is impactful to them? |
|
This is a concern for all dependencies that are bundled in the app or otherwise affect the app code, it is not specific to native aot. It takes a lot of expertise to evaluate how given (security) fix affects given app. Some teams may choose to do that. However the easiest strategy is to regularly update to latest servicing and redeploy without trying to reason about individual changes. |
I would generalize that to "patch and redeploy their self-contained apps". It's not specific to NAOT and affects all forced-selfcontained environments (mobile/WASM). Or even framework-dependent where there is no servicing story for the framework (macOS). Windows and Linux might be the only "nice" exceptions where one doesn't have to think about it. |
|
/backport to release/10.0 |
|
Started backporting to |
Backport of #129598 to release/10.0 /cc @janvorli ## Customer Impact - [ ] Customer reported - [x] Found internally There is a GC hole in NativeAOT. As any other GC hole, it could lead to intermittent failures of applications due to unexpected NullReferenceException, AccessViolationException or just unexpected behavior. This GC hole occurs in some cases when GC scans stack with active exception handling when an exception thrown from a call chain of a funclet escapes the funclet and GC occurs when a finally handler of that secondary exception is being executed. ## Regression - [x] Yes - [ ] No Introduced by #115284 in .NET 10.0 ## Testing Libraries test that exposed the issue, directed regression test, CI coreclr and libraries tests. ## Risk Low. The change just ensures that a modified non-volatile register value is saved in the stack frame iterator of the pending exception handling, keeping that value up to date in case GC moves it. --------- Co-authored-by: Jan Vorlicek <janvorli@microsoft.com> Co-authored-by: Jan Kotas <jkotas@microsoft.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There is a GC hole when:
In this case, if GC relocates that reference, it is updated in the stack frame of the finally call chain, but not in the location referenced by the
REGDISPLAYin theExInfoof the secondary exception. So when we resume after catch, the stale reference is placed in the nonvolatile register and then it bubbles up the call chain until it reaches the frame where the register is supposed to hold live GC reference.The fix is to save the nonvolatile registers after returning from a finally funclet back to the location referenced by the
REGDISPLAYpassed to theRhpCallFinallyFunclet.Close #129010