Unfortunately, x64 codegen - at least as employed by v8 - for i32x4.trunc_sat_f32x4_s is really elaborate:
https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/compiler/backend/ia32/code-generator-ia32.cc#L2083-L2100
This is 7 instructions for what could be a single x64 instruction (cvttps2dq) if the NaN handling and overflow behavior didn't have to match what the spec requires.
Is there any way this can be improved? I don't have a specific suggestion, but this costs ~10% of instructions (not sure how to measure cycle impact accurately) on one of my functions.
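
For reference, here is a rough scalar sketch of the semantic gap, assuming the single instruction in question is cvttps2dq. The spec requires NaN lanes to become 0 and out-of-range lanes to saturate to INT32_MIN/INT32_MAX, whereas cvttps2dq on its own returns 0x80000000 for every NaN or out-of-range lane. The function names are made up for illustration, and the second function is only a model of the hardware behavior, not what v8 emits:

```rust
/// Lane-wise model of what the spec requires for i32x4.trunc_sat_f32x4_s.
/// Rust's `as` cast from f32 to i32 happens to have the same semantics:
/// truncate toward zero, saturate on overflow, map NaN to 0.
fn trunc_sat_f32x4_s(v: [f32; 4]) -> [i32; 4] {
    v.map(|x| x as i32)
}

/// Hypothetical model of a bare cvttps2dq: any lane that is NaN or does not
/// fit in an i32 after truncation yields the "integer indefinite" value
/// 0x80000000 (i32::MIN), with no saturation.
fn cvttps2dq_model(v: [f32; 4]) -> [i32; 4] {
    v.map(|x| {
        if x.is_nan() || x >= 2147483648.0 || x < -2147483648.0 {
            i32::MIN
        } else {
            x as i32 // in range: plain truncation toward zero
        }
    })
}
```

The two only disagree on NaN lanes and on positive overflow (including +inf); negative overflow already lands on 0x80000000 either way. As far as I can tell, that is the part the extra instructions have to patch up.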