Summary
GPU kernels currently seem unable to use non-literal global variables that are initialized from top-level expressions/conversions.
Literal globals work correctly, but globals derived from other globals or top-level computations appear as 0.
Motivation
The original trigger was math.pi32.
According to the Math module, pi64 is stored as a literal global, while pi32 is derived from it via a cast/conversion. In GPU kernels, pi64 works, but pi32 reads as zero.
That led me to reduce the problem into a smaller reproducer involving top-level globals initialized through conversions.
Problem Reproduction in Math Module
```python
import math
import gpu

def close(a: float, b: float, epsilon: float = 1e-6):
    return math.fabs(a - b) <= epsilon

@gpu.kernel
def kernel(out):
    i = gpu.thread.x
    out[i] = float(math.pi32)

out = [0.0]
kernel_args = (out,)
kernel(*kernel_args, grid=1, block=1)

print("out: ", out[0])
print("math.pi32: ", float(math.pi32))
assert close(out[0], float(math.pi32), 1e-5)
print("PASS pi32_gpu_f32")
```
Result of Math Module Reproduction
```
$ codon run pi32_gpu_f32.codon
out: 0
math.pi32: 3.14159
AssertionError: Assert failed (/home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18)
Raised from:
/home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18:1
Backtrace:
[0x77492d8d0f13] main.0 at /home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18:1
Aborted (core dumped)
```
As shown above, referencing pi32 inside the kernel yields zero.
Minimal Reproducer
```python
import math
import gpu

# Base literals.
G_F64 = 3.14159265358979323846
G_I64 = 123456789

# Derived globals initialized via top-level conversion.
G_F32_FROM_F64 = float32(G_F64)
G_F64_FROM_I64 = float(G_I64)
G_F32_FROM_I64 = float32(G_I64)

def close(a: float, b: float, epsilon: float = 1e-6):
    return math.fabs(a - b) <= epsilon

@gpu.kernel
def kernel(out_f64, out_f32_from_f64, out_f64_from_i64, out_f32_from_i64):
    i = gpu.thread.x
    out_f64[i] = G_F64
    out_f32_from_f64[i] = float(G_F32_FROM_F64)
    out_f64_from_i64[i] = G_F64_FROM_I64
    out_f32_from_i64[i] = float(G_F32_FROM_I64)

out_f64 = [0.0]
out_f32_from_f64 = [0.0]
out_f64_from_i64 = [0.0]
out_f32_from_i64 = [0.0]

kernel(out_f64, out_f32_from_f64, out_f64_from_i64, out_f32_from_i64, grid=1, block=1)

print("host G_F64 =", G_F64)
print("host G_F32_FROM_F64 =", float(G_F32_FROM_F64))
print("host G_F64_FROM_I64 =", G_F64_FROM_I64)
print("host G_F32_FROM_I64 =", float(G_F32_FROM_I64))
print("kernel out_f64 =", out_f64[0])
print("kernel out_f32_from_f64 =", out_f32_from_f64[0])
print("kernel out_f64_from_i64 =", out_f64_from_i64[0])
print("kernel out_f32_from_i64 =", out_f32_from_i64[0])

assert close(out_f64[0], G_F64)
assert close(out_f32_from_f64[0], float(G_F32_FROM_F64))
assert close(out_f64_from_i64[0], G_F64_FROM_I64)
assert close(out_f32_from_i64[0], float(G_F32_FROM_I64))
print("PASS global_cast_init_repro_nvptx")
```
Result of MRE
```
$ ./global_cast_init_repro_nvptx
host G_F64 = 3.14159
host G_F32_FROM_F64 = 3.14159
host G_F64_FROM_I64 = 1.23457e+08
host G_F32_FROM_I64 = 1.23457e+08
kernel out_f64 = 3.14159
kernel out_f32_from_f64 = 0
kernel out_f64_from_i64 = 0
kernel out_f32_from_i64 = 0
Segmentation fault (core dumped)
```
Insight
Only the literal global G_F64 is read correctly in the kernel.
Globals derived from top-level computations/conversions appear as zero.
Suspected Cause
From reading the LLVM IR generated with the --llvm option, my current understanding is:
- Codon does not necessarily materialize globals as final compile-time constants in LLVM IR.
- Instead, their values are established at runtime in main.0.
- Host/device module separation and pruning happen earlier than that.
The relevant IR shape for the globals looks like this:
```llvm
@.G_F32_FROM_F64.0 = private global float 0.000000e+00, !dbg !462
@.G_F32_FROM_I64.0 = private global float 0.000000e+00, !dbg !464
@.G_F64.0 = private global double 0.000000e+00, !dbg !466
@.G_F64_FROM_I64.0 = private global double 0.000000e+00, !dbg !468
```
and the traversal logic I was looking at is:
```cpp
void exploreGV(llvm::GlobalValue *G, llvm::SmallPtrSetImpl<llvm::GlobalValue *> &keep) {
  if (keep.contains(G))
    return;
  keep.insert(G);
  if (auto *F = llvm::dyn_cast<llvm::Function>(G)) {
    for (auto I = llvm::inst_begin(F), E = llvm::inst_end(F); I != E; ++I) {
      for (auto &U : I->operands()) {
        if (auto *G2 = llvm::dyn_cast<llvm::GlobalValue>(U.get()))
          exploreGV(G2, keep);
      }
    }
  }
}
```
My suspicion is that this fails for derived globals because the global itself is emitted as a zero-initialized storage object, while its actual value is established later through runtime initialization.
So there is no static initializer chain in the compile-time IR for exploreGV to follow back to the root dependency.
In other words, the device module sees the storage, but not the computation that defines its value.
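To make the failure mode concrete, here is a small pure-Python model of the situation (the names and structure are my own illustration, not Codon's actual IR): a global with a constant initializer carries edges back to its dependencies, while a runtime-initialized global is just zero-valued storage with no edge for the reachability pass to follow.

```python
# Toy model of an IR module: each global maps to (initializer, deps).
# A literal global has its value baked into the initializer; a
# runtime-initialized global has a zero initializer and NO recorded deps,
# because the defining computation lives in main.0, not in the initializer.
module = {
    "G_F64": (3.14159265358979, []),   # literal: value present at compile time
    "G_F32_FROM_F64": (0.0, []),       # zero storage; real value set at runtime
}

def explore(name, module, keep):
    """Mimics exploreGV: keep a global plus everything its initializer references."""
    if name in keep:
        return
    keep.add(name)
    _, deps = module[name]
    for dep in deps:
        explore(dep, module, keep)

def device_value(name, module, keep):
    """What the pruned device module 'sees' for a kept global."""
    init, _ = module[name]
    return init if name in keep else None

keep = set()
explore("G_F64", module, keep)
explore("G_F32_FROM_F64", module, keep)

print(device_value("G_F64", module, keep))           # the literal value survives
print(device_value("G_F32_FROM_F64", module, keep))  # only the stale 0.0 survives
```

Both globals are "kept" by the traversal; the derived one is simply kept with its placeholder zero, which matches the observed kernel output.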
Why I think this is a design question, not just a local bug
I think this raises a broader question about what semantics Codon wants for globals used by accelerators.
There seem to be a few possible directions:
- Freeze eligible globals as compile-time constants for device code.
  - This is how Numba CUDA handles globals in Python.
  - Similar to how some Python GPU DSLs effectively snapshot globals for kernels (I am not sure how Taichi handles this).
  - This would make top-level pure expressions/casts usable in kernels as constants.
- Require explicit host/device separation for globals.
  - Closer to C/CUDA-style models where device-visible globals are explicitly represented.
- Disallow or restrict runtime-initialized globals in accelerator kernels.
  - In that case, perhaps only literal/constant-foldable globals should be allowed in kernels, with a diagnostic.
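As a sketch of the first direction (my own illustration, not a proposed patch): a pure top-level initializer like float32(G_F64) could be evaluated during compilation and the result baked into the device module as a plain literal.

```python
import struct

# Hypothetical compile-time folding of the derived global: evaluate the
# pure initializer on the host and emit the result as a literal constant.
G_F64 = 3.14159265358979323846

def fold_to_f32(x: float) -> float:
    """Round a double to float32 precision, as float32(G_F64) would."""
    return struct.unpack("f", struct.pack("f", x))[0]

# After folding, the derived global is an ordinary literal that the
# device-side pruning can keep with its correct value.
G_F32_FROM_F64 = fold_to_f32(G_F64)
print(G_F32_FROM_F64)
```

The same folding would apply to float(G_I64) and float32(G_I64) in the reproducer, since all three initializers are pure functions of literals.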
For comparison, tools like Numba appear to treat globals more like captured/snapshotted environment values for compiled kernels rather than dynamically tracking later host-side mutation.
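For illustration, snapshot semantics amount to capturing a global's value when the kernel is compiled rather than reading it at call time. This is a minimal pure-Python sketch of that idea (my own model, not Numba's actual machinery):

```python
SCALE = 2.0

def compile_kernel(fn):
    """Sketch of snapshot semantics: bake the current global value into the
    compiled artifact, so later host-side mutation is not observed."""
    frozen_scale = SCALE  # value captured at 'compile' time
    def compiled(x):
        return fn(x, frozen_scale)
    return compiled

kernel = compile_kernel(lambda x, s: x * s)
SCALE = 100.0        # host-side mutation after compilation
print(kernel(3.0))   # 6.0: uses the snapshotted 2.0, not 100.0
```

Under this model, pi32 would simply be evaluated on the host and its value frozen into the kernel, regardless of how it was initialized.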
Question for maintainers
What direction would maintainers prefer here?
More specifically:
- Should top-level pure/derived globals (e.g. casts from other globals) be expected to work in GPU kernels?
- Is the intended model that kernel-visible globals must be compile-time constants only?
- If runtime-initialized globals are not meant to be supported on accelerators, should this become a frontend restriction / diagnostic instead of silently producing zero-initialized behavior?
As far as I can tell, there are no existing docs or issues covering this; if there are, please point me to them.
If you need any other code or details, I can share them on Discord as well. Please let me know.
Thanks as always for your great work, maintainers and contributors!