
GPU kernel cannot use non-literal global variables #781

@BI71317

Description

Summary

GPU kernels currently seem unable to use non-literal global variables that are initialized from top-level expressions/conversions.

Literal globals work correctly, but globals derived from other globals or top-level computations appear as 0.

Motivation

The original trigger was math.pi32.

In the math module, pi64 is stored as a literal global, while pi32 is derived from it via a cast/conversion. In GPU kernels, pi64 works, but pi32 reads as zero.
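
For reference, the pattern looks roughly like this (paraphrased from my reading of the math module; names approximate, not verbatim stdlib code):

pi64 = 3.14159265358979323846  # literal global -- readable in GPU kernels
pi32 = float32(pi64)           # derived via a conversion -- reads as 0 in kernels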

That led me to reduce the problem to a smaller reproducer involving top-level globals initialized through conversions.

Problem Reproduction in Math Module

import math
import gpu

def close(a: float, b: float, epsilon: float = 1e-6):
    return math.fabs(a - b) <= epsilon

@gpu.kernel
def kernel(out):
    i = gpu.thread.x
    out[i] = float(math.pi32)

out = [0.0]
kernel_args = (out,)
kernel(*kernel_args, grid=1, block=1)
print("out: ", out[0])
print("math.pi32: ", float(math.pi32))

assert close(out[0], float(math.pi32), 1e-5)

print("PASS pi32_gpu_f32")

Result of Math Module Reproduction

$ codon run pi32_gpu_f32.codon 
out:  0
math.pi32:  3.14159
AssertionError: Assert failed (/home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18)

Raised from:
/home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18:1

Backtrace:
  [0x77492d8d0f13] main.0 at /home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18:1
Aborted (core dumped)

As you can see, the reference to pi32 in the kernel reads as zero.

Minimal Reproducer

import math
import gpu

# Base literals.
G_F64 = 3.14159265358979323846
G_I64 = 123456789

# Derived globals initialized via top-level conversion.
G_F32_FROM_F64 = float32(G_F64)
G_F64_FROM_I64 = float(G_I64)
G_F32_FROM_I64 = float32(G_I64)

def close(a: float, b: float, epsilon: float = 1e-6):
    return math.fabs(a - b) <= epsilon

@gpu.kernel
def kernel(out_f64, out_f32_from_f64, out_f64_from_i64, out_f32_from_i64):
    i = gpu.thread.x
    out_f64[i] = G_F64
    out_f32_from_f64[i] = float(G_F32_FROM_F64)
    out_f64_from_i64[i] = G_F64_FROM_I64
    out_f32_from_i64[i] = float(G_F32_FROM_I64)

out_f64 = [0.0]
out_f32_from_f64 = [0.0]
out_f64_from_i64 = [0.0]
out_f32_from_i64 = [0.0]

kernel(out_f64, out_f32_from_f64, out_f64_from_i64, out_f32_from_i64, grid=1, block=1)

print("host G_F64            =", G_F64)
print("host G_F32_FROM_F64   =", float(G_F32_FROM_F64))
print("host G_F64_FROM_I64   =", G_F64_FROM_I64)
print("host G_F32_FROM_I64   =", float(G_F32_FROM_I64))

print("kernel out_f64        =", out_f64[0])
print("kernel out_f32_from_f64 =", out_f32_from_f64[0])
print("kernel out_f64_from_i64 =", out_f64_from_i64[0])
print("kernel out_f32_from_i64 =", out_f32_from_i64[0])

assert close(out_f64[0], G_F64)
assert close(out_f32_from_f64[0], float(G_F32_FROM_F64))
assert close(out_f64_from_i64[0], G_F64_FROM_I64)
assert close(out_f32_from_i64[0], float(G_F32_FROM_I64))

print("PASS global_cast_init_repro_nvptx")

Result of Minimal Reproducer

$ ./global_cast_init_repro_nvptx
host G_F64            = 3.14159
host G_F32_FROM_F64   = 3.14159
host G_F64_FROM_I64   = 1.23457e+08
host G_F32_FROM_I64   = 1.23457e+08
kernel out_f64        = 3.14159
kernel out_f32_from_f64 = 0
kernel out_f64_from_i64 = 0
kernel out_f32_from_i64 = 0
Segmentation fault (core dumped)

Insight

Only the literal global G_F64 is read correctly in the kernel.
Globals derived from top-level computations/conversions appear as zero.

Suspected Cause

From reading the LLVM IR generated by the --llvm option, my current understanding is that Codon does not necessarily materialize globals as final compile-time constants in LLVM IR.

Instead, their values are established at runtime in main.0.

Host/device module separation and pruning happen earlier.

The relevant IR shape for the globals looks like this:

@.G_F32_FROM_F64.0 = private global float 0.000000e+00, !dbg !462
@.G_F32_FROM_I64.0 = private global float 0.000000e+00, !dbg !464
@.G_F64.0 = private global double 0.000000e+00, !dbg !466
@.G_F64_FROM_I64.0 = private global double 0.000000e+00, !dbg !468
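
The actual values are stored by code that runs in main.0; schematically, it looks something like this (paraphrased and abridged, not verbatim --llvm output):

; Schematic of the runtime initialization in main.0 (paraphrased, abridged)
define void @main.0() {
entry:
  store double 0x400921FB54442D18, ptr @.G_F64.0
  %t = fptrunc double 0x400921FB54442D18 to float
  store float %t, ptr @.G_F32_FROM_F64.0
  ; ... similar stores for the remaining derived globals ...
  ret void
}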

The traversal logic I was looking at is:

void exploreGV(llvm::GlobalValue *G, llvm::SmallPtrSetImpl<llvm::GlobalValue *> &keep) {
  if (keep.contains(G))
    return;

  keep.insert(G);
  if (auto *F = llvm::dyn_cast<llvm::Function>(G)) {
    for (auto I = llvm::inst_begin(F), E = inst_end(F); I != E; ++I) {
      for (auto &U : I->operands()) {
        if (auto *G2 = llvm::dyn_cast<llvm::GlobalValue>(U.get()))
          exploreGV(G2, keep);
      }
    }
  }
}

My suspicion is that this fails for derived globals because the global itself is emitted as a zero-initialized storage object, while its actual value is established later through runtime initialization.

So there is no static initializer chain in the compile-time IR for exploreGV to follow back to the root dependency.

In other words, the device module sees the storage, but not the computation that defines its value.
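
To make this concrete: even a hypothetical extension of exploreGV that also walks static initializers (a sketch reusing the names from the snippet above, not a proposed patch) would find nothing here, because the derived globals carry only a zero initializer:

// Hypothetical extension (sketch): also follow constants reachable through
// a global variable's static initializer.
if (auto *GV = llvm::dyn_cast<llvm::GlobalVariable>(G)) {
  if (GV->hasInitializer()) {
    for (auto &U : GV->getInitializer()->operands()) {
      if (auto *G2 = llvm::dyn_cast<llvm::GlobalValue>(U.get()))
        exploreGV(G2, keep);
    }
  }
}
// For @.G_F32_FROM_F64.0 the initializer is just `float 0.0`, so this walk
// terminates immediately -- the defining computation lives in main.0 instead.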

Why I think this is a design question, not just a local bug

I think this raises a broader question about what semantics Codon wants for globals used by accelerators.

There seem to be a few possible directions:

  • Freeze eligible globals as compile-time constants for device code (this is how Numba's CUDA target handles globals; an illustration follows this list).
    • Similar to how some Python GPU DSLs effectively snapshot globals for kernels (I'm not sure how Taichi handles this).
    • This would make top-level pure expressions/casts usable in kernels as constants.
  • Require explicit host/device separation for globals
    • Closer to C/CUDA-style models where device-visible globals are explicitly represented (see the CUDA sketch after this list).
  • Disallow or restrict runtime-initialized globals in accelerator kernels
    • In that case, perhaps only literal/constant-foldable globals should be allowed in kernels, with a diagnostic.
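
For the second direction, a rough CUDA analogue for comparison (a sketch of the existing CUDA model, not a Codon proposal):

// CUDA makes device-visible globals explicit; a runtime-computed value must
// be published to the device by the host before kernels can read it.
__constant__ double g_pi = 3.141592653589793; // compile-time device constant

__device__ float g_derived; // device storage, filled by the host at runtime

// Host side:
//   float v = compute_on_host();  // hypothetical host computation
//   cudaMemcpyToSymbol(g_derived, &v, sizeof(v));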

For comparison, tools like Numba appear to treat globals more like captured/snapshotted environment values for compiled kernels rather than dynamically tracking later host-side mutation.
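
To illustrate (my understanding of Numba's behavior, not an authoritative reference):

# Numba CUDA illustration (behavior as I understand it): module-level globals
# are captured as constants when the kernel is first compiled.
from numba import cuda
import numpy as np

G = 3.14

@cuda.jit
def kernel(out):
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = G  # G is baked into the compiled kernel

out = np.zeros(1)
kernel[1, 1](out)
print(out[0])  # 3.14

G = 2.71           # later host-side mutation...
kernel[1, 1](out)
print(out[0])  # ...is not observed: still 3.14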

Question for maintainers

What direction would maintainers prefer here?
More specifically:

  • Should top-level pure/derived globals (e.g. casts from other globals) be expected to work in GPU kernels?
  • Is the intended model that kernel-visible globals must be compile-time constants only?
  • If runtime-initialized globals are not meant to be supported on accelerators, should this become a frontend restriction / diagnostic instead of silently producing zero-initialized behavior?

I couldn't find any existing docs or issues covering this behavior; if I missed something, please point me to it.
If any additional code references or details would help, I'm happy to share them (on Discord as well). Please let me know.

Thanks as always for your great work, maintainers and contributors!
