
GPU kernel cannot use non-literal global variables #781

@BI71317

Description

Summary

GPU kernels currently seem unable to use non-literal global variables that are initialized from top-level expressions/conversions.

Literal globals work correctly, but globals derived from other globals or top-level computations appear as 0.

Motivation

The original trigger was math.pi32.

In the math module, pi64 is stored as a literal global, while pi32 is derived from it via a cast/conversion. In GPU kernels, pi64 works, but pi32 reads as zero.
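
For reference, the pattern looks roughly like this (paraphrased from my reading of the math module; names approximate, not verbatim stdlib code):

pi64 = 3.14159265358979323846  # literal global -- readable in GPU kernels
pi32 = float32(pi64)           # derived via a conversion -- reads as 0 in kernels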

That led me to reduce the problem to a smaller reproducer involving top-level globals initialized through conversions.

Problem Reproduction in Math Module

import math
import gpu

def close(a: float, b: float, epsilon: float = 1e-6):
    return math.fabs(a - b) <= epsilon

@gpu.kernel
def kernel(out):
    i = gpu.thread.x
    out[i] = float(math.pi32)

out = [0.0]
kernel_args = (out,)
kernel(*kernel_args, grid=1, block=1)
print("out: ", out[0])
print("math.pi32: ", float(math.pi32))

assert close(out[0], float(math.pi32), 1e-5)

print("PASS pi32_gpu_f32")

Result of Math Module Reproduction

$ codon run pi32_gpu_f32.codon 
out:  0
math.pi32:  3.14159
AssertionError: Assert failed (/home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18)

Raised from:
/home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18:1

Backtrace:
  [0x77492d8d0f13] main.0 at /home/swchoi/src/test_code/codon_math_matrix/generated/cases/nvptx/pi32_gpu_f32.codon:18:1
Aborted (core dumped)

As you can see, the reference to pi32 in the kernel reads as zero.

Minimal Reproducer

import math
import gpu

# Base literals.
G_F64 = 3.14159265358979323846
G_I64 = 123456789

# Derived globals initialized via top-level conversion.
G_F32_FROM_F64 = float32(G_F64)
G_F64_FROM_I64 = float(G_I64)
G_F32_FROM_I64 = float32(G_I64)

def close(a: float, b: float, epsilon: float = 1e-6):
    return math.fabs(a - b) <= epsilon

@gpu.kernel
def kernel(out_f64, out_f32_from_f64, out_f64_from_i64, out_f32_from_i64):
    i = gpu.thread.x
    out_f64[i] = G_F64
    out_f32_from_f64[i] = float(G_F32_FROM_F64)
    out_f64_from_i64[i] = G_F64_FROM_I64
    out_f32_from_i64[i] = float(G_F32_FROM_I64)

out_f64 = [0.0]
out_f32_from_f64 = [0.0]
out_f64_from_i64 = [0.0]
out_f32_from_i64 = [0.0]

kernel(out_f64, out_f32_from_f64, out_f64_from_i64, out_f32_from_i64, grid=1, block=1)

print("host G_F64            =", G_F64)
print("host G_F32_FROM_F64   =", float(G_F32_FROM_F64))
print("host G_F64_FROM_I64   =", G_F64_FROM_I64)
print("host G_F32_FROM_I64   =", float(G_F32_FROM_I64))

print("kernel out_f64        =", out_f64[0])
print("kernel out_f32_from_f64 =", out_f32_from_f64[0])
print("kernel out_f64_from_i64 =", out_f64_from_i64[0])
print("kernel out_f32_from_i64 =", out_f32_from_i64[0])

assert close(out_f64[0], G_F64)
assert close(out_f32_from_f64[0], float(G_F32_FROM_F64))
assert close(out_f64_from_i64[0], G_F64_FROM_I64)
assert close(out_f32_from_i64[0], float(G_F32_FROM_I64))

print("PASS global_cast_init_repro_nvptx")

Result of Minimal Reproducer

$ ./global_cast_init_repro_nvptx
host G_F64            = 3.14159
host G_F32_FROM_F64   = 3.14159
host G_F64_FROM_I64   = 1.23457e+08
host G_F32_FROM_I64   = 1.23457e+08
kernel out_f64        = 3.14159
kernel out_f32_from_f64 = 0
kernel out_f64_from_i64 = 0
kernel out_f32_from_i64 = 0
Segmentation fault (core dumped)

Insight

Only the literal global G_F64 is read correctly in the kernel.
Globals derived from top-level computations/conversions appear as zero.

Suspected Cause

From reading the LLVM IR generated by the --llvm option, my current understanding is that Codon does not necessarily materialize globals as final compile-time constants in LLVM IR.

Instead, their values are established at runtime in main.0.

Host/device module separation and pruning happen earlier.

The relevant IR shape for the globals looks like this:

@.G_F32_FROM_F64.0 = private global float 0.000000e+00, !dbg !462
@.G_F32_FROM_I64.0 = private global float 0.000000e+00, !dbg !464
@.G_F64.0 = private global double 0.000000e+00, !dbg !466
@.G_F64_FROM_I64.0 = private global double 0.000000e+00, !dbg !468
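
The actual values are stored by code that runs in main.0; schematically, it looks something like this (paraphrased and abridged, not verbatim --llvm output):

; Schematic of the runtime initialization in main.0 (paraphrased, abridged)
define void @main.0() {
entry:
  store double 0x400921FB54442D18, ptr @.G_F64.0
  %t = fptrunc double 0x400921FB54442D18 to float
  store float %t, ptr @.G_F32_FROM_F64.0
  ; ... similar stores for the remaining derived globals ...
  ret void
}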

The traversal logic I was looking at is:

void exploreGV(llvm::GlobalValue *G, llvm::SmallPtrSetImpl<llvm::GlobalValue *> &keep) {
  if (keep.contains(G))
    return;

  keep.insert(G);
  if (auto *F = llvm::dyn_cast<llvm::Function>(G)) {
    for (auto I = llvm::inst_begin(F), E = inst_end(F); I != E; ++I) {
      for (auto &U : I->operands()) {
        if (auto *G2 = llvm::dyn_cast<llvm::GlobalValue>(U.get()))
          exploreGV(G2, keep);
      }
    }
  }
}

My suspicion is that this fails for derived globals because the global itself is emitted as a zero-initialized storage object, while its actual value is established later through runtime initialization.

So there is no static initializer chain in the compile-time IR for exploreGV to follow back to the root dependency.

In other words, the device module sees the storage, but not the computation that defines its value.
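
To make this concrete: even a hypothetical extension of exploreGV that also walks static initializers (a sketch reusing the names from the snippet above, not a proposed patch) would find nothing here, because the derived globals carry only a zero initializer:

// Hypothetical extension (sketch): also follow constants reachable through
// a global variable's static initializer.
if (auto *GV = llvm::dyn_cast<llvm::GlobalVariable>(G)) {
  if (GV->hasInitializer()) {
    for (auto &U : GV->getInitializer()->operands()) {
      if (auto *G2 = llvm::dyn_cast<llvm::GlobalValue>(U.get()))
        exploreGV(G2, keep);
    }
  }
}
// For @.G_F32_FROM_F64.0 the initializer is just `float 0.0`, so this walk
// terminates immediately -- the defining computation lives in main.0 instead.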

Why I think this is a design question, not just a local bug

I think this raises a broader question about what semantics Codon wants for globals used by accelerators.

There seem to be a few possible directions:

  • Freeze eligible globals as compile-time constants for device code (this is how Numba's CUDA target handles globals; an illustration follows this list).
    • Similar to how some Python GPU DSLs effectively snapshot globals for kernels (I'm not sure how Taichi handles this).
    • This would make top-level pure expressions/casts usable in kernels as constants.
  • Require explicit host/device separation for globals
    • Closer to C/CUDA-style models where device-visible globals are explicitly represented (see the CUDA sketch after this list).
  • Disallow or restrict runtime-initialized globals in accelerator kernels
    • In that case, perhaps only literal/constant-foldable globals should be allowed in kernels, with a diagnostic.
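
For the second direction, a rough CUDA analogue for comparison (a sketch of the existing CUDA model, not a Codon proposal):

// CUDA makes device-visible globals explicit; a runtime-computed value must
// be published to the device by the host before kernels can read it.
__constant__ double g_pi = 3.141592653589793; // compile-time device constant

__device__ float g_derived; // device storage, filled by the host at runtime

// Host side:
//   float v = compute_on_host();  // hypothetical host computation
//   cudaMemcpyToSymbol(g_derived, &v, sizeof(v));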

For comparison, tools like Numba appear to treat globals more like captured/snapshotted environment values for compiled kernels rather than dynamically tracking later host-side mutation.
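
To illustrate (my understanding of Numba's behavior, not an authoritative reference):

# Numba CUDA illustration (behavior as I understand it): module-level globals
# are captured as constants when the kernel is first compiled.
from numba import cuda
import numpy as np

G = 3.14

@cuda.jit
def kernel(out):
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = G  # G is baked into the compiled kernel

out = np.zeros(1)
kernel[1, 1](out)
print(out[0])  # 3.14

G = 2.71           # later host-side mutation...
kernel[1, 1](out)
print(out[0])  # ...is not observed: still 3.14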

Question for maintainers

What direction would maintainers prefer here?
More specifically:

  • Should top-level pure/derived globals (e.g. casts from other globals) be expected to work in GPU kernels?
  • Is the intended model that kernel-visible globals must be compile-time constants only?
  • If runtime-initialized globals are not meant to be supported on accelerators, should this become a frontend restriction / diagnostic instead of silently producing zero-initialized behavior?

I couldn't find any existing docs or issues covering this behavior; if I missed something, please point me to it.
If any additional code references or details would help, I'm happy to share them (on Discord as well). Please let me know.

Thanks as always for your great work, maintainers and contributors!
