Fix: Separate CPU and GPU LLVM optimization pipelines before GPU lowering by BI71317 · Pull Request #793 · exaloop/codon

BI71317 · 2026-04-20T07:13:26Z

Resolves the problem reported in #792.

What this PR Fixes

This PR changes the pipeline structure so that CPU and GPU modules are separated before their respective optimization flows are applied.

2-pass Optimization

As with the CPU module, the GPU module now goes through two optimization passes.

This is not only for performance.

When only a single optimization pass is run, the generated PTX looks like this:


.version 4.2
.target sm_30
.address_size 64

	// .globl	std_numpy_indexing__getset_0_0_std_numpy_ndarray_ndarray_0_float32_1__Tuple_std_internal_types_slice_Slice_0_int_int_int___std_numpy_ndarray_ndarray_0_float32_1__float32__3860
.extern .func  (.param .b64 func_retval0) malloc
(
	.param .b64 malloc_param_0
)
;
.visible .func std_numpy_util_multirange_0_0_Tuple_int___3272_resume
(
	.param .b64 std_numpy_util_multirange_0_0_Tuple_int___3272_resume_param_0
)
;
.extern .global .align 1 .b8 _str_209[2];
.extern .global .align 1 .b8 _str_210[2];
.extern .global .align 1 .b8 _str_211[2];
.extern .global .align 1 .b8 _str_212[2];
.extern .global .align 1 .b8 _str_223[2];
.extern .global .align 1 .b8 _str_224[2];
.extern .global .align 1 .b8 _str_225[2];
.extern .global .align 1 .b8 _str_226[2];
.extern .global .align 1 .b8 _str_272[2];
.extern .global .align 1 .b8 _str_273[2];
.extern .global .align 1 .b8 _str_274[2];
.extern .global .align 1 .b8 _str_275[2];
.extern .global .align 1 .b8 _str_276[2];
.extern .global .align 1 .b8 _str_277[2];
.extern .global .align 1 .b8 _str_278[2];
.extern .global .align 1 .b8 _str_279[2];
.extern .global .align 1 .b8 _str_280[2];
.extern .global .align 1 .b8 _str_313[2];
.extern .global .align 1 .b8 _str_318[2];
...

Global variables (GVs) that were not inlined are retained as .extern .global declarations, and these seem to produce invalid PTX.
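To check whether a PTX dump still carries leftover global declarations like the ones in the single-pass output above, a small scan can help. The following is a minimal sketch in Python and is not part of this PR; the function name extern_globals and the regular expression are assumptions based only on the declaration shape shown in the dump.

```python
import re

# Matches one-line PTX declarations shaped like:
#   .extern .global .align 1 .b8 _str_209[2];
# The pattern is an assumption derived from the dump above, not a full PTX parser.
DECL = re.compile(r"\.extern\s+\.global\b.*?\s(\w+)(?:\[\d*\])?\s*;$")


def extern_globals(ptx: str) -> list[str]:
    """Return the names of all `.extern .global` variables declared in `ptx`."""
    names = []
    for line in ptx.splitlines():
        match = DECL.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    fragment = (
        ".extern .func  (.param .b64 func_retval0) malloc\n"
        "(\n"
        "\t.param .b64 malloc_param_0\n"
        ")\n"
        ";\n"
        ".extern .global .align 1 .b8 _str_209[2];\n"
        ".extern .global .align 1 .b8 _str_210[2];\n"
    )
    # The .extern .func declaration is ignored; only globals are reported.
    print(extern_globals(fragment))
```

An empty result for the GPU module's PTX would indicate that no such declarations remain, which is the situation shown in the 2-pass output.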

2-pass

.version 4.2
.target sm_30
.address_size 64

	// .globl	exp_kernel_naver_02_0_0_std_numpy_ndarray_ndarray_0_float32_1__std_numpy_ndarray_ndarray_0_float32_1__
.extern .func  (.param .b64 func_retval0) malloc
(
	.param .b64 malloc_param_0
)
;
...

The second optimization pass appears to inline the GVs, and in this case the resulting PTX works correctly.
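One possible explanation, not verified against the actual LLVM pipeline, is a phase-ordering effect: if the step that removes unused globals runs before the step that eliminates their last uses, a single run leaves the globals behind and only a second run cleans them up. The toy model below illustrates that general effect in Python. ToyIR and both passes are hypothetical and do not correspond to real LLVM passes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIR:
    """Hypothetical miniature IR: constant globals plus functions that use them."""
    # global name -> constant value
    globals: dict[str, str] = field(default_factory=dict)
    # function name -> operands; an operand starting with "@" refers to a global
    functions: dict[str, list[str]] = field(default_factory=dict)


def drop_unused_globals(ir: ToyIR) -> None:
    """Remove every global that no function operand refers to."""
    used = {op[1:] for ops in ir.functions.values() for op in ops if op.startswith("@")}
    ir.globals = {name: value for name, value in ir.globals.items() if name in used}


def fold_global_loads(ir: ToyIR) -> None:
    """Replace references to known globals with their constant values."""
    for name, ops in ir.functions.items():
        ir.functions[name] = [
            ir.globals[op[1:]] if op.startswith("@") and op[1:] in ir.globals else op
            for op in ops
        ]


def run_pipeline(ir: ToyIR) -> None:
    """Fixed pass order: cleanup runs before folding (assumed for illustration)."""
    drop_unused_globals(ir)
    fold_global_loads(ir)


if __name__ == "__main__":
    ir = ToyIR(globals={"_str_209": '"a"'}, functions={"kernel": ["@_str_209"]})
    run_pipeline(ir)
    print(sorted(ir.globals))  # the global is still declared after one run
    run_pipeline(ir)
    print(sorted(ir.globals))  # the second run removes the now-unused global
```

Whether this is what actually happens in the GPU pipeline here would need to be confirmed by inspecting the IR between the two runs.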

ApplyGPUTransformation

ApplyGPUTransformation does a lot of work.

It clones the module, separates the CPU and GPU modules, runs the NVPTX passes, cleans up the CPU module, and applies patches.

Because all of these jobs live in a single function, I think splitting its logic was unavoidable in order to optimize each module on its own.
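The intended ordering can be sketched roughly as follows. This is a toy Python model, not Codon's C++ code: ToyModule, split_modules, optimize, lower_gpu, patch_cpu_module, and compile_split are hypothetical names, and every body is a placeholder standing in for the real LLVM work. In particular, the PR does not describe what the patch step does, so it is modeled here only as attaching the generated PTX to the CPU module.

```python
from dataclasses import dataclass


@dataclass
class ToyModule:
    """Hypothetical stand-in for an LLVM module."""
    name: str
    functions: dict[str, str]  # function name -> "cpu" or "gpu"
    opt_runs: int = 0          # how many times an optimization pipeline ran
    embedded_ptx: str = ""     # filled in on the CPU module by the patch step


def split_modules(original: ToyModule) -> tuple[ToyModule, ToyModule]:
    """Build two copies of the input, keeping only CPU or GPU functions in each."""
    cpu = ToyModule(original.name + ".cpu",
                    {f: t for f, t in original.functions.items() if t == "cpu"})
    gpu = ToyModule(original.name + ".gpu",
                    {f: t for f, t in original.functions.items() if t == "gpu"})
    return cpu, gpu


def optimize(module: ToyModule, runs: int = 2) -> None:
    """Placeholder for running a target-specific pipeline `runs` times."""
    module.opt_runs += runs


def lower_gpu(gpu: ToyModule) -> str:
    """Placeholder for NVPTX code generation."""
    return "// PTX for: " + ", ".join(sorted(gpu.functions))


def patch_cpu_module(cpu: ToyModule, ptx: str) -> None:
    """Placeholder for the final patch step on the CPU side (assumed behavior)."""
    cpu.embedded_ptx = ptx


def compile_split(original: ToyModule) -> tuple[ToyModule, ToyModule]:
    cpu, gpu = split_modules(original)  # 1. separate the modules first
    optimize(cpu)                       # 2. each module gets its own pipeline
    optimize(gpu)
    ptx = lower_gpu(gpu)                # 3. lower the already-optimized GPU module
    patch_cpu_module(cpu, ptx)          # 4. patch the CPU module last
    return cpu, gpu
```

The point of the sketch is only the order of the steps: because separation happens before optimization, each module can be given its own pipeline and its own number of runs.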

BI71317 added 3 commits April 17, 2026 16:37

seperating GPU module from CPU module Optimization

a349b8f

apply optimization for gpu module

fe5136c

apply opt twice

3f112b2

BI71317 requested a review from arshajii as a code owner April 20, 2026 07:13

cla-bot Bot added the cla-signed label Apr 20, 2026

BI71317 wants to merge 3 commits into exaloop:develop from BI71317:illegal-intrinsic
