
Fix: Separate CPU and GPU LLVM optimization pipelines before GPU lowering #793

Open
BI71317 wants to merge 3 commits into exaloop:develop from BI71317:illegal-intrinsic

BI71317 commented Apr 20, 2026

Resolves #792.

What this PR fixes

This PR changes the pipeline structure so that CPU and GPU modules are separated before their respective optimization flows are applied.

2-pass optimization

Like the CPU module, the GPU module now goes through 2-pass optimization.

This is not only for performance.

When only 1-pass optimization is applied, the generated PTX looks like this:


.version 4.2
.target sm_30
.address_size 64

	// .globl	std_numpy_indexing__getset_0_0_std_numpy_ndarray_ndarray_0_float32_1__Tuple_std_internal_types_slice_Slice_0_int_int_int___std_numpy_ndarray_ndarray_0_float32_1__float32__3860
.extern .func  (.param .b64 func_retval0) malloc
(
	.param .b64 malloc_param_0
)
;
.visible .func std_numpy_util_multirange_0_0_Tuple_int___3272_resume
(
	.param .b64 std_numpy_util_multirange_0_0_Tuple_int___3272_resume_param_0
)
;
.extern .global .align 1 .b8 _str_209[2];
.extern .global .align 1 .b8 _str_210[2];
.extern .global .align 1 .b8 _str_211[2];
.extern .global .align 1 .b8 _str_212[2];
.extern .global .align 1 .b8 _str_223[2];
.extern .global .align 1 .b8 _str_224[2];
.extern .global .align 1 .b8 _str_225[2];
.extern .global .align 1 .b8 _str_226[2];
.extern .global .align 1 .b8 _str_272[2];
.extern .global .align 1 .b8 _str_273[2];
.extern .global .align 1 .b8 _str_274[2];
.extern .global .align 1 .b8 _str_275[2];
.extern .global .align 1 .b8 _str_276[2];
.extern .global .align 1 .b8 _str_277[2];
.extern .global .align 1 .b8 _str_278[2];
.extern .global .align 1 .b8 _str_279[2];
.extern .global .align 1 .b8 _str_280[2];
.extern .global .align 1 .b8 _str_313[2];
.extern .global .align 1 .b8 _str_318[2];
...

Un-inlined global variables (GVs) are retained, and these seem to result in invalid PTX.

With 2-pass optimization:

.version 4.2
.target sm_30
.address_size 64

	// .globl	exp_kernel_naver_02_0_0_std_numpy_ndarray_ndarray_0_float32_1__std_numpy_ndarray_ndarray_0_float32_1__
.extern .func  (.param .b64 func_retval0) malloc
(
	.param .b64 malloc_param_0
)
;
...

The 2-pass optimization seems to inline the GVs, and this case works correctly.

ApplyGPUTransformation

ApplyGPUTransformation does a lot of work.

It clones the module, separates the CPU and GPU modules, runs the NVPTX passes, cleans up the CPU module, and applies patches.

All of these jobs live in a single function, so I think splitting up its logic was inevitable in order to optimize each module separately.
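
The ordering described above can be sketched roughly as follows. This is a toy model, not the code in this PR: `ToyModule`, `SplitModules`, `splitCpuAndGpu`, `optimize`, `lowerToPtx`, `cleanupCpu`, and `runPipeline` are all hypothetical names, no LLVM API is called, and the position of the CPU cleanup relative to the other stages is an assumption. It only records which stage touches which module, to show the split happening before each module gets its own two optimization passes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an LLVM module: it only records applied stages.
struct ToyModule {
  std::string name;
  std::vector<std::string> log;
};

// Result of separating the combined module (hypothetical type).
struct SplitModules {
  ToyModule cpu;
  ToyModule gpu;
};

// Clone the combined module into a CPU copy and a GPU copy.
SplitModules splitCpuAndGpu(const ToyModule &combined) {
  SplitModules out{{combined.name + ".cpu", combined.log},
                   {combined.name + ".gpu", combined.log}};
  out.cpu.log.push_back("split");
  out.gpu.log.push_back("split");
  return out;
}

// Run the optimization pipeline `passes` times on one module.
void optimize(ToyModule &m, int passes) {
  assert(passes > 0);
  for (int i = 1; i <= passes; ++i)
    m.log.push_back("opt#" + std::to_string(i));
}

// Placeholders for the remaining stages named in the description.
void lowerToPtx(ToyModule &gpu) { gpu.log.push_back("nvptx-lower"); }
void cleanupCpu(ToyModule &cpu) { cpu.log.push_back("cpu-cleanup"); }

// Assumed ordering: split first, optimize each module twice, then lower.
SplitModules runPipeline(const ToyModule &combined) {
  SplitModules mods = splitCpuAndGpu(combined);
  optimize(mods.cpu, 2); // CPU module: 2-pass optimization
  optimize(mods.gpu, 2); // GPU module: 2-pass optimization as well
  lowerToPtx(mods.gpu);  // GPU lowering only after the GPU passes
  cleanupCpu(mods.cpu);
  return mods;
}
```

Keeping each stage as its own function is what lets the optimization step sit between the split and the lowering, which a single monolithic function does not allow.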

BI71317 requested a review from arshajii as a code owner April 20, 2026 07:13
cla-bot added the cla-signed label Apr 20, 2026