Expert Parallelism: common C API + NCCL EP backend by phu0ngng · Pull Request #3034 · NVIDIA/TransformerEngine

phu0ngng · 2026-05-22T02:42:51Z

Summary

First PR in the TE Expert Parallelism (EP) series. Lands the common C API and NCCL EP backend that later framework PRs (PyTorch, JAX) build on. No Python bindings yet — common-lib foundation plus build wiring only. Build/load works on any arch; SM and NCCL version gates fire at runtime.

Every network-bound payload tensor takes an optional NVTECommWindow. When a window is provided, the backend uses NCCL EP's symmetric-memory zero-copy path, which skips the device-to-device (D2D) memcpy from the user buffers into the symmetric staging buffers.

Implementation

Public C API (`transformer_engine/common/include/transformer_engine/{ep.h,comm_window.h}`)

Types: NVTEEpGroupConfig, NVTEEpLayerConfig, NVTEEpHandle, NVTECommWindow (side-band {ncclWindow_t window, size_t offset}; NCCL peer handles are not carried on NVTETensor).

Lifecycle (host-only, eager):

void     nvte_ep_initialize(void* ep_comm, NVTEEpGroupConfig group_config);
void     nvte_ep_shutdown(void);

uint64_t nvte_ep_register_layer(NVTEEpLayerConfig layer_config, size_t* handle_mem_size);

nvte_ep_initialize — borrow an external ncclComm_t for the EP sub-group and init the singleton backend.
nvte_ep_shutdown — tear down the backend; idempotent; does not destroy ep_comm.
nvte_ep_register_layer — reserve a handle_id for a layer config and report the handle_mem buffer size the caller must allocate. The pair {id, mem} becomes the per-step NVTEEpHandle.

Per-step (allocation-free, CUDA-graph capturable):

void nvte_ep_prepare(NVTEEpHandle handle, NVTETensor topk_idx, NVTETensor token_counts,
                     size_t dispatch_output_per_expert_alignment, cudaStream_t stream);

void nvte_ep_dispatch(NVTEEpHandle handle, NVTETensor topk_idx,
                      NVTETensor tokens, NVTECommWindow tokens_win,
                      NVTETensor topk_weights, NVTECommWindow topk_weights_win,
                      NVTETensor recv_tokens, NVTECommWindow recv_tokens_win,
                      NVTETensor recv_topk_weights,  NVTECommWindow recv_topk_weights_win,
                      cudaStream_t stream);

void nvte_ep_combine(NVTEEpHandle handle, NVTETensor expert_out, NVTECommWindow expert_out_win,
                     NVTETensor result, cudaStream_t stream);

void nvte_ep_dispatch_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                          NVTETensor g_recv_topk_weights, NVTECommWindow g_recv_topk_weights_win,
                          NVTETensor grad_tokens, NVTETensor grad_topk_weights, cudaStream_t stream);

void nvte_ep_combine_bwd(NVTEEpHandle handle, NVTETensor grad, NVTECommWindow grad_win,
                         NVTETensor grad_expert_out, NVTECommWindow grad_expert_out_win,
                         cudaStream_t stream);

nvte_ep_prepare — all-gather the routing map and write routing maps to handle.mem.
nvte_ep_dispatch — scatter tokens and routing weights from source ranks to expert ranks. tokens, topk_weights, recv_tokens, recv_topk_weights each accept an optional symm-mem window for zero-copy.
nvte_ep_combine — scatter-sum expert outputs back to source ranks (unweighted; caller pre-multiplies by recv_topk_weights). expert_out accepts a window.
nvte_ep_dispatch_bwd — backward of dispatch; routes token and weight grads back to source. grad and g_recv_topk_weights accept windows; the gathered outputs (grad_tokens, grad_topk_weights) take no window argument.
nvte_ep_combine_bwd — backward of combine; grad and grad_expert_out accept windows. Padded slots in grad_expert_out are zeroed.
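
The unweighted contract of nvte_ep_combine can be illustrated with a host-side reference. This is a minimal sketch, assuming a single rank and plain float vectors: `weight_expert_out`, `combine_reference`, and the `src_token` index array are hypothetical names, standing in for routing state that the real backend keeps in handle.mem.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference; none of these names exist in TE.
// Each received slot holds one expert's output row for one (token, expert) pair.

// Caller-side step: scale every received row by its routing weight.
void weight_expert_out(std::vector<float>& expert_out,
                       const std::vector<float>& recv_topk_weights,
                       std::size_t hidden_dim) {
  assert(expert_out.size() == recv_topk_weights.size() * hidden_dim);
  for (std::size_t slot = 0; slot < recv_topk_weights.size(); ++slot)
    for (std::size_t h = 0; h < hidden_dim; ++h)
      expert_out[slot * hidden_dim + h] *= recv_topk_weights[slot];
}

// Combine step: plain (unweighted) sum of each slot into its source token row.
std::vector<float> combine_reference(const std::vector<float>& expert_out,
                                     const std::vector<std::size_t>& src_token,
                                     std::size_t num_tokens, std::size_t hidden_dim) {
  std::vector<float> result(num_tokens * hidden_dim, 0.0f);
  for (std::size_t slot = 0; slot < src_token.size(); ++slot)
    for (std::size_t h = 0; h < hidden_dim; ++h)
      result[src_token[slot] * hidden_dim + h] += expert_out[slot * hidden_dim + h];
  return result;
}
```

Calling `weight_expert_out` before `combine_reference` yields the weighted mixture; skipping it yields a plain sum over the top_k expert outputs, which is what the combine op alone is described as producing.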

Backend + build

NCCL EP backend (transformer_engine/common/ep/): EPBackend singleton, HT-mode dispatch/combine over NCCL EP (libnccl_ep.so), group/layer registration. Internal helper make_payload_tensor() builds the per-call ncclEpTensor_t: when the caller's NVTECommWindow.window != nullptr it sets win_hdl + win_offset (zero-copy); otherwise it sets data from nvte_tensor_data(t) (HBM fallback).
Runtime gates (in EPBackend::initialize): SM>=90 (via cudaDeviceGetAttribute), NCCL>=2.30.4 (via ncclGetVersion), CUDA multicast/NVLS support.
Stub path: when NVTE_WITH_NCCL_EP=OFF, ep/ep_api_stub.cpp provides throwing nvte_ep_* stubs so framework bindings link unconditionally; failure surfaces at first nvte_ep_initialize.
Build wiring
- setup.py builds libnccl_ep.so from 3rdparty/nccl by default; auto-disables NCCL EP when no requested CUDA arch >= 90. Explicit NVTE_BUILD_WITH_NCCL_EP=1 with all archs < 90 is treated as a user error; set NVTE_BUILD_WITH_NCCL_EP=0 to opt out.
- NCCL_HOME resolved dynamically: explicit env → /opt/nvidia/nccl, /usr/local/nccl, /usr → ldconfig -p fallback.
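
The window-or-pointer selection attributed to make_payload_tensor() above can be sketched without NCCL. The types below are stand-ins invented for illustration (`FakeWindow`, `CommWindow`, `Payload`, `make_payload`); only the field names win_hdl, win_offset, and data come from the description, and leaving data null on the zero-copy path is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in types so the sketch compiles without NCCL; the real fields live on
// ncclEpTensor_t and NVTECommWindow.
struct FakeWindow {};            // stand-in for the object behind ncclWindow_t
struct CommWindow {              // mirrors {window, offset}
  FakeWindow* window = nullptr;
  std::size_t offset = 0;
};
struct Payload {                 // stand-in for ncclEpTensor_t
  void* data = nullptr;
  FakeWindow* win_hdl = nullptr;
  std::size_t win_offset = 0;
};

// Picks the zero-copy path when a window is supplied, else the plain pointer.
Payload make_payload(void* tensor_data, const CommWindow& win) {
  Payload p;
  if (win.window != nullptr) {
    p.win_hdl = win.window;      // zero-copy: address the registered window
    p.win_offset = win.offset;
  } else {
    p.data = tensor_data;        // fallback: ordinary device pointer
  }
  return p;
}
```

A default-constructed `CommWindow` therefore behaves as "no window", which matches the optional nature of the window arguments in the per-step signatures.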

Testing

C++ distributed tests under tests/cpp_distributed/.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-22T02:48:16Z

Greptile Summary

This PR lands the foundational Expert Parallelism (EP) layer for TransformerEngine: a common C API (ep.h, comm_window.h) and a NCCL EP backend singleton that handles group/layer registration and the full forward/backward dispatch-combine cycle. No Python bindings are included yet; the change is intended as the base that PyTorch and JAX PRs will build on.

New C API (nvte_ep_initialize / nvte_ep_shutdown / nvte_ep_register_layer + per-step ops): thin wrappers over an EPBackend Meyers singleton that owns the ncclEpGroup_t and a handle-id cache; all per-step ops are allocation-free and designed to be CUDA-graph-capturable.
Build wiring: setup.py adds _discover_nccl_home / build_nccl_ep_submodule to drive the 3rdparty/nccl submodule build; auto-disables NCCL EP when no arch ≥ 90 is targeted; stub path (ep_api_stub.cpp) provides throwing symbols when NCCL EP is off.
Tests: new tests/cpp_distributed/ suite with a multi-process bash harness that spawns one process per GPU and exchanges ncclUniqueId via a shared temp file.

Confidence Score: 4/5

Safe to merge with one build issue addressed: the public comm_window.h header pulls in <nccl.h> and exposes ncclWindow_t, which causes the stub (off) build path to fail compilation on systems with NCCL < 2.30.

The comm_window.h public header includes <nccl.h> and uses ncclWindow_t. When NVTE_WITH_NCCL_EP=OFF, CMake adds no NCCL EP include dirs, yet ep_api_stub.cpp transitively includes that header. On a machine with NCCL 2.18 (no ncclWindow_t), the stub build — the fallback for pre-EP systems — fails to compile.

transformer_engine/common/include/transformer_engine/comm_window.h and transformer_engine/common/ep/ep_api_stub.cpp need attention: the public header's unconditional ncclWindow_t dependency breaks the stub build path on pre-2.30 NCCL systems.
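
One possible way to decouple the public header from `<nccl.h>` is to store the window as an opaque pointer. This is only a sketch of that idea, not what the PR does: `OpaqueCommWindow` and `has_window` are hypothetical names, and the approach assumes ncclWindow_t is a pointer-sized handle (the PR description compares it against nullptr).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical variant of the public struct: the window is stored as an opaque
// pointer, so this header would not need <nccl.h>.
typedef struct {
  void* window;        // would hold an ncclWindow_t in EP-enabled builds
  std::size_t offset;  // offset within the window
} OpaqueCommWindow;

// Only backend code that really includes <nccl.h> would cast back, e.g.
//   static_assert(sizeof(ncclWindow_t) == sizeof(void*), "pointer-sized handle");
//   ncclWindow_t w = static_cast<ncclWindow_t>(win.window);

inline bool has_window(const OpaqueCommWindow& win) { return win.window != nullptr; }
```

With this layout the stub translation unit would compile against any NCCL version, or none, while the real backend keeps type safety at the single cast site.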

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/include/transformer_engine/comm_window.h | New public C header exposing NVTECommWindow. Unconditionally includes `<nccl.h>` and uses `ncclWindow_t`, breaking stub builds on pre-NCCL-2.30 systems, which is the exact scenario the stub path is meant to serve. |
| transformer_engine/common/ep/ep_backend.cpp | Core EP singleton backend: group creation, layer registration, and all per-step ops (prepare/dispatch/combine and their backwards). Mutex-protected operations hold the lock across NCCL EP calls. The already-reported `max_token_bytes` hardcoding and `ncclEpHandleConfig_t` init asymmetry are notable concerns. |
| transformer_engine/common/ep/ep_api_stub.cpp | Throwing stubs for NVTE_WITH_NCCL_EP=OFF builds. Compiles correctly only if the system NCCL headers include `ncclWindow_t` (NCCL >= 2.30), which may not hold on older-NCCL systems where this stub path is intended to be used. |
| transformer_engine/common/include/transformer_engine/ep.h | New public C API header for Expert Parallelism: lifecycle, registration, and per-step ops. Well-documented with clear in/out annotations; inherits the nccl.h exposure issue from comm_window.h. |
| setup.py | Adds NCCL EP detection, arch-gating, and build orchestration. `_discover_nccl_home` and `build_nccl_ep_submodule` are well-structured. `libnccl_ep.so` rebuild is skipped if the file already exists, which won't detect submodule updates. |
| transformer_engine/common/CMakeLists.txt | Adds NCCL EP CMake wiring: header/lib discovery, rpath embed, and conditional stub vs. real backend source selection. Includes a runtime-diagnosed NCCL version log. Looks correct. |
| tests/cpp_distributed/test_ep_common.h | Shared test infrastructure: process bootstrap, RAII tensor/buffer helpers, and uid-file-based ncclUniqueId exchange. Default uid path is rank-specific (deadlocks without --uid-file), but run_test_ep.sh always provides the flag. |
| tests/cpp_distributed/run_test_ep.sh | Multi-process test harness: spawns one process per GPU, exchanges a shared UID file, collects logs, and enforces per-rank timeouts. SM < 90 skip logic is correct. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant C_API as nvte_ep_* (ep_api.cpp)
    participant Backend as EPBackend singleton
    participant NCCL_EP as ncclEp* (libnccl_ep.so)

    Caller->>C_API: nvte_ep_initialize(ep_comm, group_config)
    C_API->>Backend: EPBackend::initialize()
    Backend->>NCCL_EP: ncclEpCreateGroup()

    Caller->>C_API: "nvte_ep_register_layer(layer_config, &mem_size)"
    C_API->>Backend: register_layer()
    Backend->>NCCL_EP: ncclEpHandleMemSize()
    Backend-->>Caller: handle_id + required mem_size

    Note over Caller: Allocates handle_mem buffer

    loop Per training step
        Caller->>C_API: nvte_ep_prepare(handle, topk_idx, token_counts, stream)
        C_API->>Backend: prepare() → ncclEpUpdateHandle()
        Backend->>NCCL_EP: ncclEpUpdateHandle (AllGather routing map)

        Caller->>C_API: nvte_ep_dispatch(handle, tokens, [win], weights, [win], stream)
        C_API->>Backend: dispatch() → ncclEpDispatch()
        Backend->>NCCL_EP: ncclEpDispatch (scatter tokens to expert ranks)

        Note over Caller: Expert computation on recv_tokens

        Caller->>C_API: nvte_ep_combine(handle, expert_out, [win], result, stream)
        C_API->>Backend: combine() → ncclEpCombine()
        Backend->>NCCL_EP: ncclEpCombine (scatter-sum back to source ranks)
    end

    Caller->>C_API: nvte_ep_shutdown()
    C_API->>Backend: EPBackend::shutdown()
    Backend->>NCCL_EP: ncclEpHandleDestroy + ncclEpGroupDestroy


greptile-apps · 2026-05-22T22:51:11Z

+  cfg.algorithm = NCCL_EP_ALGO_HIGH_THROUGHPUT;
+  cfg.num_experts = static_cast<unsigned int>(group_config.num_experts);
+  cfg.max_dispatch_tokens_per_rank = static_cast<unsigned int>(group_config.max_tokens_per_rank);
+  cfg.max_token_bytes = static_cast<unsigned int>(group_config.hidden_dim * sizeof(nv_bfloat16));


max_token_bytes hardcoded to sizeof(nv_bfloat16) breaks float32 dispatch

cfg.max_token_bytes is computed as hidden_dim * sizeof(nv_bfloat16) (2 bytes), but nvte_dtype_to_nccl supports float32, float16, int32, int64, float8, etc. When a caller creates the EP group with this config and later dispatches float32 tokens (via nvte_ep_dispatch), the pre-allocated max_token_bytes is half the required size. NCCL EP uses this value to size internal staging buffers at group creation; dispatching a wider dtype silently overruns those buffers or triggers an internal NCCL error. NVTEEpGroupConfig needs a dtype (or max_token_element_bytes) field so callers can declare the maximum element width they will use.

Note to self: need to expose this option so users can set it in ep_bootstrap.
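
The sizing concern can be made concrete with a small sketch. It assumes a hypothetical config struct carrying the `max_token_element_bytes` field suggested in the review; `GroupConfigSketch` and the free function `max_token_bytes` are illustrative names, not part of the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical extension: the caller declares the widest element it will dispatch.
struct GroupConfigSketch {
  int hidden_dim;
  std::size_t max_token_element_bytes;  // e.g. 2 for BF16, 4 for FP32
};

// Per-token staging size derived from the declared width instead of a fixed 2 bytes.
std::size_t max_token_bytes(const GroupConfigSketch& cfg) {
  assert(cfg.hidden_dim > 0 && cfg.max_token_element_bytes > 0);
  return static_cast<std::size_t>(cfg.hidden_dim) * cfg.max_token_element_bytes;
}
```

For hidden_dim = 4096 this gives 8192 bytes per token with a 2-byte element and 16384 bytes with a 4-byte element, which is the factor-of-two shortfall described above when FP32 tokens meet a BF16-sized buffer.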


ptrendx · 2026-05-26T21:57:29Z

 endif()

-find_library(TE_LIB NAMES transformer_engine PATHS "${TE_LIB_PATH}/.." ${TE_LIB_PATH} ENV TE_LIB_PATH REQUIRED)
+find_library(TE_LIB NAMES transformer_engine PATHS "${TE_LIB_PATH}/.." ${TE_LIB_PATH} ENV TE_LIB_PATH REQUIRED NO_CMAKE_SYSTEM_PATH)


Why do we need that?

ptrendx · 2026-05-26T21:59:40Z

+# No MPI dependency — processes are spawned by run_test_ep.sh with
+# --rank / --nranks flags.  ncclUniqueId exchange uses a
+# shared temp file (see test_ep_common.h for details).


I believe that the other distributed tests do rely on MPI, so why don't we also do that here?
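
For context, a file-based ncclUniqueId exchange of the kind the CMake comment describes might look roughly like the sketch below. It is not taken from test_ep_common.h: `UniqueId`, `publish_id`, and `wait_for_id` are hypothetical, the 128-byte array stands in for ncclUniqueId, and the write-then-rename step assumes POSIX rename semantics.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>

// Stand-in for ncclUniqueId, which is an opaque 128-byte blob.
using UniqueId = std::array<char, 128>;

// Rank 0: write to a temporary name, then rename so readers never see a partial file.
bool publish_id(const std::string& path, const UniqueId& id) {
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
    out.write(id.data(), static_cast<std::streamsize>(id.size()));
    out.flush();
    if (!out) return false;
  }
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Other ranks: poll until the complete file can be read or the deadline passes.
bool wait_for_id(const std::string& path, UniqueId& id, std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  do {
    std::ifstream in(path, std::ios::binary);
    if (in && in.read(id.data(), static_cast<std::streamsize>(id.size()))) return true;
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  } while (std::chrono::steady_clock::now() < deadline);
  return false;
}
```

Every rank has to poll the same path for this to work, which is why a rank-specific default path would hang without an explicit shared file. An MPI-based bootstrap would replace both functions with a single broadcast of the id from rank 0.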

ptrendx · 2026-05-27T18:42:02Z

+# nvrtc symbols are referenced from libtransformer_engine.so but not in its
+# DT_NEEDED list (loaded via dlopen in Python).  For cpp tests we link nvrtc
+# explicitly with --no-as-needed so the linker keeps the dependency.
+set(EP_TEST_LINK_OPTS "LINKER:--no-as-needed")


This actually sounds like a bug, but the other tests do not need to do this; they instead specify nvrtc after TE_LIB in the LINKER_LIBS variable.

ptrendx · 2026-05-27T18:43:34Z

+// ── Error-checking macros ─────────────────────────────────────────────────────
+
+#define CHECK_NCCL(expr)                                                          \
+  do {                                                                            \
+    ncclResult_t _err = (expr);                                                   \
+    if (_err != ncclSuccess)                                                      \
+      FAIL() << "NCCL error " << _err << ": " << ncclGetErrorString(_err);        \
+  } while (false)
+
+#define CHECK_CUDA(expr)                                                          \
+  do {                                                                            \
+    cudaError_t _err = (expr);                                                    \
+    if (_err != cudaSuccess)                                                      \
+      FAIL() << "CUDA error " << _err << ": " << cudaGetErrorString(_err);        \
+  } while (false)
+
+#define ASSERT_CUDA_OK(expr)                                                      \
+  do {                                                                            \
+    cudaError_t _err = (expr);                                                    \
+    if (_err != cudaSuccess) {                                                    \
+      fprintf(stderr, "CUDA error %d: %s\n", _err, cudaGetErrorString(_err));    \
+      exit(EXIT_FAILURE);                                                         \
+    }                                                                             \
+  } while (false)
+
+#define ASSERT_NCCL_OK(expr)                                                      \
+  do {                                                                            \
+    ncclResult_t _err = (expr);                                                   \
+    if (_err != ncclSuccess) {                                                    \
+      fprintf(stderr, "NCCL error %d: %s\n", _err, ncclGetErrorString(_err));    \
+      exit(EXIT_FAILURE);                                                         \
+    }                                                                             \
+  } while (false)


Why not use logging.h?

ptrendx · 2026-05-27T18:44:18Z

+struct TensorHandle {
+  NVTETensor tensor   = nullptr;
+  void*      dev_ptr  = nullptr;
+
+  ~TensorHandle() {
+    if (tensor) nvte_destroy_tensor(tensor);
+  }
+
+  TensorHandle() = default;
+  TensorHandle(const TensorHandle&) = delete;
+  TensorHandle& operator=(const TensorHandle&) = delete;
+
+  TensorHandle(TensorHandle&& o) noexcept : tensor(o.tensor), dev_ptr(o.dev_ptr) {
+    o.tensor = nullptr; o.dev_ptr = nullptr;
+  }
+  TensorHandle& operator=(TensorHandle&& o) noexcept {
+    if (this != &o) {
+      if (tensor) nvte_destroy_tensor(tensor);
+      tensor = o.tensor; dev_ptr = o.dev_ptr;
+      o.tensor = nullptr; o.dev_ptr = nullptr;
+    }
+    return *this;
+  }
+};


Why not TensorWrapper?

ptrendx · 2026-05-27T18:44:56Z

+
+// RAII owner for a cudaMalloc'd device buffer; frees on destruction.
+template <typename T>
+struct DevBuf {


We already have a very similar thing in test_common.h.

ptrendx · 2026-05-27T18:48:24Z

@@ -0,0 +1,64 @@
+/*************************************************************************


I feel that having those tests as separate entities does not really make sense and will introduce overhead to the CI - the actual functionality tests would already be able to cover those initialization issues, no?

ptrendx · 2026-05-27T18:49:54Z

+};
+
+// Bundled NVTETensor views over an EPBuffers — one place to update the shape
+// conventions when the C-API evolves.


What do you mean by "when the C-API evolves"? We should aim for stability of the C API.

ptrendx · 2026-05-27T18:53:13Z

+  CHECK_CUDA(cudaMemcpy(h_result.data(), buf.result.get(),
+                        h_result.size() * sizeof(nv_bfloat16), cudaMemcpyDeviceToHost));
+  auto h_tok = generate_tokens(g_process_id, num_tokens_, hidden_dim_);
+  // Spot-check 3 hidden-dim positions per token to catch partial-row writes.


What? Why don't we check the full data?

ptrendx · 2026-05-27T18:53:47Z

+  // Spot-check 3 hidden-dim positions per token to catch partial-row writes.
+  const int probes[3] = {0, hidden_dim_ / 2, hidden_dim_ - 1};
+  for (int tok = 0; tok < num_tokens_; ++tok) {
+    float exp = __bfloat162float(h_tok[tok * hidden_dim_]) * static_cast<float>(top_k_);


Why do we hardcode BF16 everywhere? I assume that NCCL EP works with the other datatypes, right?

ptrendx · 2026-05-27T18:57:06Z

+// BF16 has 7 mantissa bits; relative ULP ≈ 2^-7. Use 4× headroom for
+// accumulation noise inside dispatch/combine.
+static float bf16_tol(float magnitude) {
+  return 4.f * std::ldexp(std::fabs(magnitude) + 1e-3f, -7);
+}


So why can't we just use rtol 2^-5 rather than this formula? In general the error checking here is very custom, could we integrate it better with the rest of the tests?
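
The two tolerances are closer than they look: 4 · 2^-7 · (|m| + 1e-3) = 2^-5 · |m| + 2^-5 · 1e-3, so bf16_tol is a relative tolerance of 2^-5 plus a small absolute floor. A sketch of a full-tensor check built on that observation follows; `all_close` is a hypothetical helper written for illustration, not an existing TE test utility.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Tolerance from the PR's test helper, reproduced for comparison.
float bf16_tol(float magnitude) {
  return 4.f * std::ldexp(std::fabs(magnitude) + 1e-3f, -7);
}

// Hypothetical full-tensor check: every element must satisfy
// |got - want| <= atol + rtol * |want|, rather than probing three positions per row.
bool all_close(const std::vector<float>& got, const std::vector<float>& want,
               float rtol, float atol) {
  if (got.size() != want.size()) return false;
  for (std::size_t i = 0; i < want.size(); ++i)
    if (std::fabs(got[i] - want[i]) > atol + rtol * std::fabs(want[i])) return false;
  return true;
}
```

Calling `all_close(result, expected, std::ldexp(1.f, -5), 1e-3f * std::ldexp(1.f, -5))` applies the same bound as bf16_tol to every element of the buffer.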

ptrendx · 2026-05-27T18:58:04Z

@@ -0,0 +1,562 @@
+/*************************************************************************


What are the cases that this test would catch that the ep_pipeline one would not?

ptrendx · 2026-05-27T19:00:37Z

+namespace transformer_engine {
+namespace ep {
+
+/*! \brief EP backend singleton — owns the NCCL EP group; borrows the comm. */


If it borrows the communicator then on the framework side we need to make sure that it stays alive.
Also, if it is a singleton, how does it work with multiple GPUs per process?

ptrendx · 2026-05-27T19:02:32Z

+
+  // Host-only: reserve a fresh handle_id, cache the layer config, and report
+  // the handle_mem buffer size the caller must allocate.
+  uint64_t register_layer(NVTEEpLayerConfig layer_config, size_t* handle_mem_size);


Is it ever-growing? I don't see any free_layer API.

ptrendx · 2026-05-27T19:04:13Z

+typedef struct {
+  int ep_size;             /*!< EP world size. */
+  int num_experts;         /*!< Total experts across all ranks. */
+  int max_tokens_per_rank; /*!< Upper bound on tokens this rank sends per dispatch. */
+  /*! Upper bound on tokens received per dispatch (worst-case top_k fan-out; must be > 0). */
+  int max_recv_tokens_per_rank;
+  int hidden_dim;  /*!< Token hidden dimension. */
+  int max_num_sms; /*!< Max SMs for EP kernels. 0 = auto. */
+  /*! 0 (default): throw on relocated handle_mem for a cached handle_id. 1: silently rebuild. */
+  int allow_handle_mem_reloc;
+} NVTEEpGroupConfig;
+
+/*! \brief Per-layer EP configuration. */
+typedef struct {
+  int num_local_experts; /*!< Reserved for ABI stability (derived from group config). */
+  int top_k;             /*!< Per-token expert fan-out. Required. */
+  size_t dispatch_output_per_expert_alignment;
+  /*!< Per-expert zone alignment in tokens (pow2; 0/1 = no padding). Must match
+   *   between nvte_ep_register_layer and nvte_ep_prepare. */
+} NVTEEpLayerConfig;


If we make this a public API then we should probably version those?
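
One common way to version a public C config struct is a leading version field that the library validates. The sketch below shows that pattern only; `LayerConfigV`, `struct_version`, `LAYER_CONFIG_VERSION`, and `layer_config_supported` are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical versioned variant of a config struct; field names are illustrative.
typedef struct {
  std::uint32_t struct_version;  // set by the caller to LAYER_CONFIG_VERSION
  int top_k;
  std::size_t dispatch_output_per_expert_alignment;
} LayerConfigV;

constexpr std::uint32_t LAYER_CONFIG_VERSION = 1;

// Library-side check: accept only layouts this build understands.
inline bool layer_config_supported(const LayerConfigV& cfg) {
  return cfg.struct_version >= 1 && cfg.struct_version <= LAYER_CONFIG_VERSION;
}
```

New fields would then be appended at the end with a version bump, so a library built against a newer header can still interpret structs filled in by older callers.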

phu0ngng requested a review from ptrendx as a code owner May 22, 2026 02:42

greptile-apps Bot reviewed May 22, 2026


This was referenced May 22, 2026

[PyTorch] Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy #3035

Draft

[JAX] Expert Parallelism: JAX primitives + VJPs #3036

Open

[Common] Initial NCCL EP integration + Distributed CPP unit tests #3023

Open

phu0ngng requested a review from timmoon10 May 22, 2026 16:17

greptile-apps Bot reviewed May 22, 2026


Expert Parallelism: common C API + NCCL EP v0.1 backend

17e5126

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng force-pushed the phuong/ep-2-commwindow branch from 099857f to 17e5126 Compare May 22, 2026 23:07

phu0ngng and others added 2 commits May 23, 2026 19:36

Expert Parallelism: persistent ncclEpHandle cache with allow_handle_m…

cef4b33

…em_reloc gating Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

0086be4

for more information, see https://pre-commit.ci

ptrendx reviewed May 26, 2026


ptrendx reviewed May 27, 2026

