Add direct HuggingFace safetensors loader for Gemma 4 (E2B/E4B) by ssfdre38 · Pull Request #919 · google/gemma.cpp

ssfdre38 · 2026-05-21T19:01:29Z

Summary

Adds a new code path to load Gemma 4 weights directly from HuggingFace *.safetensors files, bypassing the BlobStore conversion step. This avoids any potential weight precision loss from format conversion and lets you use freshly downloaded HF checkpoints without a separate conversion tool.

Changes

New files

io/safetensors.h / io/safetensors.cc — SafetensorsIndex class: scans a directory for *.safetensors shards, parses the 8-byte LE header + JSON, and provides ReadTensor() via seek-based I/O. Handles both single-file and sharded (model.safetensors.index.json) checkpoints.
gemma/load_safetensors.cc — WeightsPtrs::LoadFromSafetensors(): maps HF tensor names to gemma.cpp MatPtr fields. Handles Q/K/V concat, gate/up-proj concat, o_proj direct load, and per-layer token embedding transpose ([V, L*D] → [L*D, V]). Calls Fixup() at the end.
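The header-parsing step that SafetensorsIndex performs can be sketched as follows. This is a minimal, hypothetical illustration: ParsePreamble and SafetensorsPreamble are not names from the PR. It assumes the start of a shard is already in memory, whereas the PR uses seek-based I/O. It also stops at extracting the raw JSON text, which is presumably what the PR hands to nlohmann_json.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch: decode the safetensors preamble from a shard's bytes.
// Layout: an unsigned 64-bit little-endian header length N, then N bytes of
// JSON, then the raw tensor data. "data_offsets" inside the JSON are relative
// to the start of the data section, i.e. to file offset 8 + N.
struct SafetensorsPreamble {
  std::string header_json;  // raw JSON text, to be handed to a JSON parser
  uint64_t data_start;      // absolute file offset of the data section
};

SafetensorsPreamble ParsePreamble(const std::vector<uint8_t>& bytes) {
  if (bytes.size() < 8) throw std::runtime_error("file shorter than 8 bytes");
  // Assemble the length byte by byte (bytes[0] is least significant) so the
  // result is correct regardless of host endianness.
  uint64_t n = 0;
  for (size_t i = 8; i-- > 0;) n = (n << 8) | bytes[i];
  if (n > bytes.size() - 8) throw std::runtime_error("truncated JSON header");
  SafetensorsPreamble p;
  p.header_json.assign(reinterpret_cast<const char*>(bytes.data()) + 8,
                       static_cast<size_t>(n));
  p.data_start = 8 + n;
  return p;
}
```

A real reader would then look up each tensor's dtype, shape and data_offsets in the parsed JSON and seek to data_start + begin.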

Modified files

gemma/weights.h — adds LoadFromSafetensors() public declaration
gemma/model_store.h / model_store.cc — adds ModelStore(const ModelConfig&, const Path& tokenizer_path) constructor for the safetensors path (reads tokenizer directly from file, leaves scales_ empty)
gemma/gemma.h / gemma.cc — adds Gemma(ModelConfig, tokenizer, safetensors_dir, InferenceArgs, ThreadingContext) constructor; changes BlobReader reader_ to unique_ptr<BlobReader> to allow null when not using BlobStore
gemma/gemma_args.h — adds --safetensors (directory path) and --model_spec (e.g. gemma4-e4b-bf16-it) flags to LoaderArgs
gemma/run.cc — wires new flags into Run() with a conditional branch (uses unique_ptr<Gemma> to avoid copy/move)
CMakeLists.txt — adds new source files; links nlohmann_json to libgemma

Usage

./gemma --safetensors /path/to/gemma-4-e4b-it \
        --model_spec   gemma4-e4b-bf16-it \
        --tokenizer    /path/to/tokenizer.model \
        --prompt       "Hello!"

The --model_spec specifier uses the existing ModelConfig(std::string) format {model-prefix}-{type}-{wrapping}, e.g. gemma4-e2b-bf16-it or gemma4-e4b-bf16-it.
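To make the specifier format concrete, here is a hypothetical splitter. SplitModelSpec is illustrative only and is not gemma.cpp's ModelConfig(std::string) parser. It assumes the model prefix may itself contain hyphens, as in gemma4-e4b, so it splits on the last two hyphens.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: split "{model-prefix}-{type}-{wrapping}".
struct ModelSpecParts {
  std::string prefix;    // e.g. "gemma4-e4b"
  std::string type;      // e.g. "bf16"
  std::string wrapping;  // e.g. "it"
};

ModelSpecParts SplitModelSpec(const std::string& spec) {
  // The last hyphen separates the wrapping; the one before it, the type.
  const size_t last = spec.rfind('-');
  if (last == std::string::npos || last == 0)
    throw std::invalid_argument("missing -{wrapping}: " + spec);
  const size_t mid = spec.rfind('-', last - 1);
  if (mid == std::string::npos || mid == 0)
    throw std::invalid_argument("missing -{type}: " + spec);
  return {spec.substr(0, mid), spec.substr(mid + 1, last - mid - 1),
          spec.substr(last + 1)};
}
```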

Tested

Builds cleanly (MSVC + ninja) with the existing CMake setup
--help shows both new flags with descriptions
E4B multimodal HF checkpoint (model.language_model.* prefix, 2130 tensors, 42 layers): loads fully, prompt processing begins
Both single-shard and multi-shard (*.index.json) layouts supported by SafetensorsIndex

Adds initial support for Gemma 4 in gemma.cpp:
- configs.h: Add GEMMA4_E2B/E4B to Model enum, IsVLM(), per_layer_embd_dim field to ModelConfig, fix KVCacheCols() for variable per-layer qkv_dim
- configs.cc: Add ConfigGemma4_E2B() and ConfigGemma4_E4B() with full per-layer config building (BuildGemma4LayerConfigs helper)
- E2B: 35 layers, model_dim=1536, TTTTF SWA pattern, mixed FFN (6144/12288)
- E4B: 42 layers, model_dim=2560, TTTTTF SWA pattern, uniform FFN (10240)
- Both: qkv_dim=256 for SWA layers, qkv_dim=512 for full-attention layers
- SWA window=512 tokens, final_cap=30.0, vocab=262144
- tensor_info.cc: Register per_layer_token_embd.weight tensor for Gemma 4
- weights.h: Add per_layer_input_embedding MatPtr to WeightsPtrs

Architecture notes:
- Gemma 4 has physically distinct SWA and full-attention layers with different head dimensions (256 vs 512), requiring a per-layer LayerConfig
- per_layer_token_embd enables per-layer embedding injection, shape [num_layers * per_layer_embd_dim, vocab_size]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
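The per-layer config building amounts to tiling an attention pattern across the layer stack. The sketch below uses a hypothetical LayerSketch struct in place of gemma.cpp's LayerConfig and a hypothetical BuildLayersSketch in place of BuildGemma4LayerConfigs. It assumes 'T' marks an SWA layer and 'F' a full-attention layer, and it takes the E2B FFN split (6144 for layers 0-14, 12288 for layers 15-34) from the config-test commit.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for LayerConfig: only fields that
// differ between Gemma 4 layers are modeled.
struct LayerSketch {
  bool sliding_window;   // true = SWA ('T'), false = full attention ('F')
  size_t qkv_dim;        // 256 for SWA layers, 512 for full-attention layers
  size_t ff_hidden_dim;  // FFN width, possibly non-uniform across layers
};

// Repeats `pattern` (e.g. "TTTTF") across num_layers layers.
template <class FFDim>
std::vector<LayerSketch> BuildLayersSketch(size_t num_layers,
                                           const std::string& pattern,
                                           FFDim ff_dim_for_layer) {
  if (pattern.empty()) throw std::invalid_argument("empty pattern");
  std::vector<LayerSketch> layers;
  layers.reserve(num_layers);
  for (size_t i = 0; i < num_layers; ++i) {
    const bool swa = pattern[i % pattern.size()] == 'T';
    layers.push_back(
        {swa, swa ? size_t{256} : size_t{512}, ff_dim_for_layer(i)});
  }
  return layers;
}

// Counts the sliding-window layers in a built stack.
size_t CountSWA(const std::vector<LayerSketch>& layers) {
  size_t n = 0;
  for (const LayerSketch& l : layers) n += l.sliding_window ? 1 : 0;
  return n;
}
```

With these inputs, E2B yields 28 SWA and 7 full-attention layers, and E4B yields 35 and 7, matching the distribution checked in the tests.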

…r layer)

Gemma 4 uses two different attention head dimensions depending on layer type:
- SWA layers: qkv_dim=256
- Full-attention layers: qkv_dim=512

The previous code indexed the KV cache as layer_idx * cache_layer_size, which assumes all layers have the same qkv_dim. For Gemma 4 this is wrong: layers 0-3 use 512 cache entries per head (K+V, 2 × 256), layer 4 uses 1024 entries per head, etc., so the cumulative offset does not equal index × current_size.

Fix: add KVCacheLayerOffset() to ModelConfig, which sums CacheLayerSize() over all preceding layers. For existing uniform models this produces the same result as before. Update DotSoftmaxWeightedSum() and ComputeQKV() in attention.cc to use the new method.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
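The offset fix is a prefix sum over per-layer cache sizes. The sketch below assumes, hypothetically, that a layer occupies 2 × kv_heads × qkv_dim entries (K plus V). gemma.cpp's actual CacheLayerSize() and whether the offset indexes within a per-token row are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-layer size: K and V each hold qkv_dim entries per KV head.
size_t CacheLayerSizeSketch(size_t kv_heads, size_t qkv_dim) {
  return 2 * kv_heads * qkv_dim;
}

// Offset of layer `layer_idx` = sum of the sizes of all preceding layers.
// For uniform qkv_dims this reduces to layer_idx * CacheLayerSizeSketch(...).
size_t KVCacheLayerOffsetSketch(const std::vector<size_t>& qkv_dims,
                                size_t kv_heads, size_t layer_idx) {
  size_t offset = 0;
  for (size_t i = 0; i < layer_idx; ++i)
    offset += CacheLayerSizeSketch(kv_heads, qkv_dims[i]);
  return offset;
}
```

With one KV head and the E2B pattern (layer 4 full attention), layer 5 starts at 3072 entries, whereas index × current_size would give 2560.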

The vendored sentencepiece_processor.h uses uint32_t without including <cstdint>, which fails to compile with newer g++ versions (MinGW-w64 on Windows). Add the missing include to unblock the full gemma target build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Verifies E2B and E4B model configs against values observed in GGUF metadata:
- Layer counts (35/42), model dims (1536/2560), vocab (262144)
- SWA/full-attention layer distribution (28+7 / 35+7)
- Per-layer qkv_dim (256 for SWA, 512 for full attention)
- Non-uniform FFN dims for E2B (6144 for layers 0-14, 12288 for layers 15-34)
- KV cache layout correctness via KVCacheLayerOffset()
- Serialize/deserialize round-trip

All tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Implements zero-conversion BF16 weight loading from HuggingFace safetensors directories, bypassing BlobStore to preserve exact weight precision.

New files:
- io/safetensors.h/.cc: SafetensorsIndex class that scans sharded *.safetensors files, parses the 8-byte LE header + JSON, and builds a unified tensor index with random-access reads via File::Read()
- gemma/load_safetensors.cc: WeightsPtrs::LoadFromSafetensors() maps HF tensor names to gemma.cpp MatPtrs; handles Q+K+V concat, gate+up concat, o_proj direct copy, per_layer_embd transpose [L,V,D]->[L*D,V], and calls Fixup()

Modified files:
- gemma/weights.h: adds public LoadFromSafetensors() declaration
- gemma/model_store.h/.cc: adds ModelStore(ModelConfig&, Path&) ctor for BlobStore-free construction (reads tokenizer from file)
- gemma/gemma.h: changes BlobReader reader_ to unique_ptr<BlobReader>; adds Gemma(ModelConfig, tokenizer_path, safetensors_dir, ...) constructor
- gemma/gemma.cc: fixes reader_ -> *reader_ refs; adds safetensors constructor
- CMakeLists.txt: adds io/safetensors.cc, gemma/load_safetensors.cc to SOURCES; links nlohmann_json::nlohmann_json to libgemma

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
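The Q+K+V and gate+up concatenations can be pictured as a vertical stack of row-major [out, in] matrices. This is a sketch under assumptions: float instead of BF16, a hypothetical Mat type and ConcatRows helper, and plain stacking in part order. The exact fused layout gemma.cpp expects (for example how K and V are arranged relative to each other) is not confirmed here.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical row-major matrix: data.size() == rows * cols.
struct Mat {
  size_t rows, cols;
  std::vector<float> data;
};

// Stacks [rows_i, cols] matrices into one [sum(rows_i), cols] matrix,
// e.g. q_proj, k_proj, v_proj -> fused QKV, or gate_proj, up_proj -> fused.
Mat ConcatRows(const std::vector<const Mat*>& parts) {
  if (parts.empty()) throw std::invalid_argument("no parts");
  Mat out{0, parts[0]->cols, {}};
  for (const Mat* m : parts) {
    if (m->cols != out.cols) throw std::invalid_argument("column mismatch");
    // With row-major storage, a vertical stack is a plain append.
    out.data.insert(out.data.end(), m->data.begin(), m->data.end());
    out.rows += m->rows;
  }
  return out;
}
```

Because every part shares the input dimension (model_dim), only the output rows grow, which is why the column check is the only shape validation needed.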

When --safetensors <dir> and --model_spec <specifier> are both given, Run() constructs Gemma via the new safetensors constructor instead of the BlobStore path. It uses unique_ptr<Gemma> to avoid copy/move issues.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Public Gemma 4 checkpoints (e4b/e2b) on HuggingFace wrap the language model under model.language_model.*, not model.* directly. Also, the per-layer token embedding is named embed_tokens_per_layer.weight with shape [V, L*D] (not [L,V,D]), requiring a simpler matrix transpose.
- LN() prefix: model.layers.N. -> model.language_model.layers.N.
- Global tensors: model.embed_tokens.weight -> model.language_model.*
- LoadPerLayerEmbd: new name + correct [V, L*D] -> [L*D, V] transpose

Tested: 2130 tensors indexed, 42 layers loaded, prompt processing begins (CPU-only inference is slow for a 4B BF16 model).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
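The renaming and the [V, L*D] -> [L*D, V] transpose are sketched below. LayerPrefix and Transpose are hypothetical helpers, not the PR's LN() or LoadPerLayerEmbd. The transpose is the naive element-by-element version; with V = 262144 rows, a cache-blocked loop would likely be faster.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: per-layer tensor names in these checkpoints live
// under "model.language_model.layers.N.".
std::string LayerPrefix(size_t layer) {
  return "model.language_model.layers." + std::to_string(layer) + ".";
}

// Transposes a row-major [rows, cols] matrix into row-major [cols, rows].
// For embed_tokens_per_layer.weight: rows = V, cols = L * D -> [L*D, V].
template <typename T>
std::vector<T> Transpose(const std::vector<T>& in, size_t rows, size_t cols) {
  std::vector<T> out(in.size());
  for (size_t r = 0; r < rows; ++r)
    for (size_t c = 0; c < cols; ++c) out[c * rows + r] = in[r * cols + c];
  return out;
}
```

For example, LayerPrefix(3) + "self_attn.q_proj.weight" would name layer 3's query projection, assuming the usual HF suffix.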

google-cla · 2026-05-21T19:01:35Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

ssfdre38 · 2026-05-21T19:03:16Z

I'm part of the CLA already.

ssfdre38 and others added 8 commits May 21, 2026 10:23

build: remove stray .o object files (ba6fd86)

ssfdre38 wants to merge 8 commits into google:main from ssfdre38:safetensors-gemma4-loader
