Add direct HuggingFace safetensors loader for Gemma 4 (E2B/E4B) #919
Open
ssfdre38 wants to merge 8 commits into
Adds initial support for Gemma 4 in gemma.cpp:

- configs.h: Add GEMMA4_E2B/E4B to Model enum, IsVLM(), per_layer_embd_dim field to ModelConfig, fix KVCacheCols() for variable per-layer qkv_dim
- configs.cc: Add ConfigGemma4_E2B() and ConfigGemma4_E4B() with full per-layer config building (BuildGemma4LayerConfigs helper)
  - E2B: 35 layers, model_dim=1536, TTTTF SWA pattern, mixed FFN (6144/12288)
  - E4B: 42 layers, model_dim=2560, TTTTTF SWA pattern, uniform FFN (10240)
  - Both: qkv_dim=256 for SWA layers, qkv_dim=512 for full-attention layers
  - SWA window=512 tokens, final_cap=30.0, vocab=262144
- tensor_info.cc: Register per_layer_token_embd.weight tensor for Gemma 4
- weights.h: Add per_layer_input_embedding MatPtr to WeightsPtrs

Architecture notes:

- Gemma 4 has physically distinct SWA and full-attention layers with different head dimensions (256 vs 512), requiring per-layer LayerConfig
- per_layer_token_embd enables per-layer embedding injection, shape [num_layers * per_layer_embd_dim, vocab_size]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
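As an illustration of the per-layer configuration described above, here is a minimal sketch of expanding a repeating pattern such as TTTTF into per-layer settings. It assumes that `T` marks a sliding-window (SWA) layer and `F` a full-attention layer, which matches the 28+7 and 35+7 splits quoted in this PR. `LayerSketch`, `BuildLayerSketches`, and `CountFullAttention` are hypothetical names for this example only; they are not the `BuildGemma4LayerConfigs` helper from the PR, whose implementation is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for gemma.cpp's LayerConfig; only the two fields
// discussed in the commit message are modeled.
struct LayerSketch {
  bool is_swa;     // true for sliding-window layers ('T'), false for full ('F')
  size_t qkv_dim;  // 256 for SWA layers, 512 for full-attention layers
};

// Expands a repeating pattern such as "TTTTF" across num_layers layers.
std::vector<LayerSketch> BuildLayerSketches(size_t num_layers,
                                            const std::string& pattern) {
  assert(!pattern.empty());
  std::vector<LayerSketch> layers;
  layers.reserve(num_layers);
  for (size_t i = 0; i < num_layers; ++i) {
    const char kind = pattern[i % pattern.size()];
    assert(kind == 'T' || kind == 'F');
    const bool is_swa = (kind == 'T');
    layers.push_back({is_swa, is_swa ? size_t{256} : size_t{512}});
  }
  return layers;
}

// Counts full-attention layers, e.g. to cross-check the 28+7 / 35+7 split.
size_t CountFullAttention(const std::vector<LayerSketch>& layers) {
  size_t n = 0;
  for (const LayerSketch& layer : layers) n += layer.is_swa ? 0 : 1;
  return n;
}
```

Under these assumptions, `BuildLayerSketches(35, "TTTTF")` yields 7 full-attention layers for E2B, and `BuildLayerSketches(42, "TTTTTF")` yields 7 for E4B.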
…r layer)

Gemma 4 uses two different attention head dimensions depending on layer type:

- SWA layers: qkv_dim=256
- Full-attention layers: qkv_dim=512

The previous code indexed the KV cache as layer_idx * cache_layer_size, which assumes all layers have the same qkv_dim. For Gemma 4 this is wrong: layers 0-3 use 512 bytes/head, layer 4 uses 1024 bytes/head, etc., so the cumulative offset does not equal index × current_size.

Fix: add KVCacheLayerOffset() to ModelConfig that sums CacheLayerSize() for all preceding layers. For existing uniform models this produces the same result as before. Update DotSoftmaxWeightedSum() and ComputeQKV() in attention.cc to use the new method.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
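The offset computation described in this commit can be sketched as a prefix sum over per-layer sizes. This is a simplified illustration, not the PR's `KVCacheLayerOffset()`: the function name `LayerOffsetSketch` is hypothetical, and the sizes are passed in as a plain vector instead of being derived from `ModelConfig`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the cache offset of layer_idx as the sum of the sizes of all
// preceding layers. With uniform sizes this equals layer_idx * size.
size_t LayerOffsetSketch(const std::vector<size_t>& layer_sizes,
                         size_t layer_idx) {
  assert(layer_idx < layer_sizes.size());
  size_t offset = 0;
  for (size_t i = 0; i < layer_idx; ++i) offset += layer_sizes[i];
  return offset;
}
```

For sizes `{512, 512, 512, 512, 1024, 512}`, the offset of layer 5 is 3072, whereas the old `layer_idx * cache_layer_size` formula would give 5 × 512 = 2560.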
The vendored sentencepiece_processor.h uses uint32_t without including <cstdint>, which fails to compile with newer g++ versions (MinGW-w64 on Windows). Add the missing include to unblock the full gemma target build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifies E2B and E4B model configs against values observed in GGUF metadata:

- Layer counts (35/42), model dims (1536/2560), vocab (262144)
- SWA/full-attention layer distribution (28+7 / 35+7)
- Per-layer qkv_dim (256 for SWA, 512 for full-att)
- Non-uniform FFN dims for E2B (6144 layers 0-14, 12288 layers 15-34)
- KV cache layout correctness via KVCacheLayerOffset()
- Serialize/deserialize round-trip

All tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements zero-conversion BF16 weight loading from HuggingFace safetensors directories, bypassing BlobStore to preserve exact weight precision.

New files:

- io/safetensors.h/.cc: SafetensorsIndex class that scans sharded *.safetensors files, parses the 8-byte LE header + JSON, builds a unified tensor index with random-access reads via File::Read()
- gemma/load_safetensors.cc: WeightsPtrs::LoadFromSafetensors() maps HF tensor names to gemma.cpp MatPtrs; handles Q+K+V concat, gate+up concat, o_proj direct copy, per_layer_embd transpose [L,V,D]->[L*D,V], and calls Fixup()

Modified files:

- gemma/weights.h: adds public LoadFromSafetensors() declaration
- gemma/model_store.h/.cc: adds ModelStore(ModelConfig&, Path&) ctor for BlobStore-free construction (reads tokenizer from file)
- gemma/gemma.h: changes BlobReader reader_ to unique_ptr<BlobReader>; adds Gemma(ModelConfig, tokenizer_path, safetensors_dir, ...) constructor
- gemma/gemma.cc: fixes reader_ -> *reader_ refs; adds safetensors constructor
- CMakeLists.txt: adds io/safetensors.cc, gemma/load_safetensors.cc to SOURCES; links nlohmann_json::nlohmann_json to libgemma

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
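For readers unfamiliar with the "8-byte LE header + JSON" layout mentioned above: a safetensors file begins with an unsigned 64-bit little-endian integer giving the byte length of the JSON header, and the raw tensor bytes follow that header. The sketch below only decodes that length prefix. The function names are hypothetical and are not taken from the PR's `SafetensorsIndex`; JSON parsing (which the PR delegates to nlohmann_json) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decodes the 8-byte little-endian header length that starts a safetensors
// file. Works regardless of host endianness because bytes are combined
// explicitly, most significant byte first.
uint64_t DecodeHeaderLength(const uint8_t* bytes, size_t num_bytes) {
  assert(num_bytes >= 8);
  uint64_t n = 0;
  for (int i = 7; i >= 0; --i) n = (n << 8) | bytes[i];
  return n;
}

// Byte offset in the file where raw tensor data begins: the 8-byte length
// prefix plus the JSON header itself.
uint64_t DataStartOffset(uint64_t header_len) { return 8 + header_len; }
```

The `data_offsets` recorded for each tensor in the JSON header are relative to this data start, so a seek-based reader would add `DataStartOffset(header_len)` to them before reading.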
When --safetensors <dir> and --model_spec <specifier> are both given, constructs Gemma via the new safetensors constructor instead of the BlobStore path. Uses unique_ptr<Gemma> to avoid copy/move issues.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Public Gemma 4 checkpoints (e4b/e2b) on HuggingFace wrap the language model under model.language_model.*, not model.* directly. Also, the per-layer token embedding is named embed_tokens_per_layer.weight with shape [V, L*D] (not [L,V,D]), requiring a simpler matrix transpose.

- LN() prefix: model.layers.N. -> model.language_model.layers.N.
- Global tensors: model.embed_tokens.weight -> model.language_model.*
- LoadPerLayerEmbd: new name + correct [V, L*D] -> [L*D, V] transpose

Tested: 2130 tensors indexed, 42 layers loaded, prompt processing begins (CPU-only inference is slow for 4B BF16 model).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
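The two changes in this commit can be illustrated with a small sketch: building the corrected layer-name prefix, and transposing a row-major [V, L*D] matrix into [L*D, V]. Both functions are hypothetical stand-ins (the PR's `LN()` and `LoadPerLayerEmbd` are not shown here), and the sketch treats BF16 values as opaque 16-bit words held in memory, which may differ from how the loader actually streams data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Builds the per-layer tensor name prefix used by the public checkpoints,
// e.g. "model.language_model.layers.3." for layer 3.
std::string LayerPrefixSketch(size_t layer_idx) {
  return "model.language_model.layers." + std::to_string(layer_idx) + ".";
}

// Transposes a row-major [rows, cols] matrix of 16-bit words into a
// row-major [cols, rows] matrix. For the per-layer embedding, rows = V and
// cols = L*D, so the result has shape [L*D, V].
std::vector<uint16_t> TransposeSketch(const std::vector<uint16_t>& src,
                                      size_t rows, size_t cols) {
  assert(src.size() == rows * cols);
  std::vector<uint16_t> dst(src.size());
  for (size_t r = 0; r < rows; ++r) {
    for (size_t c = 0; c < cols; ++c) {
      dst[c * rows + r] = src[r * cols + c];
    }
  }
  return dst;
}
```

A real loader for a 262144-entry vocabulary would likely want a blocked or parallel transpose; the nested loop here favors clarity over speed.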
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Author
I'm part of the CLA already.
## Summary

Adds a new code path to load Gemma 4 weights directly from HuggingFace `*.safetensors` files, bypassing the BlobStore conversion step. This avoids any potential weight precision loss from format conversion and lets you use freshly downloaded HF checkpoints without a separate conversion tool.

## Changes

### New files

- `io/safetensors.h` / `io/safetensors.cc`: `SafetensorsIndex` class. Scans a directory for `*.safetensors` shards, parses the 8-byte LE header + JSON, and provides `ReadTensor()` via seek-based I/O. Handles both single-file and sharded (`model.safetensors.index.json`) checkpoints.
- `gemma/load_safetensors.cc`: `WeightsPtrs::LoadFromSafetensors()`. Maps HF tensor names to gemma.cpp `MatPtr` fields. Handles Q/K/V concat, gate/up-proj concat, o_proj direct load, and per-layer token embedding transpose (`[V, L*D]` → `[L*D, V]`). Calls `Fixup()` at the end.

### Modified files

- `gemma/weights.h`: adds `LoadFromSafetensors()` public declaration
- `gemma/model_store.h` / `model_store.cc`: adds `ModelStore(const ModelConfig&, const Path& tokenizer_path)` constructor for the safetensors path (reads tokenizer directly from file, leaves `scales_` empty)
- `gemma/gemma.h` / `gemma.cc`: adds `Gemma(ModelConfig, tokenizer, safetensors_dir, InferenceArgs, ThreadingContext)` constructor; changes `BlobReader reader_` to `unique_ptr<BlobReader>` to allow null when not using BlobStore
- `gemma/gemma_args.h`: adds `--safetensors` (directory path) and `--model_spec` (e.g. `gemma4-e4b-bf16-it`) flags to `LoaderArgs`
- `gemma/run.cc`: wires new flags into `Run()` with a conditional branch (uses `unique_ptr<Gemma>` to avoid copy/move)
- `CMakeLists.txt`: adds new source files; links `nlohmann_json` to `libgemma`

## Usage

    ./gemma --safetensors /path/to/gemma-4-e4b-it \
      --model_spec gemma4-e4b-bf16-it \
      --tokenizer /path/to/tokenizer.model \
      --prompt "Hello!"

The `--model_spec` specifier uses the existing `ModelConfig(std::string)` format, `{model-prefix}-{type}-{wrapping}`, e.g. `gemma4-e2b-bf16-it` or `gemma4-e4b-bf16-it`.

## Tested

- `--help` shows both new flags with descriptions
- HF checkpoint (`model.language_model.*` prefix, 2130 tensors, 42 layers): loads fully, prompt processing begins
- Single-file and sharded (`*.index.json`) layouts supported by `SafetensorsIndex`
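The Q/K/V concatenation listed under the new files can be pictured as stacking separate projection matrices into one buffer. The sketch below assumes the three matrices are row-major, share a column count, and are stacked row-wise in Q, K, V order. That ordering and layout are assumptions made for illustration, since the layout gemma.cpp expects is not stated in this PR description. `ConcatRowsSketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stacks row-major matrices that share a column count into one buffer, in
// the order given (e.g. Q rows, then K rows, then V rows). Values are raw
// 16-bit words such as BF16; no numeric conversion takes place.
std::vector<uint16_t> ConcatRowsSketch(
    const std::vector<std::vector<uint16_t>>& parts, size_t cols) {
  assert(cols > 0);
  std::vector<uint16_t> out;
  for (const std::vector<uint16_t>& part : parts) {
    assert(part.size() % cols == 0);  // each part must hold whole rows
    out.insert(out.end(), part.begin(), part.end());
  }
  return out;
}
```

Because row-major matrices with equal column counts are contiguous by row, row-wise stacking reduces to appending the buffers. A column-wise concat would instead need to interleave rows.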