Commit c45da17

author
Paul C
committed
v22.7.3: small-model AI fits + VM passthrough network-safety preflight
Two Discord-reported issues with concrete fixes.

1. Local AI: 143 KB request body overflowed FunctionGemma's context

Reported by Gary KO4BSR 2026-05-01. The chat() path was injecting the full embedded knowledge base (~200 KB hand-written + generated product docs) into every system prompt: fine for cloud providers with 100K+ context windows, fatal for 4-8 K context windows on small local models. The model would respond with `finish_reason=tool_calls` and an empty/malformed tool_calls array because it ran out of tokens before the response could form, and WolfStack reported "empty response".

Fix: new `build_compact_system_prompt` strips the KB block. The chat() path picks this for `provider == "local"`; cloud paths keep the full prompt where the KB is actually load-bearing. Compact prompt is comfortably under 10 KB (test pin).

Plus: when finish_reason is `tool_calls` but nothing was dispatchable, the error message now lists what tool names the model emitted vs what WolfStack exposes, so the next "I tried model X and it didn't work" report carries the diagnosis on its face. With suggested alternatives (qwen2.5:3b, llama3.1, mistral-functioncalling) that align to the OpenAI tool schema WolfStack advertises.

2. VM start nukes the host network when passthrough takes the uplink

Reported by PapaSchlumpf 2026-05-02. HomeAssistant VM with PCI passthrough of a NIC nuked DHCP for the entire network on start; reboot fixed, WolfStack restart didn't. Root cause is structural: VFIO passthrough removes the device from the host kernel namespace, so dnsmasq (and any default-route binding) loses its leg instantly. The kernel-state nature is why a reboot is required to recover.

Fix: new `check_passthrough_steals_host_net` runs as part of the start_vm preflight (alongside the existing find_conflicts USB/PCI check). It returns `Some(iface)` when:
• a NicConfig.passthrough_interface (MACVTAP-style direct bind) names the host's default-route interface, OR
• a pci_devices entry's BDF resolves (via /sys/bus/pci/.../net/) to that same interface.

start_vm refuses with a clear, multi-line resolution message listing three operator paths: switch to a virtio bridge NIC, move the host's primary connectivity to a different physical NIC, or accept the disconnection. No reboot ever needed because the start is blocked before the disconnection happens.

Reads /proc/net/route directly (no shellout), maps PCI BDF → interface via sysfs, handles both short-form (`01:00.0`) and full-form (`0000:01:00.0`) BDF spellings. Fails open when the host has no default route (dev/lab boxes shouldn't have legitimate starts blocked).

5 new network_preflight_tests + 2 new content_tool_call_tests for the compact prompt. 433/433 tests pass; full suite clean. Code-reviewer pass on the prior delta (v22.7.2) carried both BLOCKERs addressed; same parser hardening still in effect here.
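
The short-form / full-form BDF handling described in the message can be sketched in isolation. This is a minimal illustration, not the commit's code: `normalise_bdf` and `sysfs_net_dir` are hypothetical helper names, and the sketch assumes (as the diff does) that a short-form BDF always belongs to PCI domain `0000`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: expand a short-form BDF (`01:00.0`) to the
/// full form sysfs uses (`0000:01:00.0`). Full-form input is returned
/// unchanged. Assumes PCI domain 0000 for short-form input.
fn normalise_bdf(bdf: &str) -> String {
    // Short form has exactly one ':' (bus:device.function);
    // full form has two (domain:bus:device.function).
    if bdf.matches(':').count() == 1 {
        format!("0000:{}", bdf)
    } else {
        bdf.to_string()
    }
}

/// Hypothetical helper: the sysfs directory whose entries name the
/// network interface(s) backed by this PCI device, if it is a NIC
/// that is currently bound to a host network driver.
fn sysfs_net_dir(bdf: &str) -> PathBuf {
    Path::new("/sys/bus/pci/devices")
        .join(normalise_bdf(bdf))
        .join("net")
}
```

Keeping the normalisation separate from the filesystem read means it can be unit-tested on any machine, whether or not the PCI device exists.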
1 parent 3c9d3c4 commit c45da17

4 files changed

Lines changed: 314 additions & 10 deletions

Cargo.toml

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,6 @@
11
[package]
22
name = "wolfstack"
3-
version = "22.7.2"
3+
version = "22.7.3"
44
edition = "2024"
55
authors = ["Wolf Software Systems Ltd"]
66
description = "Server management platform for the Wolf software suite"

src/ai/mod.rs

Lines changed: 115 additions & 8 deletions
@@ -538,7 +538,17 @@ impl AiAgent {
538538
h.iter().rev().take(10).cloned().collect::<Vec<_>>().into_iter().rev().collect()
539539
};
540540

541-
let system_prompt = build_system_prompt(&self.knowledge_base, system_context);
541+
// Local providers get a compact system prompt that omits the
542+
// ~200 KB embedded knowledge base — small models (2-8 B)
543+
// routinely have 4-8 K context windows that the full prompt
544+
// can't fit. Cloud providers (Claude / Gemini / OpenAI /
545+
// OpenRouter) keep the full KB; their context windows are
546+
// 100K+ tokens and the KB is genuinely useful for grounding.
547+
let system_prompt = if config.provider == "local" {
548+
build_compact_system_prompt(system_context)
549+
} else {
550+
build_system_prompt(&self.knowledge_base, system_context)
551+
};
542552

543553
let mut current_msg = user_message.to_string();
544554
let mut final_response = String::new();
@@ -1970,6 +1980,31 @@ fn build_system_prompt(knowledge: &str, server_context: &str) -> String {
19701980
)
19711981
}
19721982

1983+
/// Compact system prompt for local / small-context providers. Strips
1984+
/// the embedded knowledge base (~200 KB hand-written + generated) so
1985+
/// the request fits inside the 4-8 K context windows typical of
1986+
/// 2-8 B local models. The model still gets the capability
1987+
/// instructions and tool-use rules — just not the full Wolf product
1988+
/// docs, which it usually doesn't need to answer cluster-state
1989+
/// questions anyway.
1990+
///
1991+
/// Reported on Discord (Gary KO4BSR 2026-05-01): FunctionGemma's
1992+
/// test-connection returned `finish_reason=tool_calls` with empty
1993+
/// tool_calls and a 143 KB request body — the KB alone overflowed
1994+
/// the model's context window and the response collapsed.
1995+
fn build_compact_system_prompt(server_context: &str) -> String {
1996+
let full = build_system_prompt("", server_context);
1997+
// The full builder still includes the "Below is comprehensive
1998+
// documentation…" header even when knowledge is empty. Strip
1999+
// that trailing line so the model isn't told to consult docs
2000+
// that aren't there.
2001+
if let Some(idx) = full.rfind("Below is comprehensive documentation") {
2002+
full[..idx].trim_end().to_string()
2003+
} else {
2004+
full
2005+
}
2006+
}
2007+
19732008
// ─── Simple / stateless chat helper ───
19742009

19752010
/// Single-shot prompt-to-response against the configured AI provider.
@@ -2443,20 +2478,57 @@ async fn call_local_inner(
24432478
}
24442479

24452480
if combined.is_empty() {
2481+
// Special case: model signalled `finish_reason=tool_calls`
2482+
// (intent to call a tool) but neither the structured
2483+
// `tool_calls` array nor the content fallback yielded
2484+
// anything dispatchable. Surface what the model actually
2485+
// emitted so the user can see WHY the call didn't translate
2486+
// — e.g., FunctionGemma calling a function name we don't
2487+
// expose, or a malformed tool_calls payload.
2488+
if finish_reason == "tool_calls" {
2489+
let tool_calls_preview = serde_json::to_string(&msg["tool_calls"])
2490+
.unwrap_or_else(|_| "<unserialisable>".into());
2491+
let names_seen: Vec<&str> = msg["tool_calls"].as_array()
2492+
.map(|a| a.iter()
2493+
.filter_map(|t| t["function"]["name"].as_str())
2494+
.collect())
2495+
.unwrap_or_default();
2496+
tracing::warn!(
2497+
target: "wolfstack::ai",
2498+
"call_local: model={} signalled tool_calls but nothing dispatched. \
2499+
Names seen: {:?}. Allowed: {:?}. Raw tool_calls: {}",
2500+
model, names_seen, MAIN_AI_TOOLS,
2501+
tool_calls_preview.chars().take(800).collect::<String>(),
2502+
);
2503+
let names_str = if names_seen.is_empty() {
2504+
"(none — empty tool_calls array)".to_string()
2505+
} else {
2506+
format!("[{}]", names_seen.join(", "))
2507+
};
2508+
return Err(format!(
2509+
"Local AI ({}) wanted to call tools but emitted names this \
2510+
build doesn't expose. Got: {}. Allowed: [{}]. \
2511+
The model is most likely matching its training-time tool \
2512+
catalogue rather than WolfStack's. Try a model fine-tuned \
2513+
on OpenAI-style function-calling (qwen2.5:3b, llama3.1, \
2514+
mistral-functioncalling) — those align to the schema \
2515+
WolfStack advertises.",
2516+
model, names_str, MAIN_AI_TOOLS.join(", "),
2517+
));
2518+
}
24462519
tracing::warn!(
24472520
target: "wolfstack::ai",
24482521
"call_local: empty response (model={} finish_reason={} body_size={}). \
2449-
Common causes: model context exceeded by system prompt + tools + \
2450-
history; model doesn't follow instructions; server filtered the \
2451-
output. Body preview: {}",
2522+
Common causes: context exceeded; model doesn't follow instructions; \
2523+
server filtered the output. Body preview: {}",
24522524
model, finish_reason, text.len(),
24532525
text.chars().take(300).collect::<String>(),
24542526
);
24552527
return Err(format!(
2456-
"Local AI returned empty response (finish_reason={}). The request \
2457-
body was {} bytes — if the model has a small context window (4-8K \
2458-
on many small models) it may have run out of tokens. Try a model \
2459-
with a larger context, or simpler prompts.",
2528+
"Local AI returned empty response (finish_reason={}). Request body \
2529+
was {} bytes — if the model has a small context window (4-8 K on \
2530+
many small models) it may have run out of tokens. Try a smaller \
2531+
model prompt or a longer-context model.",
24602532
finish_reason, body_size,
24612533
));
24622534
}
@@ -2716,6 +2788,41 @@ mod content_tool_call_tests {
27162788
assert_eq!(calls[0].1["command"], "ls");
27172789
}
27182790

2791+
/// Compact prompt for local providers must NOT carry the
2792+
/// embedded knowledge base. Reported by Gary KO4BSR 2026-05-01:
2793+
/// 143 KB request body overflowed FunctionGemma's context.
2794+
/// Pin a generous upper bound (10 KB) so future additions to
2795+
/// the prompt-shape can't silently re-inflate it past the
2796+
/// 4 K-token budget of small local models.
2797+
#[test]
2798+
fn compact_system_prompt_omits_knowledge_base() {
2799+
let compact = build_compact_system_prompt("# Server\n(test context)");
2800+
assert!(
2801+
compact.len() < 10_000,
2802+
"compact prompt is {} bytes — must stay well below the 4-8 K-token \
2803+
window of small local models. The KB inclusion was the bug.",
2804+
compact.len(),
2805+
);
2806+
// Nothing in the compact prompt should mention "knowledge
2807+
// base" or echo the trailing "Below is comprehensive
2808+
// documentation" header — that text is what introduces the
2809+
// KB block, and an empty introduction is misleading.
2810+
assert!(
2811+
!compact.contains("Below is comprehensive documentation"),
2812+
"compact prompt must not include the KB-introduction header",
2813+
);
2814+
}
2815+
2816+
#[test]
2817+
fn full_system_prompt_does_include_knowledge_base() {
2818+
// Counter-test so a future refactor that "compactifies" the
2819+
// full prompt builder (and breaks cloud-AI grounding) trips
2820+
// a test instead of silently shipping.
2821+
let full = build_system_prompt("# KB content goes here", "# Server\n(test)");
2822+
assert!(full.contains("# KB content goes here"));
2823+
assert!(full.contains("Below is comprehensive documentation"));
2824+
}
2825+
27192826
#[test]
27202827
fn arguments_null_becomes_empty_object() {
27212828
// `security_audit` is in MAIN_AI_TOOLS and takes no args —
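
The header-stripping approach in `build_compact_system_prompt` can be tried outside the codebase with a stand-in for the full builder. The body of `build_system_prompt` is not shown in this diff, so the version below is a hypothetical placeholder; only the marker text "Below is comprehensive documentation" and the `rfind` plus `trim_end` technique come from the commit.

```rust
/// Marker text that introduces the knowledge-base block (from the diff).
const KB_HEADER: &str = "Below is comprehensive documentation";

/// Hypothetical stand-in for the real builder, whose body is elided in
/// the diff. It only mimics the shape: instructions, server context,
/// then the KB header followed by the knowledge text.
fn build_system_prompt(knowledge: &str, server_context: &str) -> String {
    format!(
        "You are the WolfStack assistant.\n\n{}\n\n{} for the Wolf suite:\n{}",
        server_context, KB_HEADER, knowledge
    )
}

/// Build the full prompt with an empty KB, then cut everything from the
/// last occurrence of the KB header onward so the model is not told to
/// consult documentation that is not there.
fn build_compact_system_prompt(server_context: &str) -> String {
    let full = build_system_prompt("", server_context);
    match full.rfind(KB_HEADER) {
        Some(idx) => full[..idx].trim_end().to_string(),
        None => full,
    }
}
```

One consequence of this design is that the compact prompt depends on the exact header wording. If the header text in the full builder changes, `rfind` returns `None` and the header silently reappears, which is what the `!compact.contains(...)` assertion in the new test guards against.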

src/vms/manager.rs

Lines changed: 38 additions & 1 deletion
@@ -10,7 +10,10 @@ use tracing::{error, warn, info};
1010
use rand::Rng;
1111
use crate::containers;
1212
use crate::networking;
13-
use super::passthrough::{parse_libvirt_hostdevs, parse_proxmox_passthrough, find_conflicts};
13+
use super::passthrough::{
14+
parse_libvirt_hostdevs, parse_proxmox_passthrough,
15+
find_conflicts, check_passthrough_steals_host_net,
16+
};
1417

1518
/// A storage volume that can be attached to a VM
1619
#[derive(Serialize, Deserialize, Debug, Clone)]
@@ -1402,6 +1405,40 @@ impl VmManager {
14021405
name, conflicts.join("; ")
14031406
));
14041407
}
1408+
1409+
// Network-safety preflight: passing a NIC through to a
1410+
// guest takes it OUT of the host kernel namespace. If
1411+
// that NIC is the host's path to the network (default-
1412+
// route interface), the host loses connectivity the
1413+
// moment the VM starts — including dnsmasq for every
1414+
// client on the WolfNet/LAN bridge. Reboot is required
1415+
// because the device disappears in a way `ip link` can't
1416+
// undo without re-binding from VFIO.
1417+
//
1418+
// Reported on Discord (PapaSchlumpf 2026-05-02): HomeAssistant
1419+
// VM with passthrough NIC nuked DHCP for the whole
1420+
// network on start; reboot fixed, WolfStack restart did
1421+
// not. The kernel-state nature is the tell.
1422+
if let Some(blocking_iface) = check_passthrough_steals_host_net(target) {
1423+
return Err(format!(
1424+
"Cannot start VM '{}': its passthrough configuration would \
1425+
claim the host's primary network interface '{}'. Starting \
1426+
would disconnect the host from the network and break DHCP \
1427+
for every client on the WolfNet bridge — recovery would \
1428+
require a host reboot.\n\n\
1429+
Fixes:\n\
1430+
(a) Remove the PCI passthrough for that NIC and attach \
1431+
a virtio bridge NIC instead — the guest gets the same \
1432+
reachability without taking the host's uplink.\n\
1433+
(b) Move the host's primary connectivity to a different \
1434+
physical NIC (so the passed-through one is no longer the \
1435+
default route).\n\
1436+
(c) If you genuinely need to take that NIC and have an \
1437+
out-of-band recovery path, edit the VM's PCI passthrough \
1438+
list to confirm.",
1439+
name, blocking_iface,
1440+
));
1441+
}
14051442
}
14061443

14071444
// Look up the WolfStack-side config so we can re-arm the WolfNet
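
The preflight decision, including its fail-open behaviour, can be sketched as a pure function. This is an illustration under assumptions rather than the commit's implementation: `passthrough_steals_host_net` is a hypothetical name, and the caller is assumed to have already resolved both the host's default-route interface and the interface names the VM would claim (from `passthrough_interface` entries and from PCI BDF lookups).

```rust
/// Hypothetical, dependency-free version of the preflight decision.
///
/// `default_iface` is `None` when the host has no default route.
/// `claimed_ifaces` are the host interface names the VM's passthrough
/// configuration would take over.
fn passthrough_steals_host_net(
    default_iface: Option<&str>,
    claimed_ifaces: &[&str],
) -> Option<String> {
    // Fail open: with no default route there is no uplink to protect,
    // so the start is allowed.
    let default_iface = default_iface?;

    // Block only when a claimed interface is exactly the uplink.
    claimed_ifaces
        .iter()
        .any(|iface| *iface == default_iface)
        .then(|| default_iface.to_string())
}
```

For example, `passthrough_steals_host_net(Some("eth0"), &["eth1", "eth0"])` yields `Some("eth0")`, while passing `None` as the default interface always yields `None`.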

src/vms/passthrough.rs

Lines changed: 160 additions & 0 deletions
@@ -1169,3 +1169,163 @@ net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0
11691169
assert!(find_conflicts(&target, &[stopped]).is_empty());
11701170
}
11711171
}
1172+
1173+
// ─── Network-safety preflight ──────────────────────────────────────
1174+
//
1175+
// Reported on Discord (PapaSchlumpf 2026-05-02): an HA-OS VM with PCI
1176+
// passthrough of a NIC nuked DHCP for the entire network the moment
1177+
// it started. The cause is structural — VFIO passthrough removes the
1178+
// device from the host kernel, so any service binding to it (the
1179+
// host's default-route, dnsmasq for WolfNet clients) loses its leg
1180+
// instantly. Reboot is required because re-attaching from VFIO
1181+
// without one tends to leave the device in an unrecoverable state.
1182+
//
1183+
// `check_passthrough_steals_host_net` returns the offending interface
1184+
// name when the VM's passthrough list would claim the host's
1185+
// default-route interface, so the start path can refuse with a
1186+
// clear error before the operator nukes their own connectivity.
1187+
1188+
/// Returns `Some(iface_name)` if any of `vm`'s passthrough config
1189+
/// would steal the host's default-route interface. `None` when the
1190+
/// VM is safe to start (or when we couldn't determine the host's
1191+
/// default route — fail open rather than block legitimate starts).
1192+
pub fn check_passthrough_steals_host_net(vm: &VmConfig) -> Option<String> {
1193+
let host_default_iface = host_default_route_interface()?;
1194+
1195+
// Per-NIC `passthrough_interface` (MACVTAP-style direct bind
1196+
// to a host interface). Lives on each `extra_nics[i]`.
1197+
for nic in &vm.extra_nics {
1198+
if let Some(iface) = &nic.passthrough_interface {
1199+
if iface == &host_default_iface {
1200+
return Some(host_default_iface);
1201+
}
1202+
}
1203+
}
1204+
1205+
// PCI passthrough: walk each BDF, ask sysfs which net interface
1206+
// (if any) is backed by that device, and compare.
1207+
for dev in &vm.pci_devices {
1208+
if let Some(net_iface) = pci_bdf_to_net_iface(&dev.bdf) {
1209+
if net_iface == host_default_iface {
1210+
return Some(host_default_iface);
1211+
}
1212+
}
1213+
}
1214+
1215+
None
1216+
}
1217+
1218+
/// Read the host's IPv4 default-route interface from `/proc/net/route`.
1219+
/// Avoids shelling out to `ip` for the hot path. Returns `None` if
1220+
/// no default route exists (host might be a network-isolated lab box
1221+
/// where killing connectivity isn't fatal).
1222+
fn host_default_route_interface() -> Option<String> {
1223+
let text = std::fs::read_to_string("/proc/net/route").ok()?;
1224+
for line in text.lines().skip(1) {
1225+
let cols: Vec<&str> = line.split_whitespace().collect();
1226+
if cols.len() < 2 { continue; }
1227+
// Default route = destination 00000000. Field order on
1228+
// every Linux kernel: Iface Destination Gateway Flags ...
1229+
if cols[1] == "00000000" {
1230+
return Some(cols[0].to_string());
1231+
}
1232+
}
1233+
None
1234+
}
1235+
1236+
/// Map a PCI BDF (e.g. "0000:01:00.0" or "01:00.0") to the kernel
1237+
/// network-interface name it backs, by reading
1238+
/// `/sys/bus/pci/devices/{normalised-bdf}/net/`. Returns `None`
1239+
/// when:
1240+
/// • the BDF doesn't resolve in sysfs (device not present);
1241+
/// • the device isn't a network class (the `net/` directory
1242+
/// doesn't exist — e.g. a GPU passthrough);
1243+
/// • the device IS a NIC but is already bound to vfio-pci (in
1244+
/// which case the host doesn't currently use it, so it can't
1245+
/// be the default-route interface either — safe).
1246+
fn pci_bdf_to_net_iface(bdf: &str) -> Option<String> {
1247+
// Normalise short-form BDFs (`01:00.0`) to the full form sysfs
1248+
// uses (`0000:01:00.0`). Full-form input passes through.
1249+
let normalised = if bdf.matches(':').count() == 1 {
1250+
format!("0000:{}", bdf)
1251+
} else {
1252+
bdf.to_string()
1253+
};
1254+
let net_dir = format!("/sys/bus/pci/devices/{}/net", normalised);
1255+
let entries = std::fs::read_dir(&net_dir).ok()?;
1256+
for entry in entries.flatten() {
1257+
if let Some(name) = entry.file_name().to_str() {
1258+
return Some(name.to_string());
1259+
}
1260+
}
1261+
None
1262+
}
1263+
1264+
#[cfg(test)]
1265+
mod network_preflight_tests {
1266+
use super::*;
1267+
use super::super::manager::{NicConfig, VmConfig};
1268+
1269+
fn empty_vm() -> VmConfig {
1270+
VmConfig::new("test".to_string(), 1, 1024, 10)
1271+
}
1272+
1273+
fn nic_with_passthrough(iface: &str) -> NicConfig {
1274+
NicConfig {
1275+
model: "virtio".into(),
1276+
mac: None,
1277+
bridge: None,
1278+
passthrough_interface: Some(iface.to_string()),
1279+
}
1280+
}
1281+
1282+
#[test]
1283+
fn vm_with_no_passthrough_is_safe() {
1284+
let vm = empty_vm();
1285+
// A VM with no passthrough at all can never steal the host
1286+
// NIC, regardless of the host's actual route table.
1287+
assert!(check_passthrough_steals_host_net(&vm).is_none());
1288+
}
1289+
1290+
#[test]
1291+
fn passthrough_iface_matching_default_route_blocks() {
1292+
// We don't have a way to inject a fake default route, so
1293+
// pick whatever the host's actual default-route iface is
1294+
// (if any) and assert that a VM claiming that exact iface
1295+
// is flagged. Skips on hosts with no default route — those
1296+
// exist in CI environments without network.
1297+
let Some(host_iface) = host_default_route_interface() else { return; };
1298+
let mut vm = empty_vm();
1299+
vm.extra_nics.push(nic_with_passthrough(&host_iface));
1300+
assert_eq!(
1301+
check_passthrough_steals_host_net(&vm).as_deref(),
1302+
Some(host_iface.as_str()),
1303+
"passthrough_interface == default-route iface must be blocked",
1304+
);
1305+
}
1306+
1307+
#[test]
1308+
fn passthrough_iface_not_matching_default_route_allows() {
1309+
let mut vm = empty_vm();
1310+
// A bogus interface name that can't possibly be the host's
1311+
// default route. Counter-test pinning the safe path so a
1312+
// future false-positive bug is caught.
1313+
vm.extra_nics.push(nic_with_passthrough("does-not-exist-9999"));
1314+
assert!(check_passthrough_steals_host_net(&vm).is_none());
1315+
}
1316+
1317+
#[test]
1318+
fn pci_bdf_to_net_iface_handles_short_form() {
1319+
// We can't assert the actual return without a known PCI
1320+
// device, but we can confirm the BDF normalisation doesn't
1321+
// panic and returns None for a clearly-fake address.
1322+
assert_eq!(pci_bdf_to_net_iface("ff:ff.7"), None);
1323+
assert_eq!(pci_bdf_to_net_iface("0000:ff:ff.7"), None);
1324+
}
1325+
1326+
#[test]
1327+
fn host_default_route_interface_returns_string_or_none() {
1328+
// Smoke test — must not panic. Result depends on the host.
1329+
let _ = host_default_route_interface();
1330+
}
1331+
}
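
The tests above note that there is no way to inject a fake default route. One possible way around that, sketched here as an assumption and not part of this commit, is to split the parsing from the file read so the parser takes the route-table text as an argument. `default_route_iface_from` is a hypothetical name; the column layout (Iface, Destination, ...) and the `00000000` destination check follow the diff.

```rust
/// Hypothetical pure parser: return the interface of the first IPv4
/// default route found in text shaped like `/proc/net/route`.
fn default_route_iface_from(route_table: &str) -> Option<String> {
    route_table
        .lines()
        .skip(1) // first line is the column header
        .find_map(|line| {
            let mut cols = line.split_whitespace();
            // Rows with fewer than two columns are skipped: `?` makes
            // the closure return None for that row only.
            let iface = cols.next()?;
            let destination = cols.next()?;
            // Default route = destination 00000000 (hex, as the kernel prints it).
            (destination == "00000000").then(|| iface.to_string())
        })
}
```

A thin wrapper could then read `/proc/net/route` with `std::fs::read_to_string` and pass the text in, while tests feed fixed strings and assert on exact interface names.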
