
Commit fc119f5
Author: Paul C
v22.7.0: Predictive Ops — unified Inbox, 9 analyzers, scanner convergence, Apps drawer
Major feature drop. New `src/predictive/` module (7,600 lines, 157 tests) gives WolfStack a single forward-looking ops pipeline, replacing the duplicated threshold logic that lived in `alerting.rs`, `collect_issues`, and `security.rs`.

Headline features

- 🔮 Predictive Inbox — new top-nav surface aggregating findings from every reachable cluster peer. Snooze / dismiss / ack-as-intentional per finding. First-appearance dispatch to the existing Discord/Slack/Telegram/email channels (once per finding, not on every refresh). Cluster-aware filtering and grouping; multi-cluster operators get a per-cluster filter and grouped headers.
- ⊞ Apps & Tools drawer — top icon strip cut from 11 icons to 4 (Datacenter, Issues, Inbox, Apps). Everything else (App Store, Global View, 3D Room, WolfFlow, WolfAgents, Cluster Browser, Databases, Control Panel, plugins) lives in a slide-out panel with icon + name + description per tile.

Analyzers shipped

- Host disk-fill (linear-fit ETA per mount, sketched after this list; sparse df with a kill-on-drop timeout so a hung NFS mount can't stall the orchestrator)
- Docker + LXC container storage-fill (runtime-specific finding types so acks scope precisely; container-ID rotation guard keyed by a deterministic FNV-1a hash, sketched at the end of this message, so a container restart doesn't fit a regression through two different containers)
- Docker container restart-loop (delta on RestartCount; tier bumped by one when state == "restarting")
- Host thresholds — CPU / memory / disk-free / swap / load / failed systemd units (replaces collect_issues + alerting::check_thresholds)
- Container memory pressure (replaces alerting::check_container_thresholds)
- Certificate expiry (Let's Encrypt + /etc/wolfstack/tls/; severity by days remaining, already-expired = Critical)
- Backup freshness (per enabled schedule; tier scales with how many intervals the backup has been missed by)
- VM disk-fill (qcow2 sparse-vs-allocated)
- Security posture — first analyzer to consume the unified NetworkReachability classifier; service_bound_publicly, sshd_password_auth, and sshd_root_login findings with a severity matrix: PublicInternet > LocalNetwork > OverlayOnly > LoopbackOnly
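The linear-fit ETA referenced in the host disk-fill item is, at its core, ordinary least squares over (timestamp, bytes-used) samples, extrapolated to the mount's capacity. A minimal sketch, assuming hypothetical Sample and eta_to_full names rather than the module's real types:

// Hypothetical sketch only: names and types are illustrative.
struct Sample {
    t_secs: f64,      // sample time, seconds since epoch
    used_bytes: f64,  // bytes used on the mount at that time
}

/// Estimated seconds until the mount fills, or None when usage is
/// flat/shrinking or there are too few samples to fit a line.
fn eta_to_full(samples: &[Sample], capacity_bytes: f64) -> Option<f64> {
    if samples.len() < 2 {
        return None; // a line needs at least two points
    }
    let t0 = samples[0].t_secs; // center timestamps for numerical stability
    let n = samples.len() as f64;
    let (mut st, mut su, mut stt, mut stu) = (0.0_f64, 0.0, 0.0, 0.0);
    for s in samples {
        let t = s.t_secs - t0;
        st += t;
        su += s.used_bytes;
        stt += t * t;
        stu += t * s.used_bytes;
    }
    // Ordinary least squares: slope in bytes/sec, intercept in bytes.
    let denom = n * stt - st * st;
    if denom.abs() < f64::EPSILON {
        return None; // all samples at effectively the same instant
    }
    let slope = (n * stu - st * su) / denom;
    if slope <= 0.0 {
        return None; // flat or freeing space: no fill ETA to report
    }
    let intercept = (su - slope * st) / n;
    // Solve intercept + slope * t == capacity for t, offset from "now".
    let t_full = (capacity_bytes - intercept) / slope;
    let eta_secs = t_full - (samples.last()?.t_secs - t0);
    (eta_secs > 0.0).then_some(eta_secs)
}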
Scanner convergence

- alerting::check_thresholds and check_container_thresholds dispatch retired in cached_status_bg (the old loop becomes a no-op short circuit with a comment pointing at predictive::notify). The notify channels themselves stay — predictive's first-appearance dispatch fires them.
- security.rs posture scans (service-on-public, sshd config) are superseded by security_posture.rs, which uses the unified reachability classifier; security.rs's active-attack scans (brute-force, miners, /tmp binaries, suspicious outbound) stay for now, since they're event detection at a different cadence.

Discipline guards (every analyzer)

- Snapshot-clone discipline in the orchestrator — read locks held only long enough to clone; no analyzer holds a read lock during computation (see the orchestrator sketch at the end of this message)
- Auto-resolve: covered-but-not-emitted scopes flip to Approved/ConditionCleared so disk-freed-itself proposals don't ghost in the inbox; a data-source-down tick (empty covered set) auto-resolves nothing
- Per-tick concurrent sampling via tokio::join! with kill-on-drop child-process timeouts
- Three independent suppression layers (ack store, current proposal status, analyzer-internal thresholds), each tested separately

Cluster aggregation

- GET /api/proposals/cluster fans out to peers via the existing build_node_urls + X-WolfStack-Secret; a 30s in-memory cache is invalidated on every state-change endpoint
- Per-peer fetch failures surface as a yellow warning ("K of N responded — Inbox may be incomplete") with a cluster-grouped unreachable list — never silent gaps
- Per-task (id, hostname, cluster) identity is preserved through panics, so the operator gets actionable peer attribution even when a task panics

XSS hardening

- showToast() rebuilt to use replaceChildren + textContent for the message; an audit confirmed zero callers across app.js + wolfrouter.js pass intentional HTML, so 1000+ call sites are now hardened with no regression. A peer responding `{"error": "<script>..."}` can no longer escape into the operator's DOM.

Tests: 157 unit tests across 17 predictive modules. The independent code-reviewer agent passed items 1-2 plus the reviewer-fix delta, with all BLOCKER/MAJOR findings addressed (DefaultHasher non-determinism, panicked-task identity, three-read-locks stall, stale pending proposals, AckScope serde-tag clash).

Plan complete: every analyzer item from the original roadmap is live. Future iterations cover AI-sourced proposals using the same Inbox surface, OneClick remediation handlers (today's plans are all Manual), and per-tenant RBAC.
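On the FNV-1a choice flagged in the reviewer findings: std's DefaultHasher is randomly seeded per process, so any series key derived from it changes across restarts, while FNV-1a is fully deterministic. The 64-bit algorithm is small enough to show whole; the series_key shape below is a hypothetical illustration, not the analyzer's actual key:

const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Deterministic FNV-1a over arbitrary bytes: same input, same output,
/// in every process, on every run (unlike DefaultHasher's random seed).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= b as u64;                    // 1a variant: XOR first...
        hash = hash.wrapping_mul(FNV_PRIME); // ...then multiply
    }
    hash
}

/// Hypothetical rotation-guard key: hashing the container *name* keeps
/// the time series stable when a restart rotates the container ID.
fn series_key(hostname: &str, container_name: &str) -> u64 {
    fnv1a_64(format!("{hostname}/{container_name}").as_bytes())
}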
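And the snapshot-clone discipline plus tokio::join! sampling from the guards list, as a sketch; every type and sampler here is an illustrative stand-in, not the orchestrator's real API:

use std::sync::{Arc, RwLock};

#[derive(Clone, Default)]
struct History; // stand-in for the per-mount / per-container time series

async fn tick(history: Arc<RwLock<History>>) {
    // Hold the read lock only long enough to clone a snapshot.
    let snapshot: History = {
        let guard = history.read().expect("history lock poisoned");
        guard.clone()
    }; // guard drops here; analysis never blocks writers

    // Sample independent data sources concurrently. Each sampler is
    // assumed to wrap its child process in a kill-on-drop timeout so a
    // hung command can't stall the tick.
    let (disks, containers) = tokio::join!(sample_disks(), sample_containers());

    // Analyzers compute on the owned clone, never on the shared state.
    analyze(&snapshot, &disks, &containers);
}

async fn sample_disks() -> Vec<String> { Vec::new() }      // stand-in
async fn sample_containers() -> Vec<String> { Vec::new() } // stand-in
fn analyze(_h: &History, _d: &[String], _c: &[String]) {}  // stand-in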
1 parent e00dfda commit fc119f5

23 files changed

Lines changed: 9065 additions & 57 deletions

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "wolfstack"
-version = "22.6.15"
+version = "22.7.0"
 edition = "2024"
 authors = ["Wolf Software Systems Ltd"]
 description = "Server management platform for the Wolf software suite"

src/api/mod.rs

Lines changed: 502 additions & 1 deletion
Large diffs are not rendered by default.
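The 502 lines added here include the GET /api/proposals/cluster aggregation endpoint described in the commit message. Since the diff isn't rendered, here is a hedged sketch of the shape it describes (30-second in-memory cache, concurrent peer fan-out, K-of-N accounting for the incomplete-Inbox warning); Proposal, fetch_peer, and cluster_proposals are assumed names, and the real handler authenticates via build_node_urls + X-WolfStack-Secret:

use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Clone)]
struct Proposal; // stand-in for the real finding type

struct CachedAggregate {
    fetched_at: Instant,
    proposals: Vec<Proposal>,
    responded: usize, // the "K" in the "K of N responded" banner
    total: usize,     // the "N"
}

const CACHE_TTL: Duration = Duration::from_secs(30);

async fn cluster_proposals(
    cache: &Mutex<Option<CachedAggregate>>,
    peer_urls: &[String],
) -> (Vec<Proposal>, usize, usize) {
    // Serve from cache while fresh; state-change endpoints clear it.
    if let Some(c) = cache.lock().unwrap().as_ref() {
        if c.fetched_at.elapsed() < CACHE_TTL {
            return (c.proposals.clone(), c.responded, c.total);
        }
    }

    // Fan out to every peer concurrently; a failed peer lowers the
    // responded count instead of silently vanishing from the Inbox.
    let handles: Vec<_> = peer_urls
        .iter()
        .cloned()
        .map(|url| tokio::spawn(async move { fetch_peer(&url).await }))
        .collect();

    let (mut all, mut responded) = (Vec::new(), 0usize);
    for h in handles {
        // A panicked or failed task still leaves `total` intact, so the
        // handler can render "K of N responded".
        if let Ok(Ok(mut p)) = h.await {
            responded += 1;
            all.append(&mut p);
        }
    }

    *cache.lock().unwrap() = Some(CachedAggregate {
        fetched_at: Instant::now(),
        proposals: all.clone(),
        responded,
        total: peer_urls.len(),
    });
    (all, responded, peer_urls.len())
}

async fn fetch_peer(_url: &str) -> Result<Vec<Proposal>, ()> {
    Ok(Vec::new()) // stand-in for the authenticated HTTP fetch
}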

src/containers/mod.rs

Lines changed: 17 additions & 0 deletions
@@ -2274,6 +2274,12 @@ pub struct ContainerInfo {
     pub mac_address: String,
     #[serde(default, skip_serializing_if = "String::is_empty")]
     pub network_name: String,
+    /// Cumulative restart count reported by the runtime. Docker's
+    /// `State.RestartCount` from inspect; populated for Docker
+    /// containers. Always `None` for LXC (whose container-internal
+    /// init handles its own restart accounting).
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub restart_count: Option<u64>,
 }
 
 #[derive(Debug, Clone, Serialize, Deserialize)]
@@ -2611,6 +2617,7 @@ fn docker_list(all: bool) -> Vec<ContainerInfo> {
             gateway: container_gateway,
             mac_address: container_mac,
             network_name: net_name,
+            restart_count: Some(fields.restart_count),
         }
     })
     .collect()
@@ -2627,6 +2634,11 @@ struct DockerInspectFields {
     network_macs: String,
     merged_dir: String,
     restart_policy: String,
+    /// Cumulative number of times Docker has restarted this
+    /// container since creation (`State.RestartCount` from inspect).
+    /// The predictive restart-loop analyzer reads the delta of this
+    /// across ticks to detect crash-loops.
+    restart_count: u64,
 }
 
 /// Run ONE `docker inspect <id1> <id2> ...` and parse the resulting JSON
@@ -2703,6 +2715,9 @@ fn docker_batched_inspect(ids: &[String])
         if let Some(s) = entry.pointer("/HostConfig/RestartPolicy/Name").and_then(|v| v.as_str()) {
             fields.restart_policy = s.to_string();
         }
+        if let Some(n) = entry.pointer("/State/RestartCount").and_then(|v| v.as_u64()) {
+            fields.restart_count = n;
+        }
         map.insert(id, fields);
     }
     map
@@ -3626,6 +3641,7 @@ pub fn lxc_list_all() -> Vec<ContainerInfo> {
             gateway: lxc_gateway,
             mac_address: lxc_mac,
             network_name: lxc_link,
+            restart_count: None, // LXC: see ContainerInfo::restart_count doc
         });
     }
 }
@@ -3938,6 +3954,7 @@ fn pct_list_all() -> Vec<ContainerInfo> {
             gateway: pve_gateway,
             mac_address: pve_mac,
             network_name: pve_bridge,
+            restart_count: None, // PVE-LXC: see ContainerInfo::restart_count doc
         }
     })
 }).collect();
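For context on how the new restart_count field is consumed: the commit message describes the restart-loop analyzer as a delta on RestartCount, with the tier bumped by one when the container state is "restarting". A sketch of that check, with assumed tier names and thresholds (the real analyzer's cut-offs aren't shown in this commit page):

#[derive(Debug, PartialEq)]
enum Tier { Info, Warning, High, Critical } // assumed tier ladder

fn restart_loop_tier(prev_count: u64, curr_count: u64, state: &str) -> Option<Tier> {
    // Counter went backwards => the container was recreated; the
    // rotation guard resets the series rather than alerting.
    let delta = curr_count.checked_sub(prev_count)?;

    let mut tier = match delta {
        0 => return None,      // no restarts since the last tick
        1..=2 => Tier::Info,   // assumed thresholds, not WolfStack's
        3..=5 => Tier::Warning,
        _ => Tier::High,
    };
    // An actively "restarting" state means the loop is live right now.
    if state == "restarting" {
        tier = match tier {
            Tier::Info => Tier::Warning,
            Tier::Warning => Tier::High,
            _ => Tier::Critical,
        };
    }
    Some(tier)
}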

src/main.rs

Lines changed: 82 additions & 5 deletions
@@ -28,6 +28,7 @@ mod proxmox;
 mod mysql_editor;
 mod appstore;
 mod alerting;
+mod predictive;
 mod wolfrun;
 mod statuspage;
 mod ceph;
@@ -387,9 +388,26 @@ async fn main() -> std::io::Result<()> {
     // Initialize Status Page monitoring state
     let statuspage_state = Arc::new(statuspage::StatusPageState::new());
 
+    // Predictive ops — load proposals/acks/history from disk so a
+    // restart doesn't blind the analyzer for 24 hours. Acks get
+    // an immediate prune of any expired entries to keep the file
+    // bounded over years of operator use.
+    let predictive_proposals = Arc::new(std::sync::RwLock::new(
+        predictive::ProposalStore::load(),
+    ));
+    let predictive_acks = Arc::new(std::sync::RwLock::new({
+        let mut a = predictive::AckStore::load();
+        a.prune_expired();
+        a
+    }));
+    let predictive_metrics = Arc::new(std::sync::RwLock::new(
+        predictive::MetricsHistory::load(),
+    ));
+
     // Create app state
+    let monitor_arc = Arc::new(Mutex::new(mon));
     let app_state = web::Data::new(api::AppState {
-        monitor: Mutex::new(mon),
+        monitor: monitor_arc.clone(),
         metrics_history: Mutex::new(monitoring::MetricsHistory::new()),
         cluster: cluster.clone(),
         sessions: sessions.clone(),
@@ -413,8 +431,29 @@ async fn main() -> std::io::Result<()> {
         image_watcher_cache: Arc::new(std::sync::RwLock::new(std::collections::HashMap::new())),
         integrations: Arc::new(crate::integrations::IntegrationState::new(&cluster_secret)),
         router: Arc::new(crate::networking::router::RouterState::new()),
+        predictive_proposals: predictive_proposals.clone(),
+        predictive_acks: predictive_acks.clone(),
+        predictive_metrics: predictive_metrics.clone(),
+        predictive_cluster_cache: Arc::new(std::sync::Mutex::new(None)),
+        node_id: node_id.clone(),
     });
 
+    // Predictive ops orchestrator — 5-min loop that samples
+    // disks, records into history, runs analyzers, and upserts
+    // proposals into the inbox. Ack/snooze/dismiss are honoured
+    // before any proposal is materialised. Threshold + first-
+    // appearance notification dispatch landed in convergence
+    // A+B — orchestrator now reads SystemMetrics off the shared
+    // monitor and fires alerting channels on Critical/High.
+    {
+        let p = predictive_proposals.clone();
+        let a = predictive_acks.clone();
+        let m = predictive_metrics.clone();
+        let mon = monitor_arc.clone();
+        let n = node_id.clone();
+        tokio::spawn(predictive::orchestrator::run_loop(p, a, m, mon, n));
+    }
+
     // Start the WolfRouter safe-mode watcher — auto-reverts firewall
     // changes if the user doesn't confirm within the safe-mode window.
     crate::networking::router::spawn_rollback_watcher(app_state.router.clone());
@@ -1555,8 +1594,32 @@ a{color:#dc2626;text-decoration:none;}a:hover{text-decoration:underline;}
 
     let display_name = if node.hostname.is_empty() { &node.address } else { &node.hostname };
 
-    // Check thresholds
-    let triggered = alerting::check_thresholds(&config, cpu_pct, mem_pct, disk_pct);
+    // Check thresholds.
+    //
+    // Convergence B (the predictive ops pipeline) now owns
+    // first-appearance threshold dispatch via
+    // `predictive::notify::find_first_appearance_alerts` +
+    // `dispatch_alerts`, fired from each tick of
+    // `predictive::orchestrator`. That layer:
+    //   • Has a unified Severity tier with snooze/dismiss/ack
+    //     semantics instead of the old cooldown HashMap
+    //   • Auto-resolves on `ConditionCleared` so the recovery
+    //     branch below isn't needed for thresholds it covers
+    //   • Surfaces in the Predictive Inbox alongside trend-based
+    //     findings (disk-fill ETA, container restart-loops, etc.)
+    //
+    // We keep this `triggered` binding *only* so the recovery-
+    // notification branch downstream still executes on legacy
+    // signals — it's harmless when `triggered` is empty. The
+    // primary alert-fire loop below sees zero entries and
+    // becomes a no-op.
+    //
+    // Per-node remote-peer dispatch: each cluster node runs its
+    // own predictive orchestrator; remote peers' findings are
+    // surfaced via `/api/proposals/cluster` aggregation in the
+    // Inbox UI.
+    let _ = (cpu_pct, mem_pct, disk_pct, &config); // signal: kept for the recovery branch
+    let triggered: Vec<alerting::ThresholdAlert> = Vec::new();
 
     for alert in &triggered {
         if !alerting::is_in_cooldown(&cooldowns, &node.id, &alert.alert_type) {
@@ -1731,8 +1794,22 @@ a{color:#dc2626;text-decoration:none;}a:hover{text-decoration:underline;}
     let docker_stats = tokio::task::spawn_blocking(|| containers::docker_stats()).await.unwrap_or_default();
     let lxc_stats = tokio::task::spawn_blocking(|| containers::lxc_stats()).await.unwrap_or_default();
 
-    let docker_alerts = alerting::check_container_thresholds(&config, &docker_stats, "docker");
-    let lxc_alerts = alerting::check_container_thresholds(&config, &lxc_stats, "lxc");
+    // Container memory threshold dispatch — RETIRED.
+    //
+    // Predictive item 5 (`predictive::container_memory`) is the
+    // canonical source for per-container memory findings. It uses
+    // the same `containers::*_stats_cached()` data this loop did,
+    // but routes through the unified Inbox with snooze/dismiss/ack
+    // semantics instead of the legacy cooldown HashMap. The
+    // first-appearance dispatch in `predictive::notify` fires the
+    // Discord/Slack/Telegram/email channels with stable severity
+    // and per-finding dedup.
+    //
+    // Keep these `_stats` bindings — they're consumed by the
+    // top-N renderer below, which is unrelated to thresholds.
+    let _ = (&docker_stats, &lxc_stats, &config);
+    let docker_alerts: Vec<alerting::ContainerAlert> = Vec::new();
+    let lxc_alerts: Vec<alerting::ContainerAlert> = Vec::new();
 
     let all_container_alerts: Vec<_> = docker_alerts.into_iter().chain(lxc_alerts.into_iter()).collect();
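Both retired dispatch sites above defer to the first-appearance semantics in predictive::notify: a channel fires the first time a finding appears and stays silent on later refreshes until the finding resolves. A minimal sketch of that dedup under assumed names (per the commit message, the real pipeline layers this with the ack store and current proposal status):

use std::collections::HashSet;

#[derive(Default)]
struct NotifyState {
    already_dispatched: HashSet<String>, // finding IDs we've announced
}

impl NotifyState {
    /// Returns the findings that should fire channels this tick: only
    /// those never dispatched before.
    fn first_appearances<'a>(&mut self, current: &'a [String]) -> Vec<&'a str> {
        let fresh: Vec<&'a str> = current
            .iter()
            .filter(|id| !self.already_dispatched.contains(id.as_str()))
            .map(String::as_str)
            .collect();
        for id in &fresh {
            self.already_dispatched.insert((*id).to_string());
        }
        // Forget IDs that no longer exist, so a cleared finding that
        // recurs later counts as a fresh first appearance.
        let live: HashSet<&str> = current.iter().map(String::as_str).collect();
        self.already_dispatched.retain(|id| live.contains(id.as_str()));
        fresh
    }
}

Calling first_appearances with the current finding IDs on every tick yields exactly the once-per-finding behaviour described above: the first tick returns the new IDs, every later tick returns nothing for them.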
