security: sanitize ledger error details from unauthenticated endpoints #84

rwilliamspbg-ops wants to merge 13 commits into main
Conversation
Pull request overview
This PR adds “digital twin mesh” audit artifacts and observability upgrades, including a hash-chained proof ledger (with optional Cockroach/Postgres SQL storage), a new Grafana audit dashboard, and supporting scripts/CI tweaks.
Changes:
- Introduce hash-chained ledger entries with per-stream sequencing, idempotency/replay detection, checkpoints, and a `/api/v1/ledger/reconcile` endpoint; add optional SQL-backed ledger storage.
- Add a "Sovereign Audit Gold Standard" Grafana dashboard plus improved PromQL query validation.
- Add a continuous digital-twin mesh traffic generator script and commit a bundle of captured audit artifacts; switch some GitHub workflows to install a minimal `requirements-ci.txt`.
Reviewed changes
Copilot reviewed 26 out of 29 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| scripts/run_continuous_digital_twin_mesh.sh | Adds a looped traffic generator that emits offers/attestations/TPM events and polls key endpoints. |
| scripts/provision-grafana-dashboards.js | Registers a new “audit” dashboard definition for script-based provisioning/deploy. |
| scripts/check_dashboard_queries.py | Improves PromQL parsing to avoid misclassifying group labels as metrics. |
| requirements-ci.txt | Introduces a minimal CI Python dependency set (currently only NumPy). |
| internal/api/ledger.go | Expands the in-memory ledger to include stream sequencing, hash chaining, checkpoints, reconcile reporting, and an interface for pluggable backends. |
| internal/api/ledger_sql.go | Adds a Cockroach/Postgres-compatible SQL ledger backend with schema bootstrap, record, query, checkpoint, and readiness logic. |
| internal/api/handlers.go | Switches handler to ledger interface, adds readiness endpoints, idempotency header support, ledger reconcile route, and richer health/readiness metadata. |
| internal/api/handlers_test.go | Updates capability contract expectations and adds tests for replay/idempotency, reconcile, readiness, and fallback metadata. |
| grafana/provisioning/dashboards/audit_overview.json | Adds a new audit-focused Grafana dashboard JSON (uid sovereign-audit-gold-standard). |
| go.mod | Adds github.com/lib/pq for SQL ledger connectivity. |
| go.sum | Updates sums for github.com/lib/pq and removes older unused sums. |
| audit_results/digital_twin_continuous_20260408T001440Z/trust_snapshot_three_node.json | Captured trust snapshot evidence from the continuous mesh run. |
| audit_results/digital_twin_continuous_20260408T001440Z/train_status_final.json | Captured final training status evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/train_start.json | Captured training start response evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/README.md | Documents the contents of the captured artifact bundle. |
| audit_results/digital_twin_continuous_20260408T001440Z/ops_events_three_node.json | Captured recent ops/events evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/offers_after_three_node.json | Captured marketplace offers snapshot evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/offer_lidar.json | Captured lidar offer response evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/offer_image.json | Captured image offer response evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/offer_gps.json | Captured GPS offer response evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/continuous_digital_twin_mesh.log | Captured runtime log from the continuous generator. |
| audit_results/digital_twin_continuous_20260408T001440Z/attestations_after_three_node.json | Captured attestations feed snapshot evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/attest_lidar.json | Captured lidar attestation response evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/attest_image.json | Captured image attestation response evidence. |
| audit_results/digital_twin_continuous_20260408T001440Z/attest_gps.json | Captured GPS attestation response evidence. |
| .gitignore | Ignores /bin/ artifacts. |
| .github/workflows/fedavg-benchmark-compare.yml | Switches workflow to install dependencies from requirements-ci.txt. |
| .github/workflows/deploy.yml | Switches workflow to install dependencies from requirements-ci.txt. |
| .github/workflows/build.yml | Switches workflow to install dependencies from requirements-ci.txt. |
```diff
+	ledgerReady, ledgerErr := h.ledger.Readiness()
+	status := "healthy"
+	if !ledgerReady || strings.TrimSpace(h.ledgerInitError) != "" {
+		status = "degraded"
+	}
+
-	response := map[string]string{
-		"status": "healthy",
+	response := map[string]interface{}{
+		"status":  status,
		"service": "sovereign-map-fl",
		"time":    time.Now().UTC().Format(time.RFC3339),
+		"ledger": map[string]interface{}{
+			"ready":        ledgerReady,
+			"storage_mode": h.ledger.StorageMode(),
+			"init_error":   h.ledgerInitError,
+			"error":        ledgerErr,
+		},
```
/health, /readyz, and /api/v1/capabilities are unauthenticated endpoints, but the response now includes ledger.init_error and ledger.error. Those error strings can include internal connection details (and potentially credentials, depending on driver/DSN errors). Consider omitting these fields on open endpoints, or returning a sanitized/high-level status code/message and exposing details only on an auth-protected debug endpoint/logs.
Applied in commits e094c08 and d2b686e. The init_error and error string fields have been removed from /health, /readyz, and /api/v1/capabilities. These unauthenticated endpoints now expose only a boolean has_error field (using consistent strings.TrimSpace on both error sources) to indicate whether a ledger error exists without revealing connection details or credentials. Full error details remain on the auth-protected /api/v1/ledger endpoint.
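The sanitized response shape described above can be sketched as follows. The function name `sanitizedLedgerStatus` and its parameters are illustrative, not the PR's actual code; only the `has_error`, `ready`, and `storage_mode` field names and the `strings.TrimSpace` check come from the thread.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizedLedgerStatus builds the ledger block returned on unauthenticated
// endpoints. Only a boolean has_error is exposed; the raw error strings stay
// on the auth-gated /api/v1/ledger endpoint. (Illustrative sketch.)
func sanitizedLedgerStatus(ready bool, storageMode, initErr, lastErr string) map[string]interface{} {
	// TrimSpace both sources so whitespace-only errors don't flip the flag.
	hasError := strings.TrimSpace(initErr) != "" || strings.TrimSpace(lastErr) != ""
	return map[string]interface{}{
		"ready":        ready,
		"storage_mode": storageMode,
		"has_error":    hasError, // never the error text itself
	}
}

func main() {
	// A DSN-style failure is reduced to a boolean on the open endpoint.
	fmt.Println(sanitizedLedgerStatus(false, "sql", "pq: password authentication failed", ""))
}
```

Callers that need the full error text must go through the authenticated ledger endpoint instead.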
```go
if i > 0 {
	wantSeq := entries[i-1].SeqNo + 1
	if entries[i].SeqNo != wantSeq {
		streamIssues = append(streamIssues, fmt.Sprintf("sequence gap at stream=%s seq=%d expected=%d", streamID, entries[i].SeqNo, wantSeq))
	}
}
```
Reconcile() currently treats any per-stream sequence gap or prev_hash mismatch between retained entries as corruption. With the global ring buffer eviction policy, it’s expected that older entries (including an entry’s immediate predecessor) can be evicted while later entries remain, which will reliably produce false sequence gap / prev_hash mismatch issues. Consider changing reconcile semantics to either (a) only verify contiguous segments (skip continuity checks when the previous seq/hash isn’t present), or (b) store per-stream ring buffers so retained entries stay contiguous per stream.
Applied in commit 9b05d0c. reconcileEntries now skips continuity checks for non-contiguous entries — sequence gaps from ring buffer eviction are no longer flagged as issues. Only contiguous adjacent entries have their prev_hash chain verified. A gaps counter is added to each stream's summary for observability. A new test TestReconcileSkipsGapsBetweenNonContiguousEntries validates this directly by calling reconcileEntries with a gapped entry set.
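A minimal sketch of the gap-tolerant semantics described above: continuity is only checked between adjacent retained entries with contiguous sequence numbers, and eviction gaps are counted rather than flagged. The name `reconcileStream` and its signature are hypothetical (the PR's function is `reconcileEntries`).

```go
package main

import "fmt"

// LedgerEntry is a simplified stand-in for the PR's entry type.
type LedgerEntry struct {
	SeqNo    int64
	PrevHash string
	Hash     string
}

// reconcileStream verifies the prev_hash chain only across contiguous
// neighbors; a missing predecessor (evicted by the ring buffer) increments
// gaps instead of producing a false corruption report.
func reconcileStream(entries []LedgerEntry) (issues []string, gaps int) {
	for i := 1; i < len(entries); i++ {
		if entries[i].SeqNo != entries[i-1].SeqNo+1 {
			gaps++ // predecessor evicted: skip the continuity check
			continue
		}
		if entries[i].PrevHash != entries[i-1].Hash {
			issues = append(issues, fmt.Sprintf("prev_hash mismatch at seq=%d", entries[i].SeqNo))
		}
	}
	return issues, gaps
}

func main() {
	retained := []LedgerEntry{
		{SeqNo: 1, Hash: "a"},
		{SeqNo: 2, PrevHash: "a", Hash: "b"},
		{SeqNo: 7, PrevHash: "evicted", Hash: "c"}, // seqs 3-6 evicted
		{SeqNo: 8, PrevHash: "c", Hash: "d"},
	}
	issues, gaps := reconcileStream(retained)
	fmt.Printf("issues=%d gaps=%d\n", len(issues), gaps) // prints: issues=0 gaps=1
}
```

The eviction gap between seq 2 and seq 7 is reported in the `gaps` counter for observability but does not appear as a corruption issue.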
```go
seqNo, prevHash, err := loadStreamHeadForUpdate(tx, streamID)
if err != nil {
	return LedgerEntry{}, false
}
nextSeq := seqNo + 1
```
Concurrency issue: when a stream has no row in mohawk_ledger_stream_heads, loadStreamHeadForUpdate returns (0, "", nil) and no lock is taken. Two concurrent transactions can both compute nextSeq=1 and race; one will hit the UNIQUE(stream_id, seq_no) constraint and the event will be dropped (currently with no retry and no surfaced error). Consider creating/locking the head row first (e.g., INSERT ... ON CONFLICT DO NOTHING then SELECT ... FOR UPDATE), and/or using SERIALIZABLE + retry-on-restart/unique-violation handling so concurrent inserts don’t lose events.
Applied in commit 9b05d0c. loadStreamHeadForUpdate now does INSERT INTO mohawk_ledger_stream_heads (stream_id, seq_no, entry_hash) VALUES ($1, 0, '') ON CONFLICT (stream_id) DO NOTHING before the SELECT ... FOR UPDATE. This ensures the row exists and is locked before either concurrent transaction computes nextSeq, eliminating the lost-update race on first write for a new stream.
```go
tx, err := l.db.Begin()
if err != nil {
	return LedgerEntry{}, false
}
```
RecordWithOptions returns an empty entry on SQL failures (Begin/query/insert/commit) and drops the underlying error. Even if the interface can’t return an error, this makes failures silent and hard to debug/alert on. Consider at least emitting a log/metric and/or storing the last error string in the ledger instance for /api/ledger (auth-gated) inspection.
Applied in commit 9b05d0c. SQL write failures in RecordWithOptions are now logged via log.Printf and stored with a timestamp via recordWriteErr(). A LastError() string method was added to both SQLProofLedger and the ProofLedgerStore interface, and the auth-gated /api/v1/ledger response now includes a last_write_error field for operator inspection.
```js
	},
	audit: {
		uid: 'sovereign-audit-gold-standard',
		title: 'Sovereign Audit Gold Standard',
		tags: ['audit', 'ledger', 'verification', 'trust', 'sovereign-map'],
		description: 'Ledger integrity, verification quality, replay defense, and control-plane traceability',
		panels: 13,
		rows: 4
	}
```
This script now defines an audit dashboard with uid sovereign-audit-gold-standard, but the repo also provisions a static dashboard JSON with the same uid (grafana/provisioning/dashboards/audit_overview.json). Using both approaches (file provisioning + script deploy/generate) will cause uid collisions and unintended overwrites. Consider either (a) having this script deploy the checked-in JSON dashboard, or (b) using a distinct uid/name here to avoid clobbering the provisioned dashboard.
Applied in commit 9b05d0c. The audit entry in DASHBOARDS now has a sourcePath pointing to grafana/provisioning/dashboards/audit_overview.json. generateDashboardJSON reads and wraps the checked-in JSON when sourcePath is set, so the script deploys the authoritative dashboard instead of generating a conflicting copy with the same uid. A descriptive error with the dashboard key and path is thrown if the file cannot be read.
```sh
local offer_list
offer_list="$(curl -sS "${BASE_URL}/marketplace/offers?seller_node_id=${seller}&limit=10")"
local existing_id
existing_id="$(jq -r '.offers[0].offer_id // empty' <<<"${offer_list}")"
```
ensure_offer depends on jq but the script doesn’t validate that jq is installed (unlike scripts/participant_bootstrap.sh). With set -euo pipefail, a missing jq will terminate the loop. Consider adding a small preflight check for required tools (curl, jq) near startup with a clear error message.
Applied in commit 9b05d0c. Added preflight checks for curl and jq using command -v immediately after variable initialization (matching the style of participant_bootstrap.sh). The script exits with a descriptive error message on stderr if either tool is missing, before entering the traffic-generation loop.
Unauthenticated endpoints (`/health`, `/readyz`, `/api/v1/capabilities`) were returning raw `init_error` and `error` strings from the ledger subsystem. These can contain SQL DSNs, driver error messages, or credentials depending on the backend configuration.

Summary

Replace the opaque error strings with a boolean `has_error` flag on all open endpoints. Full error details remain available on the auth-protected `/api/v1/ledger` endpoint.

Before (`/health` response, unauthenticated):

After:
Changes:
- `HealthCheck` (`/health`): removed `init_error`/`error`, replaced with a `has_error` bool
- `ReadinessCheck` (`/readyz`): same
- `GetCapabilities` (`/api/v1/capabilities`): removed `ledger_state.init_error`, replaced with a `has_error` bool
- `GetLedger` (`/api/v1/ledger`, auth-gated): unchanged; full details are still exposed here
- Applied `strings.TrimSpace` to both error sources in the `has_error` evaluation
- Updated `TestCockroachBackendFallbackMetadata` to assert `has_error == true`

Validation

`go test ./internal/api/...`

Evidence
N/A — no runtime behavior change for authenticated callers; open endpoints lose error string fields only.
Checklist
- `make smoke` passes locally
- `make screenshots-check` passes locally (or N/A for non-runtime changes): N/A
- `.github/workflows/*.yml` `uses:` refs are pinned to 40-char commit SHAs
- `deploy.yml` staging and production paths validated

Risk and Rollback
- Rollback is a revert of commits `e094c08` and `d2b686e`; no schema or config changes
- Clients parsing the `init_error`/`error` fields from open endpoints will see those keys absent; a `has_error` bool is added in their place