
security: sanitize ledger error details from unauthenticated endpoints#84

Closed
rwilliamspbg-ops wants to merge 13 commits into main from feature/digital-twin-mesh-artifacts

Conversation

@rwilliamspbg-ops
Owner

@rwilliamspbg-ops rwilliamspbg-ops commented Apr 8, 2026

Unauthenticated endpoints (/health, /readyz, /api/v1/capabilities) were returning raw init_error and error strings from the ledger subsystem. These can contain SQL DSNs, driver error messages, or credentials depending on the backend configuration.

Summary

Replace raw ledger error strings with a boolean has_error flag on all open endpoints. Full error details remain available on the auth-protected /api/v1/ledger endpoint.

Before (/health response, unauthenticated):

"ledger": {
  "ready": false,
  "storage_mode": "cockroach-compatible-inmemory",
  "init_error": "failed to connect: pq: password authentication failed for user 'admin' (dsn=******db:5432/...)",
  "error": "dial tcp: connection refused"
}

After:

"ledger": {
  "ready": false,
  "storage_mode": "cockroach-compatible-inmemory",
  "has_error": true
}

Changes:

  • HealthCheck (/health): removed init_error/error, replaced with has_error bool
  • ReadinessCheck (/readyz): same
  • GetCapabilities (/api/v1/capabilities): removed ledger_state.init_error, replaced with has_error bool
  • GetLedger (/api/v1/ledger, auth-gated): unchanged — full details still exposed here
  • Applied consistent strings.TrimSpace to both error sources in has_error evaluation
  • Updated TestCockroachBackendFallbackMetadata to assert has_error == true
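The has_error evaluation described above can be sketched as a small predicate. This is a hypothetical illustration, not the actual handler code: the function name hasLedgerError and its two string arguments are assumptions standing in for h.ledgerInitError and the error returned by Readiness().

```go
package main

import (
	"fmt"
	"strings"
)

// hasLedgerError mirrors the has_error evaluation described in the PR: both
// error sources get consistent strings.TrimSpace, and open endpoints expose
// only the resulting boolean, never the underlying strings.
func hasLedgerError(initErr, runtimeErr string) bool {
	return strings.TrimSpace(initErr) != "" || strings.TrimSpace(runtimeErr) != ""
}

func main() {
	// A whitespace-only init error is not reported as an error.
	fmt.Println(hasLedgerError("   ", ""))
	// Any non-blank error from either source flips the flag.
	fmt.Println(hasLedgerError("", "dial tcp: connection refused"))
}
```

The point of trimming both sources is that a backend returning a whitespace-only string does not spuriously mark the service degraded.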

Validation

go test ./internal/api/...

Evidence

N/A — no runtime behavior change for authenticated callers; open endpoints lose error string fields only.

Checklist

  • CI workflows pass on this branch
  • Security checks pass (including CodeQL) — 0 alerts
  • make smoke passes locally
  • make screenshots-check passes locally (or N/A for non-runtime changes) — N/A
  • Documentation updated for behavior/config changes
  • No secrets or credentials committed
  • Any .github/workflows/*.yml uses: refs are pinned to 40-char commit SHAs
  • If deployment logic changed: deploy.yml staging and production paths validated
  • If frontend assets changed: CDN publish/invalidation plan verified
  • If Phase 3D changed: dataset/device/multi-GPU config compatibility validated

Risk and Rollback

  • Risk level: low
  • Rollback plan: revert two commits (e094c08, d2b686e); no schema or config changes
  • Operational blast radius: clients parsing init_error/error fields from open endpoints will see those keys absent; has_error bool added in their place

rwilliamspbg-ops and others added 9 commits April 8, 2026 00:15
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ryan <221235059+rwilliamspbg-ops@users.noreply.github.com>
Agent-Logs-Url: https://github.com/rwilliamspbg-ops/Sovereign_Map_Federated_Learning/sessions/04ecac5d-df42-4026-b6bb-e83b7b28abad

Co-authored-by: rwilliamspbg-ops <221235059+rwilliamspbg-ops@users.noreply.github.com>
…mesh-artifacts

# Conflicts:
#	scripts/check_dashboard_queries.py
Contributor

Copilot AI left a comment


Pull request overview

This PR adds “digital twin mesh” audit artifacts and observability upgrades, including a hash-chained proof ledger (with optional Cockroach/Postgres SQL storage), a new Grafana audit dashboard, and supporting scripts/CI tweaks.

Changes:

  • Introduce hash-chained ledger entries with per-stream sequencing, idempotency/replay detection, checkpoints, and a /api/v1/ledger/reconcile endpoint; add optional SQL-backed ledger storage.
  • Add a “Sovereign Audit Gold Standard” Grafana dashboard plus improved PromQL query validation.
  • Add a continuous digital-twin mesh traffic generator script and commit a bundle of captured audit artifacts; switch some GitHub workflows to install a minimal requirements-ci.txt.
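The hash-chaining idea behind the ledger entries above can be sketched as follows. This is an assumed illustration of the general technique: the chainHash helper and the hash-input encoding are hypothetical, and the real scheme in internal/api/ledger.go may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chainHash illustrates hash chaining: each entry's hash covers its payload
// plus the previous entry's hash, so tampering with any retained entry breaks
// every later link in the stream.
func chainHash(prevHash, payload string) string {
	sum := sha256.Sum256([]byte(prevHash + "|" + payload))
	return hex.EncodeToString(sum[:])
}

func main() {
	h1 := chainHash("", `{"event":"offer","seq":1}`)   // genesis entry: empty prev_hash
	h2 := chainHash(h1, `{"event":"attest","seq":2}`) // links back to h1
	fmt.Println(h1 != h2, len(h2) == 64)
}
```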

Reviewed changes

Copilot reviewed 26 out of 29 changed files in this pull request and generated 8 comments.

  • scripts/run_continuous_digital_twin_mesh.sh: Adds a looped traffic generator that emits offers/attestations/TPM events and polls key endpoints.
  • scripts/provision-grafana-dashboards.js: Registers a new “audit” dashboard definition for script-based provisioning/deploy.
  • scripts/check_dashboard_queries.py: Improves PromQL parsing to avoid misclassifying group labels as metrics.
  • requirements-ci.txt: Introduces a minimal CI Python dependency set (currently only NumPy).
  • internal/api/ledger.go: Expands the in-memory ledger to include stream sequencing, hash chaining, checkpoints, reconcile reporting, and an interface for pluggable backends.
  • internal/api/ledger_sql.go: Adds a Cockroach/Postgres-compatible SQL ledger backend with schema bootstrap, record, query, checkpoint, and readiness logic.
  • internal/api/handlers.go: Switches handler to ledger interface, adds readiness endpoints, idempotency header support, ledger reconcile route, and richer health/readiness metadata.
  • internal/api/handlers_test.go: Updates capability contract expectations and adds tests for replay/idempotency, reconcile, readiness, and fallback metadata.
  • grafana/provisioning/dashboards/audit_overview.json: Adds a new audit-focused Grafana dashboard JSON (uid sovereign-audit-gold-standard).
  • go.mod: Adds github.com/lib/pq for SQL ledger connectivity.
  • go.sum: Updates sums for github.com/lib/pq and removes older unused sums.
  • audit_results/digital_twin_continuous_20260408T001440Z/trust_snapshot_three_node.json: Captured trust snapshot evidence from the continuous mesh run.
  • audit_results/digital_twin_continuous_20260408T001440Z/train_status_final.json: Captured final training status evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/train_start.json: Captured training start response evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/README.md: Documents the contents of the captured artifact bundle.
  • audit_results/digital_twin_continuous_20260408T001440Z/ops_events_three_node.json: Captured recent ops/events evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/offers_after_three_node.json: Captured marketplace offers snapshot evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/offer_lidar.json: Captured lidar offer response evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/offer_image.json: Captured image offer response evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/offer_gps.json: Captured GPS offer response evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/continuous_digital_twin_mesh.log: Captured runtime log from the continuous generator.
  • audit_results/digital_twin_continuous_20260408T001440Z/attestations_after_three_node.json: Captured attestations feed snapshot evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/attest_lidar.json: Captured lidar attestation response evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/attest_image.json: Captured image attestation response evidence.
  • audit_results/digital_twin_continuous_20260408T001440Z/attest_gps.json: Captured GPS attestation response evidence.
  • .gitignore: Ignores /bin/ artifacts.
  • .github/workflows/fedavg-benchmark-compare.yml: Switches workflow to install dependencies from requirements-ci.txt.
  • .github/workflows/deploy.yml: Switches workflow to install dependencies from requirements-ci.txt.
  • .github/workflows/build.yml: Switches workflow to install dependencies from requirements-ci.txt.


Comment thread internal/api/handlers.go
Comment on lines +355 to +370
ledgerReady, ledgerErr := h.ledger.Readiness()
status := "healthy"
if !ledgerReady || strings.TrimSpace(h.ledgerInitError) != "" {
	status = "degraded"
}

-response := map[string]string{
-	"status":  "healthy",
+response := map[string]interface{}{
+	"status":  status,
 	"service": "sovereign-map-fl",
 	"time":    time.Now().UTC().Format(time.RFC3339),
 	"ledger": map[string]interface{}{
 		"ready":        ledgerReady,
 		"storage_mode": h.ledger.StorageMode(),
 		"init_error":   h.ledgerInitError,
 		"error":        ledgerErr,
 	},

Copilot AI Apr 8, 2026


/health, /readyz, and /api/v1/capabilities are unauthenticated endpoints, but the response now includes ledger.init_error and ledger.error. Those error strings can include internal connection details (and potentially credentials, depending on driver/DSN errors). Consider omitting these fields on open endpoints, or returning a sanitized/high-level status code/message and exposing details only on an auth-protected debug endpoint/logs.

Owner Author


@copilot apply changes based on this feedback


Contributor


Applied in commits e094c08 and d2b686e. The init_error and error string fields have been removed from /health, /readyz, and /api/v1/capabilities. These unauthenticated endpoints now expose only a boolean has_error field (using consistent strings.TrimSpace on both error sources) to indicate whether a ledger error exists without revealing connection details or credentials. Full error details remain on the auth-protected /api/v1/ledger endpoint.

Owner Author


@copilot apply changes based on this feedback

Comment thread internal/api/ledger.go
Comment on lines +251 to +255
if i > 0 {
wantSeq := entries[i-1].SeqNo + 1
if entries[i].SeqNo != wantSeq {
streamIssues = append(streamIssues, fmt.Sprintf("sequence gap at stream=%s seq=%d expected=%d", streamID, entries[i].SeqNo, wantSeq))
}

Copilot AI Apr 8, 2026


Reconcile() currently treats any per-stream sequence gap or prev_hash mismatch between retained entries as corruption. With the global ring buffer eviction policy, it’s expected that older entries (including an entry’s immediate predecessor) can be evicted while later entries remain, which will reliably produce false sequence gap / prev_hash mismatch issues. Consider changing reconcile semantics to either (a) only verify contiguous segments (skip continuity checks when the previous seq/hash isn’t present), or (b) store per-stream ring buffers so retained entries stay contiguous per stream.

Owner Author


@copilot apply changes based on this feedback

Contributor


Applied in commit 9b05d0c. reconcileEntries now skips continuity checks for non-contiguous entries — sequence gaps from ring buffer eviction are no longer flagged as issues. Only contiguous adjacent entries have their prev_hash chain verified. A gaps counter is added to each stream's summary for observability. A new test TestReconcileSkipsGapsBetweenNonContiguousEntries validates this directly by calling reconcileEntries with a gapped entry set.
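The gap-tolerant reconcile described in this reply can be sketched as follows. The entry shape and the verifyChain helper are assumptions for illustration; the real reconcileEntries in internal/api/ledger.go carries more state (stream summaries, checkpoints).

```go
package main

import "fmt"

// entry is a minimal stand-in for the ledger entry shape implied by the
// review thread; real field names in internal/api/ledger.go may differ.
type entry struct {
	SeqNo    int
	Hash     string
	PrevHash string
}

// verifyChain checks prev_hash links only between contiguous entries. A
// sequence gap (expected after ring-buffer eviction) is counted for the
// stream summary but is not reported as a corruption issue.
func verifyChain(entries []entry) (issues []string, gaps int) {
	for i := 1; i < len(entries); i++ {
		if entries[i].SeqNo != entries[i-1].SeqNo+1 {
			gaps++ // non-contiguous: skip the continuity check entirely
			continue
		}
		if entries[i].PrevHash != entries[i-1].Hash {
			issues = append(issues, fmt.Sprintf("prev_hash mismatch at seq=%d", entries[i].SeqNo))
		}
	}
	return issues, gaps
}

func main() {
	entries := []entry{
		{SeqNo: 1, Hash: "h1"},
		{SeqNo: 2, Hash: "h2", PrevHash: "h1"},
		{SeqNo: 5, Hash: "h5", PrevHash: "h4"}, // eviction gap: counted, not flagged
	}
	issues, gaps := verifyChain(entries)
	fmt.Println(len(issues), gaps)
}
```

Within a contiguous run the chain is still fully verified, so real tampering inside a retained segment is caught even though evicted predecessors are tolerated.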

Comment on lines +162 to +166
seqNo, prevHash, err := loadStreamHeadForUpdate(tx, streamID)
if err != nil {
return LedgerEntry{}, false
}
nextSeq := seqNo + 1

Copilot AI Apr 8, 2026


Concurrency issue: when a stream has no row in mohawk_ledger_stream_heads, loadStreamHeadForUpdate returns (0, "", nil) and no lock is taken. Two concurrent transactions can both compute nextSeq=1 and race; one will hit the UNIQUE(stream_id, seq_no) constraint and the event will be dropped (currently with no retry and no surfaced error). Consider creating/locking the head row first (e.g., INSERT ... ON CONFLICT DO NOTHING then SELECT ... FOR UPDATE), and/or using SERIALIZABLE + retry-on-restart/unique-violation handling so concurrent inserts don’t lose events.

Owner Author


@copilot apply changes based on this feedback


Contributor


Applied in commit 9b05d0c. loadStreamHeadForUpdate now does INSERT INTO mohawk_ledger_stream_heads (stream_id, seq_no, entry_hash) VALUES ($1, 0, '') ON CONFLICT (stream_id) DO NOTHING before the SELECT ... FOR UPDATE. This ensures the row exists and is locked before either concurrent transaction computes nextSeq, eliminating the lost-update race on first write for a new stream.
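The upsert-then-lock pattern from this reply can be sketched like this. The SQL statements follow the table and column names quoted in the thread (mohawk_ledger_stream_heads, stream_id, seq_no, entry_hash), but the function body is an assumed reconstruction, not the actual code in internal/api/ledger_sql.go.

```go
package main

import (
	"database/sql"
	"fmt"
)

const (
	// Ensure a head row exists for the stream; a no-op if one already does.
	ensureHeadSQL = `INSERT INTO mohawk_ledger_stream_heads (stream_id, seq_no, entry_hash)
VALUES ($1, 0, '') ON CONFLICT (stream_id) DO NOTHING`
	// Lock the (now guaranteed-present) head row for the rest of the transaction.
	lockHeadSQL = `SELECT seq_no, entry_hash FROM mohawk_ledger_stream_heads
WHERE stream_id = $1 FOR UPDATE`
)

// loadStreamHeadForUpdate guarantees the head row exists before taking the
// row lock, so two concurrent first writes to a new stream serialize on the
// lock instead of both computing nextSeq=1 and racing into the
// UNIQUE(stream_id, seq_no) constraint.
func loadStreamHeadForUpdate(tx *sql.Tx, streamID string) (seqNo int64, prevHash string, err error) {
	if _, err = tx.Exec(ensureHeadSQL, streamID); err != nil {
		return 0, "", err
	}
	err = tx.QueryRow(lockHeadSQL, streamID).Scan(&seqNo, &prevHash)
	return seqNo, prevHash, err
}

func main() {
	// No database here; the critical property is just the statement order:
	// upsert the head row first, then SELECT ... FOR UPDATE.
	fmt.Println("ensure-then-lock")
}
```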

Comment on lines +144 to +147
tx, err := l.db.Begin()
if err != nil {
return LedgerEntry{}, false
}

Copilot AI Apr 8, 2026


RecordWithOptions returns an empty entry on SQL failures (Begin/query/insert/commit) and drops the underlying error. Even if the interface can’t return an error, this makes failures silent and hard to debug/alert on. Consider at least emitting a log/metric and/or storing the last error string in the ledger instance for /api/ledger (auth-gated) inspection.

Owner Author


@copilot apply changes based on this feedback

Contributor


Applied in commit 9b05d0c. SQL write failures in RecordWithOptions are now logged via log.Printf and stored with a timestamp via recordWriteErr(). A LastError() string method was added to both SQLProofLedger and the ProofLedgerStore interface, and the auth-gated /api/v1/ledger response now includes a last_write_error field for operator inspection.

Comment thread internal/api/ledger_sql.go
Comment thread requirements-ci.txt Outdated
Comment thread scripts/provision-grafana-dashboards.js
Comment on lines +76 to +84
},
audit: {
uid: 'sovereign-audit-gold-standard',
title: 'Sovereign Audit Gold Standard',
tags: ['audit', 'ledger', 'verification', 'trust', 'sovereign-map'],
description: 'Ledger integrity, verification quality, replay defense, and control-plane traceability',
panels: 13,
rows: 4
}

Copilot AI Apr 8, 2026


This script now defines an audit dashboard with uid sovereign-audit-gold-standard, but the repo also provisions a static dashboard JSON with the same uid (grafana/provisioning/dashboards/audit_overview.json). Using both approaches (file provisioning + script deploy/generate) will cause uid collisions and unintended overwrites. Consider either (a) having this script deploy the checked-in JSON dashboard, or (b) using a distinct uid/name here to avoid clobbering the provisioned dashboard.

Owner Author


@copilot apply changes based on this feedback

Contributor


Applied in commit 9b05d0c. The audit entry in DASHBOARDS now has a sourcePath pointing to grafana/provisioning/dashboards/audit_overview.json. generateDashboardJSON reads and wraps the checked-in JSON when sourcePath is set, so the script deploys the authoritative dashboard instead of generating a conflicting copy with the same uid. A descriptive error with the dashboard key and path is thrown if the file cannot be read.

Comment thread scripts/run_continuous_digital_twin_mesh.sh
Comment on lines +43 to +46
local offer_list
offer_list="$(curl -sS "${BASE_URL}/marketplace/offers?seller_node_id=${seller}&limit=10")"
local existing_id
existing_id="$(jq -r '.offers[0].offer_id // empty' <<<"${offer_list}")"

Copilot AI Apr 8, 2026


ensure_offer depends on jq but the script doesn’t validate that jq is installed (unlike scripts/participant_bootstrap.sh). With set -euo pipefail, a missing jq will terminate the loop. Consider adding a small preflight check for required tools (curl, jq) near startup with a clear error message.

Owner Author


@copilot apply changes based on this feedback

Contributor


Applied in commit 9b05d0c. Added preflight checks for curl and jq using command -v immediately after variable initialization (matching the style of participant_bootstrap.sh). The script exits with a descriptive error message on stderr if either tool is missing, before entering the traffic-generation loop.

rwilliamspbg-ops and others added 4 commits April 7, 2026 18:44
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ryan <221235059+rwilliamspbg-ops@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ryan <221235059+rwilliamspbg-ops@users.noreply.github.com>
Remove init_error and error string fields from /health, /readyz, and
/api/v1/capabilities responses. These unauthenticated endpoints now
expose only a boolean has_error flag instead of potentially sensitive
connection strings or credentials. Full error details remain available
on the auth-protected /api/v1/ledger endpoint. Update the affected test
to check for has_error instead of init_error.

Agent-Logs-Url: https://github.com/rwilliamspbg-ops/Sovereign_Map_Federated_Learning/sessions/cde2dbe3-b27e-4ab1-81a1-1104463f7185

Co-authored-by: rwilliamspbg-ops <221235059+rwilliamspbg-ops@users.noreply.github.com>