This document covers deployment of Sovereign Map federated learning system for testnet with 5-100+ nodes.
- Architecture Overview
- Prerequisites
- Local Testing (5 Nodes)
- Staging Deployment (50 Nodes)
- Production Deployment (100+ Nodes)
- Monitoring & Verification
- Troubleshooting
- Performance Expectations
The backend runs in dual-mode:
-
Flower Aggregator (Port 8080)
- Federated learning coordination
- Stake-weighted aggregation
- Byzantine tolerance (50% fault tolerance)
- gRPC-based node communication
-
Flask Metrics API (Port 8000)
- Real-time convergence tracking
- Prometheus metrics export
- Health checks
- Performance monitoring
Each node runs a Flower client that:
- Connects to aggregator at
backend:8080 - Trains local MNIST model
- Applies differential privacy (Opacus)
- Supports Byzantine mode (inverted updates for testing)
- Reports metrics back to aggregator
┌─────────────────────────────────────────┐
│ Flower Aggregator (Backend) │
│ ✓ Port 8080: gRPC node communication │
│ ✓ Port 8000: Flask metrics API │
│ ✓ Byzantine-robust aggregation │
│ ✓ Convergence tracking │
└──────┬──────────────────────────────────┘
│
├──────────────────────┬──────────────────────┐
│ │ │
Node 1 Node 2 Node N
(Flower (Flower (Flower
Client) Client) Client)
(MNIST) (MNIST) (MNIST)
(DP+Privacy) (DP+Privacy) (DP+Privacy)
- Docker Desktop 4.0+ (with Docker Compose v2)
- 8GB+ RAM available
- 20GB+ disk space
- Linux/macOS/Windows (WSL2)
docker --version # Docker 24.0+
docker compose version # Docker Compose 2.0+For local testing: localhost only
For remote staging/prod: Network connectivity to aggregator host
cd Sovereign_Map_Federated_Learningdocker compose -f docker-compose.full.yml buildFirst-time build: ~5-10 minutes (downloads PyTorch base image) Subsequent builds: ~30 seconds (cached layers)
# Start backend + 5 node agents
docker compose -f docker-compose.full.yml up --scale node-agent=5 -d
# View logs
docker compose -f docker-compose.full.yml logs -f backend
# View node logs
docker compose -f docker-compose.full.yml logs -f node-agent# Backend health check
curl http://localhost:8000/health
# Expected: {"status": "healthy", "service": "metrics-api"}
# Check active nodes
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets'# Get convergence data in real-time
curl http://localhost:8000/convergence | jq .
# Expected output:
# {
# "rounds": [1, 2, 3, ...],
# "accuracies": [65.2, 67.8, 70.1, ...],
# "losses": [3.42, 3.05, 2.71, ...],
# "current_accuracy": 70.1,
# "current_round": 3
# }-
Grafana: http://localhost:3000 (admin/admin)
- Dashboard: "Sovereign Map - FL Monitoring"
- Watch: Accuracy convergence, loss reduction
-
Prometheus: http://localhost:9090
- Query:
sovereignmap_fl_accuracy(real-time accuracy)
- Query:
docker compose -f docker-compose.full.yml down -vCreate .env file:
cat > .env << 'EOF'
NUM_NODES=50
BYZANTINE_NODES=2
FLASK_ENV=production
PROMETHEUS_PORT=9090
GRAFANA_PORT=3000
EOFdocker compose -f docker-compose.full.yml up --scale node-agent=50 -dExpected startup time: 30-60 seconds Memory usage: ~4-6GB CPU usage: 2-4 cores
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets?state=active | jq '.data.activeTargets | length'
# Should show 50+ active targetsRun 2 Byzantine nodes:
# Start nodes with Byzantine flag
docker compose -f docker-compose.full.yml up \
--scale node-agent=48 \
-e BYZANTINE=true \
-dExpected behavior:
- System continues learning despite 2 Byzantine nodes
- Accuracy still converges (slower but stable)
- See metrics for impact analysis
Generate continuous load:
# Run 100 FL rounds continuously
for i in {1..100}; do
curl -s http://localhost:8000/convergence | jq '.current_accuracy'
sleep 30
done# Check memory usage
docker stats sovereign-backend --no-stream
# Check aggregator response time
time curl http://localhost:8000/metrics_summary
# View latency metrics
curl -s http://localhost:9090/api/v1/query?query=sovereignmap_fl_round_duration_seconds | jq '.data.result'- Dedicated server: 16GB+ RAM, 8+ CPU cores
- Fixed IP address (for node connections)
- Docker Swarm or Kubernetes (optional but recommended)
- SSL/TLS certificates (for mTLS node communication)
- Persistent volumes for metrics storage
# Create data directories for persistence
mkdir -p /var/sovereign-map/{prometheus,grafana,alertmanager}
# Set proper permissions
chmod 755 /var/sovereign-map/*Create docker-compose.prod.yml:
version: '3.9'
services:
backend:
image: ghcr.io/rwilliamspbg-ops/sovereign-map-backend:latest
environment:
- NUM_ROUNDS=1000
- MIN_FIT_CLIENTS=100
volumes:
- /var/sovereign-map/backend:/app/data
restart: always
deploy:
resources:
limits:
cpus: '4'
memory: 8G
node-agent:
image: ghcr.io/rwilliamspbg-ops/sovereign-map-backend:latest
environment:
- AGGREGATOR_HOST=backend
- AGGREGATOR_PORT=8080
deploy:
replicas: 100
resources:
limits:
cpus: '0.5'
memory: 512M
prometheus:
volumes:
- /var/sovereign-map/prometheus:/prometheus
restart: always
grafana:
volumes:
- /var/sovereign-map/grafana:/var/lib/grafana
restart: always
volumes:
prometheus-data:
driver: local
driver_opts:
type: none
o: bind
device: /var/sovereign-map/prometheusdocker compose -f docker-compose.prod.yml up --scale node-agent=100 -d
# Verify all nodes connected
sleep 60
curl http://localhost:8000/convergence | jq '.current_round'# Set up alert notifications (email/Slack)
curl -X PUT http://localhost:9093/api/v1/alerts \
-H "Content-Type: application/json" \
-d '{"groupBy": ["alertname"], "receiver": "slack"}'# Create Kubernetes deployment (if using K8s)
kubectl apply -f - << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
name: sovereign-map-node
spec:
replicas: 100
selector:
matchLabels:
app: node-agent
template:
metadata:
labels:
app: node-agent
spec:
containers:
- name: node-agent
image: ghcr.io/rwilliamspbg-ops/sovereign-map-backend:latest
env:
- name: AGGREGATOR_HOST
value: backend
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: metadata.name
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
EOF# Current FL accuracy
curl http://localhost:8000/convergence | jq '.current_accuracy'
# Current loss
curl http://localhost:8000/convergence | jq '.current_loss'
# Participants in last round
curl http://localhost:8000/convergence | jq '.rounds | length'-
Sovereign Map - FL Monitoring
- Accuracy convergence curve
- Loss over time
- Active participants
- FL round duration
-
Byzantine Tolerance Analysis
- Accuracy vs Byzantine percentage
- Impact on convergence speed
- Model robustness score
-
System Performance
- CPU/Memory usage per node
- Aggregator throughput
- Network bandwidth
# Average accuracy across all rounds
avg(sovereignmap_fl_accuracy)
# FL round duration p95
histogram_quantile(0.95, sovereignmap_fl_round_duration_seconds)
# Node participation rate
count(sovereignmap_active_nodes) / 100
# Byzantine impact (accuracy delta)
rate(sovereignmap_fl_accuracy[5m])
# Backend health
curl -s http://localhost:8000/health | jq '.status'
# Prometheus health
curl -s http://localhost:9090/-/healthy
# Grafana health
curl -s http://localhost:3000/api/health | jq '.status'
# Count connected nodes (from Prometheus)
curl -s http://localhost:9090/api/v1/targets?state=active | jq '.data.activeTargets | length'Symptom: Nodes repeatedly try to connect but fail
Solution:
# Verify backend is running
docker ps | grep sovereign-backend
# Check Flower server is listening on 8080
docker exec sovereign-backend ss -tuln | grep 8080
# Check network connectivity from node
docker exec <node-container> curl -v backend:8080Symptom: current_round stays at 0
Solution:
# Check backend logs
docker logs sovereign-backend | grep "FL Round"
# Verify min_fit_clients setting
# Edit docker-compose.yml: MIN_FIT_CLIENTS should be <= active nodes
# Restart backend
docker restart sovereign-backendSymptom: Containers killed with OOMKilled
Solution:
# Reduce node scale
docker compose -f docker-compose.full.yml up --scale node-agent=10
# Or increase system memory limits
docker update --memory 16g sovereign-backend
docker update --memory 1g node-agentSymptom: Accuracy stays flat or oscillates wildly
Solution:
# Check for too many Byzantine nodes (>50%)
curl http://localhost:8000/convergence | jq '.round_participants'
# Reduce Byzantine ratio
docker compose -f docker-compose.full.yml up --scale node-agent=100 -d
# Monitor convergence for 10+ rounds
for i in {1..10}; do
sleep 60
curl http://localhost:8000/convergence | jq '{round: .current_round, accuracy: .current_accuracy}'
doneSymptom: PyTorch dependency timeout
Solution:
# Use pre-built image (skip build)
docker pull pytorch/pytorch:2.1.0-runtime-slim
# Or build with timeout
docker compose -f docker-compose.full.yml build --no-cache --progress=plain
# Or build specific stage
docker buildx build --target builder -t sovereign-test:builder . --load| Scenario | Rounds to 80% | Rounds to 95% | Notes |
|---|---|---|---|
| 10 Honest Nodes | 5 | 15 | Baseline |
| 50 Honest Nodes | 3 | 10 | 40% faster |
| 100 Honest Nodes | 2 | 8 | 50% faster |
| 50 Nodes (5% Byzantine) | 3 | 11 | Minimal impact |
| 50 Nodes (20% Byzantine) | 5 | 15 | 50% slower |
| 50 Nodes (50% Byzantine) | 8 | 25 | 70% slower but still converges |
| Component | 5 Nodes | 50 Nodes | 100 Nodes |
|---|---|---|---|
| Backend CPU | 20% | 40% | 60-70% |
| Backend Memory | 1.5GB | 3GB | 6GB |
| Per Node CPU | 15% | 10% | 5% |
| Per Node Memory | 256MB | 200MB | 150MB |
| Aggregator Latency | <10ms | <50ms | <100ms |
- Per Round Bandwidth: ~10-50MB (depends on model size)
- FL Round Duration: 30-120 seconds (network + training)
- Node Update Rate: Every 30s (configurable)
- Load Testing: Run with 1000+ simulated nodes (distributed)
- Security Audit: Review mTLS/TPM trust chain
- Performance Optimization: Profile and optimize aggregation
- Mainnet Preparation: Generate production SSL certificates
- Documentation: Update smart contract integration docs
- Setup monitoring alerts (PagerDuty/Slack)
- Create rollback procedures
- Document incident response
- Backup strategy for metrics/models
- Node operator onboarding guide
- Governance setup (DAO voting)
- Token economics (staking/rewards)
For issues or questions:
- Check Troubleshooting section
- Review backend logs:
docker logs sovereign-backend - Check node logs:
docker logs <node-container> - Open GitHub issue: https://github.com/rwilliamspbg-ops/Sovereign_Map_Federated_Learning/issues
Version: 1.0.0
Last Updated: 2026-02-26
Status: ✅ Testnet Ready