Incident Report: Service Outage After System Upgrade
Date: November 20, 2025
Server: WisePulse Production (Hetzner)
Duration: ~10-15 minutes
Severity: Medium
Summary
After a routine apt upgrade and reboot, SILO/LAPIS and V-Pipe Scout failed to auto-restart. Loculus (k3d) recovered automatically.
Root cause: Operator error. The operator did not anticipate that upgrading `containerd.io` (1.7.27→2.1.5) would restart the Docker daemon and stop all containers. Combined with the `restart: unless-stopped` policy on SILO/LAPIS (which does not restart containers recorded as manually stopped) and `restart: no` on V-Pipe Scout, this prevented auto-recovery after the reboot.
Timeline
| Time | Event |
|---|---|
| 15:23 | apt upgrade started (44 packages including Docker/containerd) |
| 15:26 | Upgrade completed |
| 15:27 | Server rebooted (new kernel 6.8.0-88) |
| 15:31 | Server back online (boot time: 3m 43s) |
| 15:31 | Loculus ✅ auto-started, SILO/LAPIS ❌ down, V-Pipe Scout ❌ down |
| 15:40 | SILO/LAPIS manually restarted |
| 15:41 | V-Pipe Scout manually restarted |
Root Cause
Operator error: `apt upgrade` was run without first checking which packages would be upgraded or what impact they would have.
What Happened
- Unexpected Docker restart:
  - `containerd.io` major version upgrade (1.7.27→2.1.5) triggered a Docker daemon restart
  - Docker stopped all containers gracefully during the restart
  - Containers were marked "manually stopped" by Docker's shutdown logic
- Reboot occurred (for new kernel):
  - `restart: unless-stopped` policy on SILO/LAPIS skipped the "manually stopped" containers
  - `restart: no` policy on V-Pipe Scout never auto-restarts anyway
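The policy behavior described above can be condensed into a small decision function. This is only a sketch of the behavior as this report describes it, not Docker's actual implementation; the function name `starts_after_daemon_restart` and its `stopped` argument are hypothetical, and policies not discussed in this report (such as `on-failure`) are deliberately left as `unknown`.

```shell
# Sketch only: models the restart-policy behavior described in this report.
# Usage: starts_after_daemon_restart <policy> <stopped-before-restart: yes|no>
starts_after_daemon_restart() {
  local policy="$1" stopped="$2"
  case "$policy" in
    always)
      # Started again whenever the Docker daemon starts
      echo yes ;;
    unless-stopped)
      # Skipped if the container was recorded as stopped beforehand
      if [ "$stopped" = yes ]; then echo no; else echo yes; fi ;;
    no)
      # Never started automatically
      echo no ;;
    *)
      # Policies not covered by this report
      echo unknown ;;
  esac
}
```

For this incident, `starts_after_daemon_restart unless-stopped yes` corresponds to SILO/LAPIS and `starts_after_daemon_restart no yes` to V-Pipe Scout; both print `no`.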
Why This Wasn't Obvious
- `restart: unless-stopped` seems like it should mean "restart after reboot", but it doesn't
- Package upgrades don't always show which services will restart
- No staging environment to test upgrade impact
Lessons Learned
- Check package list before upgrading: `apt list --upgradable` to spot Docker/containerd
- Expect Docker restarts when upgrading containerd, docker-ce, or docker-compose-plugin
- Use `restart: always` for production services that must survive reboots/restarts
- Test upgrades in staging before production
Key Packages Upgraded
- containerd.io: 1.7.27 → 2.1.5 (major upgrade, caused Docker restart)
- docker-ce: 5:28.3.2 → 5:29.0.2
- openssh-server: Security update (SSH config conflict handled correctly)
- linux-image-generic: 6.8.0-87 → 6.8.0-88 (new kernel, required reboot)
- Total: 44 packages
Notes
- ✅ SSH config preserved correctly (password auth still disabled)
- ✅ Boot time 3m43s is normal (2m41s firmware init)
- ✅ No data loss
- ✅ All services recovered successfully
Preventive Measures
Before Next Upgrade
Pre-upgrade checklist:
# 1. Check what will be upgraded
apt list --upgradable | grep -E 'docker|containerd|systemd|linux-image'
# 2. If Docker/containerd is upgrading:
# - Expect Docker daemon restart → all containers stop
# - Plan for manual service restart after upgrade
# - OR fix restart policies first (see below)
# 3. Timing considerations
# - Off-peak hours
# - Announce maintenance window to users
# - Have rollback plan
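Step 1 of the checklist could be turned into a reusable filter. The following is a minimal sketch, assuming the usual `name/suite version arch [upgradable from: old]` line format of `apt list --upgradable`; the function name `risky_upgrades` and the exact package prefixes are illustrative choices based on the packages named in this report.

```shell
# Sketch: flag upgradable packages that this report associates with a
# Docker daemon restart or a required reboot.
# Reads `apt list --upgradable` output on stdin and prints matching package names.
risky_upgrades() {
  # The package name is everything before the first "/" on each line.
  # Header lines such as "Listing..." do not match and are skipped.
  awk -F/ '/^(containerd\.io|docker-ce|docker-compose-plugin|linux-image)/ { print $1 }'
}
```

A possible invocation is `apt list --upgradable 2>/dev/null | risky_upgrades`; empty output would suggest that none of the flagged packages are pending.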
Action Items
🔴 High Priority (1 week)
- Fix SILO/LAPIS restart policy
    # /opt/WisePulse/roles/srsilo/templates/docker-compose.yml.j2
    services:
      lapisOpen:
        restart: always # Change from unless-stopped
      silo:
        restart: always # Change from unless-stopped
- Fix V-Pipe Scout restart policy
    # /opt/v-pipe-scout/docker-compose.yml
    services:
      redis:
        restart: always # Add
      streamlit:
        restart: always # Add
      worker:
        restart: always # Add
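Editing the compose files only takes effect once the containers are recreated. As a possible stopgap, `docker update --restart=always` can change the policy of containers that already exist. The sketch below only prints the commands for review rather than running them; the helper name `print_restart_fix` is hypothetical, and the compose files would still need the change above to make it permanent.

```shell
# Sketch: print (not run) the commands that would switch existing containers
# to restart=always. Container names or IDs are passed as arguments.
print_restart_fix() {
  local name
  for name in "$@"; do
    printf 'docker update --restart=always %s\n' "$name"
  done
}
```

For example, `print_restart_fix $(docker ps -aq)` would list one command per container on the host, which can then be reviewed and run selectively.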
🟡 Medium Priority (1 month)
- Migrate V-Pipe Scout to Ansible - Consistent management with other services
- Audit all Docker restart policies - Run:
docker inspect --format '{{.Name}}: {{.HostConfig.RestartPolicy.Name}}' $(docker ps -aq)
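To make the audit output easier to act on, it could be filtered down to the containers that would not come back on their own. This is a minimal sketch that assumes the `name: policy` line format produced by the `docker inspect` command above; the function name `non_always_policies` is hypothetical.

```shell
# Sketch: read "name: policy" lines on stdin and print the containers
# whose restart policy is anything other than "always".
non_always_policies() {
  awk -F': ' 'NF >= 2 && $2 != "always" {
    sub(/^\//, "", $1)        # docker inspect prefixes names with "/"
    print $1 " (" $2 ")"
  }'
}
```

It can be used by piping the audit command into it, for example `docker inspect --format '{{.Name}}: {{.HostConfig.RestartPolicy.Name}}' $(docker ps -aq) | non_always_policies`.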
🟢 Low Priority (3 months)
- Implement staging environment - Test upgrades before production
- Add service monitoring - Prometheus/Grafana alerting for failures
- Document update procedures - Create `/opt/WisePulse/docs/SERVER_MAINTENANCE.md`
Quick Reference
Manual restart commands:
# SILO/LAPIS
cd /opt/srsilo/tools && LAPIS_PORT=8083 docker compose up -d
# V-Pipe Scout
cd /opt/v-pipe-scout && docker compose up -d
# Check status
kubectl get pods -A # Loculus
docker ps -a # All containers
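After a reboot, the status check could be reduced to a list of what is missing. The sketch below compares running container names against an expected list; the function name `missing_containers` is hypothetical, and the expected names must be supplied by the operator because this report does not list the actual container names.

```shell
# Sketch: compare running container names (stdin, one per line, e.g. from
# `docker ps --format '{{.Names}}'`) against expected names given as arguments.
# Prints each missing name and returns 1 if any are missing.
missing_containers() {
  local running expected missing=0
  running="$(cat)"
  for expected in "$@"; do
    # -x: match the whole line, -F: fixed string, -q: no output
    if ! printf '%s\n' "$running" | grep -qxF -- "$expected"; then
      echo "$expected"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible invocation is `docker ps --format '{{.Names}}' | missing_containers NAME1 NAME2`, where `NAME1` and `NAME2` are placeholders for the real container names.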