Troubleshooting¶
Common Issues¶
srSILO OOM During Preprocessing¶
Symptoms: Pipeline fails during SILO preprocessing phase with out-of-memory errors.
Solutions:
- Lower srsilo_chunk_size in group_vars:
    srsilo_virus_config:
      covid:
        chunk_size: 500000  # Reduce from 1000000
- Increase srsilo_docker_memory_limit if RAM available:
    srsilo_virus_config:
      covid:
        docker_memory_limit: 400g
- Use test_vars.yml for resource-constrained environments:
    ansible-playbook playbooks/srsilo/update-pipeline.yml -i inventory.ini \
      -e "@playbooks/srsilo/vars/test_vars.yml"
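Before raising the Docker memory limit, it can help to confirm how much memory the host actually has free. The following is a minimal sketch, assuming a Linux host that exposes /proc/meminfo. The function name mem_available_gib is hypothetical and not part of the srSILO tooling.

```shell
# Print MemAvailable in whole GiB, read from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument only exists to ease testing.
mem_available_gib() {
  local meminfo="${1:-/proc/meminfo}"
  # MemAvailable is reported in kB; 1 GiB = 1048576 kB.
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }' "$meminfo"
}

# Usage on a live host:
#   mem_available_gib
```

If the printed value is well below the configured docker_memory_limit, lowering the chunk size is likely the safer of the two options above.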
API Won't Start¶
Symptoms: LAPIS/SILO containers fail to start or immediately exit.
Diagnosis:
# Check Docker logs (use actual container name for your virus)
# COVID: wise-sarsCoV2-lapis / wise-sarsCoV2-silo
# RSV-A: wise-rsva-lapis / wise-rsva-silo
docker logs wise-sarsCoV2-lapis
# Verify index exists (replace <virus> with covid, rsva, etc.)
ls -la /opt/srsilo/<virus>/output/<timestamp>/
# Check permissions
ls -ld /opt/srsilo/<virus>/output
Common causes:
- Missing or corrupted index
- Permission issues on output directory
- Port already in use
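To rule out the "port already in use" cause, the listening sockets on the host can be inspected. This is a rough sketch, assuming the ss utility from iproute2 is installed. The helper name port_listed is hypothetical, and 8083 is used only because it appears in the connectivity checks on this page.

```shell
# Succeeds (exit 0) if `ss -ltn`-style output on stdin shows a listener on the given port.
port_listed() {
  local port="$1"
  # Column 4 of `ss -ltn` is the local address, e.g. 0.0.0.0:8083 or [::]:8083.
  awk -v p="$port" '$4 ~ (":" p "$") { found = 1 } END { exit !found }'
}

# Usage on a live host:
#   ss -ltn | port_listed 8083 && echo "port 8083 is already in use"
```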
Timer Not Running¶
Symptoms: Automated pipeline runs aren't happening.
Diagnosis:
# Check status
systemctl status srsilo-update.timer
# View schedule
systemctl list-timers
# Check service logs
journalctl -u srsilo-update.service -n 100
Solutions:
- Enable the timer:
    sudo systemctl enable srsilo-update.timer
    sudo systemctl start srsilo-update.timer
- Re-run setup:
    ansible-playbook playbooks/srsilo/setup-timer.yml -i inventory.ini
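After enabling the timer or re-running setup, a quick check can confirm that the unit is both enabled (starts at boot) and active (currently scheduled). This is a small sketch using the standard systemctl is-enabled and is-active subcommands. The helper name unit_state is hypothetical.

```shell
# Report whether a systemd unit is enabled and active on one line.
unit_state() {
  local unit="$1" enabled active
  # Both subcommands exit non-zero for disabled/inactive units, so ignore the status
  # and keep only the printed state.
  enabled="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
  active="$(systemctl is-active "$unit" 2>/dev/null || true)"
  echo "$unit enabled=${enabled:-unknown} active=${active:-unknown}"
}

# Usage on a live host:
#   unit_state srsilo-update.timer
```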
Containers Not Auto-Restarting After Reboot¶
Symptoms: SILO/LAPIS down after server reboot.
Note: The default restart: unless-stopped policy does restart containers after reboot, unless they were manually stopped. If containers aren't restarting, check:
- Docker daemon status:
    systemctl status docker
- Container was manually stopped before reboot
- Container crashed during startup (check logs)
To change restart policy: Edit the Ansible template at roles/srsilo/templates/docker-compose.yml.j2 and re-run the playbook. Direct edits to /opt/srsilo/<virus>/config/docker-compose.yml will be overwritten.
See 2025-11-20 Service Outage for detailed analysis.
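To see which restart policy Docker has actually recorded for each container, docker inspect can be queried directly. The sketch below assumes the container names listed earlier on this page. The helper name check_restart_policies is hypothetical.

```shell
# Print the recorded restart policy for each named container and flag any
# that differ from the expected unless-stopped default.
check_restart_policies() {
  local name policy
  for name in "$@"; do
    policy="$(docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$name")"
    if [ "$policy" = "unless-stopped" ]; then
      echo "ok   $name ($policy)"
    else
      echo "WARN $name ($policy)"
    fi
  done
}

# Usage on a live host:
#   check_restart_policies wise-sarsCoV2-lapis wise-sarsCoV2-silo
```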
Pipeline Stuck / Orphaned State¶
Symptoms: Pipeline refuses to run, reports "preprocessing in progress".
Diagnosis:
# Check for orphan marker (replace <virus> with covid, rsva, etc.)
ls -la /opt/srsilo/<virus>/output/.preprocessing_in_progress
Solution:
# Clean failed run artifacts for a specific virus
sudo rm -rf /opt/srsilo/<virus>/sorted_chunks/*
sudo rm -rf /opt/srsilo/<virus>/tmp/*
sudo rm /opt/srsilo/<virus>/output/.preprocessing_in_progress
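The cleanup steps above can be wrapped in a guarded helper that only acts when the orphan marker is present. This is a sketch, not part of the srSILO tooling: the function name clean_orphaned_run is hypothetical, it assumes you have already confirmed that no pipeline run is active, and it must run with enough privileges to modify the directories (for example as root). Unlike rm -rf dir/*, it also removes hidden files inside the scratch directories.

```shell
# Remove leftovers of a failed run under one virus directory, but only if the
# orphan marker exists. The argument is the virus root, e.g. /opt/srsilo/<virus>.
clean_orphaned_run() {
  local root="$1" dir
  local marker="$root/output/.preprocessing_in_progress"
  if [ ! -e "$marker" ]; then
    echo "no marker at $marker; nothing to clean"
    return 0
  fi
  # Empty the scratch directories but keep the directories themselves.
  for dir in "$root/sorted_chunks" "$root/tmp"; do
    if [ -d "$dir" ]; then
      find "$dir" -mindepth 1 -delete
    fi
  done
  rm -f "$marker"
  echo "cleaned $root"
}
```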
Diagnostic Commands¶
Check All Services¶
# Docker containers
docker ps -a | grep -E 'silo|lapis'
# Kubernetes pods (Loculus)
kubectl get pods -A
# Systemd timers
systemctl list-timers --all | grep srsilo
Resource Usage¶
# Memory usage
free -h
# Disk usage
df -h /opt/srsilo
# Docker resource usage
docker stats --no-stream
Network Connectivity¶
# API endpoints
curl -s http://localhost:8083/sample/info | head -3
curl -s http://localhost:8084/sample/info | head -3
# External connectivity
curl -s https://lapis.wasap.genspectrum.org/covid/sample/info | head -3
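The endpoint checks above can be combined into one pass that reports an HTTP status per URL instead of printing response bodies. This is a minimal sketch using curl's --write-out variable http_code. The helper names http_status and check_endpoints are hypothetical, and the 10-second timeout is an arbitrary choice.

```shell
# Print the HTTP status code for a URL; curl prints 000 if the connection fails.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# Report each endpoint as ok (HTTP 200) or FAIL (anything else).
check_endpoints() {
  local url code
  for url in "$@"; do
    code="$(http_status "$url")"
    if [ "$code" = "200" ]; then
      echo "ok   $url"
    else
      echo "FAIL $url (HTTP $code)"
    fi
  done
}

# Usage on a live host:
#   check_endpoints http://localhost:8083/sample/info http://localhost:8084/sample/info
```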