Incident & On-Call Playbook
Standard operating procedure for outages, escalations, and postmortems. Owner: Ops Captain; Deputies: Server Pilot and Player Support Lead.
Escalation Tree
- Tier 0 (Alert Trigger) – Prometheus/Alertmanager, Bot Monitoring, Community Reports.
- Tier 1 (On-Call Engineer) – Server Pilot (rotation). Response time < 10 min.
- Tier 2 (Captain Escalation) – Ops Captain + backup (Growth/Community Captains when members are impacted).
- Tier 3 (Executive Council) – Only for data loss, security incidents, or reputational risk.
Channel Matrix:
- #ops-alerts (Discord) – Automated alerting + running thread.
- #council-war-room – For major incidents, cross-division sync.
- Status Page (TODO) – Public updates.
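The escalation tree above could be expressed as a small decision function. This is only a sketch: the `Incident` fields and the Tier 2 triggers (member impact, or no acknowledgement within the 10-minute Tier 1 response window) are assumptions made for illustration; the playbook itself only states the Tier 3 conditions explicitly.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All field names are hypothetical; adapt them to the real ticket schema.
    member_impact: bool = False
    data_loss: bool = False
    security: bool = False
    reputational_risk: bool = False
    minutes_unacknowledged: int = 0


def escalation_tier(incident: Incident) -> int:
    """Return the highest tier (1-3) that should be involved."""
    # Tier 3 conditions are taken from the playbook.
    if incident.data_loss or incident.security or incident.reputational_risk:
        return 3
    # Assumption: member impact or a missed 10-minute response escalates to Tier 2.
    if incident.member_impact or incident.minutes_unacknowledged >= 10:
        return 2
    return 1


def roles_to_notify(incident: Incident) -> list[str]:
    """List the roles to pull in for the computed tier."""
    tier = escalation_tier(incident)
    roles = ["Server Pilot (on-call)"]
    if tier >= 2:
        roles.append("Ops Captain")
        if incident.member_impact:
            roles += ["Growth Captain", "Community Captain"]
    if tier >= 3:
        roles.append("Executive Council")
    return roles
```

For example, `roles_to_notify(Incident(member_impact=True))` would return the on-call Server Pilot, the Ops Captain, and both backup captains, but not the Executive Council.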
Incident Workflow
- Detection – An alert, user report, or monitoring check fires.
- Triage (On-Call):
  - Impact? (Users, Revenue, Reputation)
  - Scope? (Discord, Game Servers, Payments, Data)
  - Immediate mitigation? (Restart, reroute, communication)
- Communication:
  - Update #ops-alerts. Template:
    🛎 Incident Start
    - Time:
    - Impact:
    - Owner:
    - Next update:
  - On public impact: Discord Announcements + Status Page after 15 min.
- Mitigation – Technical measures, rollback/failover if needed.
- Resolution – Incident closed, summary entered in the template.
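To keep the #ops-alerts update consistent, the Incident Start template could be filled in programmatically. The sketch below is illustrative only: the function names and the 30-minute default update interval are assumptions, while the 15-minute deadline for public communication comes from the workflow above.

```python
from datetime import datetime, timedelta, timezone

PUBLIC_UPDATE_AFTER_MIN = 15  # from the playbook: public update after 15 min


def incident_start_message(impact: str, owner: str, start: datetime,
                           update_interval_min: int = 30) -> str:
    """Render the Incident Start template for #ops-alerts.

    update_interval_min is a hypothetical default, not defined by the playbook.
    """
    next_update = start + timedelta(minutes=update_interval_min)
    return (
        "🛎 Incident Start\n"
        f"- Time: {start:%Y-%m-%d %H:%M} UTC\n"
        f"- Impact: {impact}\n"
        f"- Owner: {owner}\n"
        f"- Next update: {next_update:%H:%M} UTC"
    )


def public_update_deadline(start: datetime) -> datetime:
    """Latest time for the Discord announcement / Status Page update."""
    return start + timedelta(minutes=PUBLIC_UPDATE_AFTER_MIN)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(incident_start_message("Game servers unreachable", "Server Pilot", now))
    print("Public update due:", public_update_deadline(now).isoformat())
```

The function only builds the text; posting it to Discord is left to whatever bot or webhook the team already uses.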
Postmortem Template
- Summary: What happened? Timeline + impact.
- Root Cause: Technical or process failures.
- Detection: How was it detected? (Monitoring, user, chance)
- Response: What went well? What did not?
- Action Items:
- Communication Review: Discord/Status Page/Partner Alerts.
- Links: Grafana Dashboards, Logs, PRs.
Store all postmortems in the vault (folder 04_Infrastruktur/Postmortems) and link them in Templates/Meeting-Template.
On-Call Rotation
- Week 1: Server Pilot
- Week 2: Tech Lead Deputy
- Week 3: Ops Captain (fallback)
- Player Support Lead handles communication (ticket updates, community messaging).
- Keep the calendar (Google/Obsidian) up to date and raise the on-call handover in the meeting.
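The three-week rotation above can be derived from a date, which may help when cross-checking the calendar. This is a minimal sketch: `ROTATION_START` is a hypothetical anchor date (a Monday on which Week 1 is assumed to have begun) and must be replaced with the real one.

```python
from datetime import date

# Order taken from the rotation list above.
ROTATION = ["Server Pilot", "Tech Lead Deputy", "Ops Captain (fallback)"]

# Hypothetical anchor: the Monday on which a "Week 1" started.
ROTATION_START = date(2024, 1, 1)


def on_call_for(day: date, start: date = ROTATION_START) -> str:
    """Return the role on call for the week containing `day`."""
    weeks_elapsed = (day - start).days // 7  # floor division also handles earlier dates
    return ROTATION[weeks_elapsed % len(ROTATION)]


if __name__ == "__main__":
    print(on_call_for(date.today()))
```

Swaps and holidays are not modelled here; the calendar remains the source of truth.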
Tooling & Automation
- Alertmanager → Discord webhook #ops-alerts.
- PagerDuty/ntfy (optional) for SMS/push.
- Loki + Grafana Tempo for logs/traces (Phase 6).
- Incident Forms (Google/Garden) for member reports → Ticket Tool integration.
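One possible way to bridge Alertmanager and the Discord webhook is a small relay that converts Alertmanager's webhook JSON into a Discord message body. The sketch below covers only the conversion step and is not the team's actual setup: the function name is hypothetical, and the `severity` label and `summary` annotation are common conventions that the alert rules may or may not define.

```python
DISCORD_CONTENT_LIMIT = 2000  # Discord caps message content at 2000 characters


def alertmanager_to_discord(payload: dict) -> dict:
    """Convert an Alertmanager webhook payload into a Discord webhook body."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        # Assumption: alert rules set a `severity` label and a `summary` annotation.
        severity = labels.get("severity", "n/a")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name} (severity: {severity}) {summary}".rstrip())
    content = "\n".join(lines) or "Alertmanager notification without alerts"
    # Truncate rather than fail if many alerts are grouped into one notification.
    return {"content": content[:DISCORD_CONTENT_LIMIT]}


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "NodeDown", "severity": "critical"},
            "annotations": {"summary": "node-1 unreachable"},
        }],
    }
    print(alertmanager_to_discord(sample))
```

The returned dictionary would be sent as the JSON body of an HTTP POST to the webhook URL of #ops-alerts; the URL should be kept out of the vault and treated as a secret.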
Training & Drills
- Quarterly "Game Day" (chaos test) – node failure, Discord bot outage, payment error.
- Shadowing for deputies (Senior Guardian, Creator Liaison) to understand cross-division impact.
- Collect lessons learned in the Lessons Hub → update the Manifest.