Incident & On-Call Playbook

Standard operating procedure for outages, escalations, and postmortems. Owner: Ops Captain. Deputies: Server Pilot and Player Support Lead.

Escalation Tree

  1. Tier 0 (Alert Trigger) – Prometheus/Alertmanager, Bot Monitoring, Community Reports.
  2. Tier 1 (On-Call Engineer) – Server Pilot (rotation). Response time under 10 minutes.
  3. Tier 2 (Captain Escalation) – Ops Captain + Backup (Growth/Community Captains when members are impacted).
  4. Tier 3 (Executive Council) – Only for data loss, security incidents, or reputational risk.
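
The escalation tree above could be sketched as a small routing function. This is an illustration only, not existing tooling: the `Incident` fields and the `roles_to_notify` helper are hypothetical names. It also makes three assumptions the playbook does not state: a missed Tier 1 response window triggers Tier 2, member impact triggers Tier 2, and the Ops Captain is informed before any Tier 3 escalation.

```python
from dataclasses import dataclass
from typing import List

# From the playbook: Tier 1 response time is under 10 minutes.
TIER1_RESPONSE_MINUTES = 10


@dataclass
class Incident:
    # Hypothetical flags, set by the on-call engineer during triage.
    member_impact: bool = False
    data_loss: bool = False
    security: bool = False
    reputation_risk: bool = False


def roles_to_notify(incident: Incident, minutes_unacknowledged: float) -> List[str]:
    """Return the roles to page, ordered from Tier 1 upward."""
    roles = ["Server Pilot (on-call)"]  # Tier 1 is always paged first

    # Assumption: a missed response window or member impact triggers Tier 2.
    tier2 = (
        minutes_unacknowledged >= TIER1_RESPONSE_MINUTES
        or incident.member_impact
    )
    # From the playbook: Tier 3 only for data loss, security, reputational risk.
    tier3 = incident.data_loss or incident.security or incident.reputation_risk

    # Assumption: the Ops Captain is always informed before Tier 3.
    if tier2 or tier3:
        roles.append("Ops Captain")
    if incident.member_impact:
        roles += ["Growth Captain", "Community Captain"]
    if tier3:
        roles.append("Executive Council")
    return roles
```

For example, `roles_to_notify(Incident(security=True), 0)` pages the on-call Server Pilot, the Ops Captain, and the Executive Council, while an incident with no flags set that has been acknowledged in time stays at Tier 1.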

Channel-Matrix:

Incident Workflow

  1. Detection – An alert fires, a user reports a problem, or monitoring triggers.
  2. Triage (On-Call):
    • Impact? (Users, Revenue, Reputation)
    • Scope? (Discord, Game Servers, Payments, Data)
    • Immediate mitigation? (Restart, reroute, communication)
  3. Communication:
    • Post an update in #ops-alerts using this template:
      🛎 Incident Start
      - Time: 
      - Impact: 
      - Owner: 
      - Next update: 
      
    • On public impact: post to Discord Announcements and update the Status Page after 15 minutes.
  4. Mitigation – Technical measures; rollback or failover if needed.
  5. Resolution – Incident closed; add the summary to the template.
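
The communication step could be supported by a small helper that fills in the #ops-alerts template and computes when the public notice is due. This is a minimal sketch under stated assumptions: `render_incident_start` and `public_notice_due` are hypothetical names, and formatting times as UTC is an assumption, since the playbook does not name a time zone.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# From the playbook: public channels are updated after 15 minutes.
PUBLIC_NOTICE_DELAY = timedelta(minutes=15)


def render_incident_start(start: datetime, impact: str, owner: str,
                          next_update: datetime) -> str:
    """Fill in the Incident Start template for #ops-alerts."""
    return "\n".join([
        "🛎 Incident Start",
        f"- Time: {start:%Y-%m-%d %H:%M} UTC",  # assumption: times in UTC
        f"- Impact: {impact}",
        f"- Owner: {owner}",
        f"- Next update: {next_update:%H:%M} UTC",
    ])


def public_notice_due(start: datetime, public_impact: bool) -> Optional[datetime]:
    """Return when the Discord/Status Page notice is due, or None if not public."""
    return start + PUBLIC_NOTICE_DELAY if public_impact else None


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # sample data
    print(render_incident_start(
        start,
        impact="Game servers unreachable",
        owner="Server Pilot",
        next_update=start + timedelta(minutes=30),
    ))
    print("Public notice due:", public_notice_due(start, public_impact=True))
```

Keeping the template in one function means every incident post has the same four fields in the same order, which makes later postmortem timelines easier to reconstruct.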

Postmortem Template

File all postmortems in the vault (folder 04_Infrastruktur/Postmortems) and link them in Templates/Meeting-Template.

On-Call Rotation

Tooling & Automation

Training & Drills
