{
  "_meta": {
    "schema": "https://www.drawdecisiontree.com/decision-dag.schema.json",
    "source": "https://www.drawdecisiontree.com",
    "description": "DrawDecisionTree.com is a free tool for building, sharing, and embedding interactive decision trees. This file is the machine-readable export of a published decision tree. The `dsl` field contains the original source in the Decision DAG DSL; the `dag` schema is documented at the URL in `schema` above.",
    "links": {
      "interactive": "https://www.drawdecisiontree.com/t/drawdecisiontree/engineering-on-call-escalation.html",
      "embed": "https://www.drawdecisiontree.com/embed/path/drawdecisiontree/engineering-on-call-escalation",
      "dsl_reference": "https://www.drawdecisiontree.com/decision-tree-dsl-reference.html",
      "guides": "https://www.drawdecisiontree.com/guides",
      "schema_docs": "https://www.drawdecisiontree.com/decision-dag.schema.json",
      "author_trees": "https://www.drawdecisiontree.com/trees/drawdecisiontree"
    },
    "generated_at": "2026-05-29T12:05:39.283Z"
  },
  "author": {
    "handle": "drawdecisiontree",
    "first_name": "Andrew",
    "last_name": null,
    "avatar_url": "1d32d828-b6ca-40ec-bdd7-771fe7b9c36a/avatar-1778531481027.svg",
    "display_name": "Andrew"
  },
  "file": {
    "id": "65e0f2b0-ca6a-4b4a-ae10-a18d6870157b",
    "name": "Who should I escalate this on-call incident to?",
    "public_slug": "engineering-on-call-escalation",
    "updated_at": "2026-05-12T16:53:43.587978+00:00",
    "url": "https://www.drawdecisiontree.com/t/drawdecisiontree/engineering-on-call-escalation.html",
    "json_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/engineering-on-call-escalation/tree.json",
    "dsl_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/engineering-on-call-escalation/tree.dag"
  },
  "meta": {
    "description": "Every minute an on-call engineer spends deciding who to call is a minute not spent fixing the problem—but over-escalating creates alert fatigue and erodes trust in the paging system. This tree gives on-call engineers a fast, consistent framework for deciding whether to handle an issue solo, wake a secondary, page their engineering lead, or trigger a full incident response with executive and communications involvement.",
    "mode": "decision",
    "entry": "Q1",
    "tags": [
      "engineering",
      "on-call",
      "incident response",
      "sre",
      "operations"
    ],
    "image": "https://images.unsplash.com/photo-1551288049-bebda4e38f71?w=1200&q=80"
  },
  "questions": [
    {
      "id": "Q1",
      "text": "Do you suspect a data breach, data loss, or active security compromise?"
    },
    {
      "id": "Q2",
      "text": "Is this issue customer-facing, or is it limited to internal tooling and back-office systems?"
    },
    {
      "id": "Q3",
      "text": "Is the customer-facing service completely down (full outage), or is it degraded (partial functionality or elevated errors)?"
    },
    {
      "id": "Q4",
      "text": "Has the issue persisted for more than 30 minutes without a clear resolution path?"
    },
    {
      "id": "Q5",
      "text": "Is the complete outage affecting a significant portion of your customer base, or has it persisted for more than 15 minutes?"
    }
  ],
  "outcomes": [
    {
      "id": "SOLO",
      "label": "Handle Alone"
    },
    {
      "id": "SECONDARY",
      "label": "Wake Secondary On-Call"
    },
    {
      "id": "PAGE_LEAD",
      "label": "Page Engineering Lead"
    },
    {
      "id": "FULL_INCIDENT",
      "label": "Invoke Full Incident Response (Exec + Comms)"
    }
  ],
  "dsl": "dag: Who should I escalate this on-call incident to?\nversion: 1.0.0\nimage: https://images.unsplash.com/photo-1551288049-bebda4e38f71?w=1200&q=80\ndescription: Every minute an on-call engineer spends deciding who to call is a minute not spent fixing the problem—but over-escalating creates alert fatigue and erodes trust in the paging system. This tree gives on-call engineers a fast, consistent framework for deciding whether to handle an issue solo, wake a secondary, page their engineering lead, or trigger a full incident response with executive and communications involvement.\ntags: engineering, on-call, incident response, sre, operations\nentry: Q1\n\nQ1: Do you suspect a data breach, data loss, or active security compromise?\n  hint: Any situation involving unauthorised data access, evidence of intrusion (unexpected privilege escalation, unfamiliar processes, anomalous network egress), or confirmed data loss should be treated as a security incident immediately—regardless of user-facing impact. Security incidents have legal, regulatory, and communications implications that require specialist involvement from the first moment. When in doubt, escalate: the cost of an unnecessary security escalation is far lower than the cost of a delayed response to a real breach.\n  yes -> [FULL_INCIDENT]\n  no  -> Q2\n\nQ2: Is this issue customer-facing, or is it limited to internal tooling and back-office systems?\n  hint: \"Customer-facing\" means end-users, paying customers, or partner integrations are experiencing degraded or absent service. Internal tooling issues (dashboards, data pipelines, admin panels) are serious but rarely warrant the same escalation urgency. Be precise: a broken internal deployment pipeline that prevents shipping a fix to a customer-facing issue is itself customer-facing by proxy. Consider your SLA commitments—some internal services may have contractual obligations to other teams that mirror external SLAs.\n  A: Customer-facing service affected -> Q3\n  B: Internal tooling or back-office only -> Q4\n\nQ3: Is the customer-facing service completely down (full outage), or is it degraded (partial functionality or elevated errors)?\n  hint: A \"complete outage\" means customers cannot access core functionality at all—login is broken, checkout is failing, the app returns 5xx on every request. \"Degraded\" covers a wide range: slower than normal, a subset of features unavailable, elevated error rates below 100%, or impact limited to a subset of customers or regions. Your incident severity runbook should have specific thresholds (e.g., error rate > 5 % = degraded, > 50 % = down); use those if they exist.\n  A: Complete outage — service is fully unavailable -> Q5\n  B: Degraded — partial functionality loss or elevated errors -> Q4\n\nQ4: Has the issue persisted for more than 30 minutes without a clear resolution path?\n  hint: Thirty minutes is a reasonable threshold for an on-call engineer to have formed a hypothesis, attempted a remediation, and assessed whether it is working. If you have a clear fix in progress and confidence it will resolve the issue, use your judgement. If you are still in diagnostic mode, or if your first remediation attempt failed, a second perspective will almost always shorten the overall resolution time. Do not let pride or the hope that the next thing you try will work delay escalation past this window.\n  yes -> [SECONDARY]\n  no  -> [SOLO]\n\nQ5: Is the complete outage affecting a significant portion of your customer base, or has it persisted for more than 15 minutes?\n  hint: \"Significant portion\" depends on your business—for a consumer product it might be more than 10 % of active users; for a B2B SaaS it might be any enterprise customer hitting the issue. Fifteen minutes is a short window deliberately: full outages have compounding effects (customer trust, SLA penalties, social media amplification) that make early escalation almost always the right call. If you are within the 15-minute window and have high confidence in a fix already being deployed, you may briefly continue solo, but keep your lead informed via Slack.\n  yes -> [PAGE_LEAD]\n  no  -> [SECONDARY]\n\n[SOLO]: Handle Alone\n  color: #16a34a\n  description: You are within your autonomous resolution window—continue investigating and remediating without waking anyone else. Keep a running incident log (even brief notes in your incident management tool or a Slack thread) so that if the situation changes and you need to escalate, others can get up to speed quickly. Set a personal timer for 20–30 minutes: if you have not resolved or clearly identified the root cause by then, move immediately to waking your secondary. After resolution, write a concise incident summary within 24 hours covering timeline, root cause, and any follow-up actions—even for issues you handled solo.\n  code: ENG_ONC_SOLO\n\n[SECONDARY]: Wake Secondary On-Call\n  color: #d97706\n  description: Wake your secondary on-call engineer—this issue has exceeded the threshold for solo handling. Send a clear, structured message: what is broken, what symptoms you are seeing, what you have already tried, and what your current hypothesis is. Do not ask them to take over; you remain the incident commander unless agreed otherwise. A second engineer brings fresh eyes, can parallelize investigation (one person checks logs while the other checks infrastructure), and ensures continuity if the issue extends beyond a single engineer's capacity. Update your incident channel immediately so the team has visibility, even if no further escalation is needed.\n  code: ENG_ONC_SECONDARY\n\n[PAGE_LEAD]: Page Engineering Lead\n  color: #ea580c\n  description: Page your engineering lead immediately—this is a significant, prolonged outage that requires leadership awareness, resource coordination, and potentially customer communication. When they join, give a concise 60-second brief: service affected, duration, customer impact, steps taken so far, and current hypothesis. The lead will decide whether to pull in additional engineers, contact a vendor, or trigger further escalation. Your job remains hands-on remediation; the lead takes the coordination burden off you. Post an initial status update to your customer status page within 10 minutes of paging the lead, even if it is just \"We are investigating an issue affecting [service]. Updates to follow.\"\n  code: ENG_ONC_LEAD\n\n[FULL_INCIDENT]: Invoke Full Incident Response (Exec + Comms)\n  color: #dc2626\n  description: Trigger your organisation's full incident response plan immediately—this situation involves potential security compromise, data breach, or major outage that requires executive awareness and coordinated external communications. Page your engineering lead and security lead simultaneously; do not wait to confirm your suspicions before escalating—the cost of a false alarm is negligible compared to a delayed response to a real breach. Preserve all evidence: do not reboot or redeploy affected systems until your security team has confirmed forensic preservation. Your communications team must be looped in within the first 15 minutes to prepare customer notifications, regulator disclosures (if applicable), and internal updates. Maintain a second-by-second log of all actions taken—this record will be essential for the post-incident review and any regulatory investigation.\n  code: ENG_ONC_FULL\n"
}