Who should I escalate this on-call incident to?

Every minute an on-call engineer spends deciding who to call is a minute not spent fixing the problem—but over-escalating creates alert fatigue and erodes trust in the paging system. This tree gives on-call engineers a fast, consistent framework for deciding whether to handle an issue solo, wake a secondary, page their engineering lead, or trigger a full incident response with executive and communications involvement.

Overview

Type: Decision tree
Tags: engineering, on-call, incident response, sre, operations
Entry: Q1
Questions: 5
Outcomes: 4
Author: Andrew
Last updated: 2026-05-12

Decision Tree

Start: Do you suspect a data breach, data loss, or active security compromise?

yes

  • Outcome: Invoke Full Incident Response (Exec + Comms)

no

  • Continues to question: Is this issue customer-facing, or is it limited to internal tooling and back-office systems?
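
The full branching, as listed in the DSL Representation section of this page, can be sketched as a small lookup table. This is a minimal illustration and not DrawDecisionTree's own code: `TREE`, `OUTCOMES`, and `walk` are hypothetical names, and the answer keys (`yes`, `no`, `A`, `B`) are copied from the DSL.

```python
# Hypothetical sketch: the escalation tree as a plain lookup table.
# Node IDs and answer keys mirror the published tree; TREE, OUTCOMES and
# walk() are illustrative names, not part of any DrawDecisionTree API.
TREE = {
    "Q1": {"yes": "FULL_INCIDENT", "no": "Q2"},
    "Q2": {"A": "Q3", "B": "Q4"},
    "Q3": {"A": "Q5", "B": "Q4"},
    "Q4": {"yes": "SECONDARY", "no": "SOLO"},
    "Q5": {"yes": "PAGE_LEAD", "no": "SECONDARY"},
}

OUTCOMES = {
    "SOLO": "Handle Alone",
    "SECONDARY": "Wake Secondary On-Call",
    "PAGE_LEAD": "Page Engineering Lead",
    "FULL_INCIDENT": "Invoke Full Incident Response (Exec + Comms)",
}


def walk(answers, entry="Q1"):
    """Follow the given answers from the entry question to an outcome label."""
    node = entry
    for answer in answers:
        if node in OUTCOMES:
            raise ValueError(f"outcome {node} already reached; unused answer {answer!r}")
        try:
            node = TREE[node][answer]
        except KeyError:
            raise ValueError(f"{answer!r} is not a valid answer at {node}") from None
    if node not in OUTCOMES:
        raise ValueError(f"answers ran out at question {node}")
    return OUTCOMES[node]


# No breach suspected, customer-facing (A), complete outage (A),
# widespread or longer than 15 minutes (yes).
print(walk(["no", "A", "A", "yes"]))
```

Keeping the transitions in data rather than nested `if` statements makes it easy to compare the table against the published tree line by line.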

Machine-Readable JSON (Canonical Model)

{
  "_meta": {
    "schema": "https://www.drawdecisiontree.com/decision-dag.schema.json",
    "source": "https://www.drawdecisiontree.com",
    "description": "DrawDecisionTree.com is a free tool for building, sharing, and embedding interactive decision trees. This file is the machine-readable export of a published decision tree. The `dsl` field contains the original source in the Decision DAG DSL; the `dag` schema is documented at the URL in `schema` above.",
    "links": {
      "interactive": "https://www.drawdecisiontree.com/t/drawdecisiontree/engineering-on-call-escalation.html",
      "embed": "https://www.drawdecisiontree.com/embed/path/drawdecisiontree/engineering-on-call-escalation",
      "dsl_reference": "https://www.drawdecisiontree.com/decision-tree-dsl-reference.html",
      "guides": "https://www.drawdecisiontree.com/guides",
      "schema_docs": "https://www.drawdecisiontree.com/decision-dag.schema.json",
      "author_trees": "https://www.drawdecisiontree.com/trees/drawdecisiontree"
    },
    "generated_at": "2026-05-29T12:05:39.283Z"
  },
  "author": {
    "handle": "drawdecisiontree",
    "first_name": "Andrew",
    "last_name": null,
    "avatar_url": "1d32d828-b6ca-40ec-bdd7-771fe7b9c36a/avatar-1778531481027.svg",
    "display_name": "Andrew"
  },
  "file": {
    "id": "65e0f2b0-ca6a-4b4a-ae10-a18d6870157b",
    "name": "Who should I escalate this on-call incident to?",
    "public_slug": "engineering-on-call-escalation",
    "updated_at": "2026-05-12T16:53:43.587978+00:00",
    "url": "https://www.drawdecisiontree.com/t/drawdecisiontree/engineering-on-call-escalation.html",
    "json_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/engineering-on-call-escalation/tree.json",
    "dsl_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/engineering-on-call-escalation/tree.dag"
  },
  "meta": {
    "description": "Every minute an on-call engineer spends deciding who to call is a minute not spent fixing the problem—but over-escalating creates alert fatigue and erodes trust in the paging system. This tree gives on-call engineers a fast, consistent framework for deciding whether to handle an issue solo, wake a secondary, page their engineering lead, or trigger a full incident response with executive and communications involvement.",
    "mode": "decision",
    "entry": "Q1",
    "tags": [
      "engineering",
      "on-call",
      "incident response",
      "sre",
      "operations"
    ],
    "image": "https://images.unsplash.com/photo-1551288049-bebda4e38f71?w=1200&q=80"
  },
  "questions": [
    {
      "id": "Q1",
      "text": "Do you suspect a data breach, data loss, or active security compromise?"
    },
    {
      "id": "Q2",
      "text": "Is this issue customer-facing, or is it limited to internal tooling and back-office systems?"
    },
    {
      "id": "Q3",
      "text": "Is the customer-facing service completely down (full outage), or is it degraded (partial functionality or elevated errors)?"
    },
    {
      "id": "Q4",
      "text": "Has the issue persisted for more than 30 minutes without a clear resolution path?"
    },
    {
      "id": "Q5",
      "text": "Is the complete outage affecting a significant portion of your customer base, or has it persisted for more than 15 minutes?"
    }
  ],
  "outcomes": [
    {
      "id": "SOLO",
      "label": "Handle Alone"
    },
    {
      "id": "SECONDARY",
      "label": "Wake Secondary On-Call"
    },
    {
      "id": "PAGE_LEAD",
      "label": "Page Engineering Lead"
    },
    {
      "id": "FULL_INCIDENT",
      "label": "Invoke Full Incident Response (Exec + Comms)"
    }
  ],
  "dsl": "dag: Who should I escalate this on-call incident to?\nversion: 1.0.0\nimage: https://images.unsplash.com/photo-1551288049-bebda4e38f71?w=1200&q=80\ndescription: Every minute an on-call engineer spends deciding who to call is a minute not spent fixing the problem—but over-escalating creates alert fatigue and erodes trust in the paging system. This tree gives on-call engineers a fast, consistent framework for deciding whether to handle an issue solo, wake a secondary, page their engineering lead, or trigger a full incident response with executive and communications involvement.\ntags: engineering, on-call, incident response, sre, operations\nentry: Q1\n\nQ1: Do you suspect a data breach, data loss, or active security compromise?\n  hint: Any situation involving unauthorised data access, evidence of intrusion (unexpected privilege escalation, unfamiliar processes, anomalous network egress), or confirmed data loss should be treated as a security incident immediately—regardless of user-facing impact. Security incidents have legal, regulatory, and communications implications that require specialist involvement from the first moment. When in doubt, escalate: the cost of an unnecessary security escalation is far lower than the cost of a delayed response to a real breach.\n  yes -> [FULL_INCIDENT]\n  no  -> Q2\n\nQ2: Is this issue customer-facing, or is it limited to internal tooling and back-office systems?\n  hint: \"Customer-facing\" means end-users, paying customers, or partner integrations are experiencing degraded or absent service. Internal tooling issues (dashboards, data pipelines, admin panels) are serious but rarely warrant the same escalation urgency. Be precise: a broken internal deployment pipeline that prevents shipping a fix to a customer-facing issue is itself customer-facing by proxy. 
Consider your SLA commitments—some internal services may have contractual obligations to other teams that mirror external SLAs.\n  A: Customer-facing service affected -> Q3\n  B: Internal tooling or back-office only -> Q4\n\nQ3: Is the customer-facing service completely down (full outage), or is it degraded (partial functionality or elevated errors)?\n  hint: A \"complete outage\" means customers cannot access core functionality at all—login is broken, checkout is failing, the app returns 5xx on every request. \"Degraded\" covers a wide range: slower than normal, a subset of features unavailable, elevated error rates below 100%, or impact limited to a subset of customers or regions. Your incident severity runbook should have specific thresholds (e.g., error rate > 5 % = degraded, > 50 % = down); use those if they exist.\n  A: Complete outage — service is fully unavailable -> Q5\n  B: Degraded — partial functionality loss or elevated errors -> Q4\n\nQ4: Has the issue persisted for more than 30 minutes without a clear resolution path?\n  hint: Thirty minutes is a reasonable threshold for an on-call engineer to have formed a hypothesis, attempted a remediation, and assessed whether it is working. If you have a clear fix in progress and confidence it will resolve the issue, use your judgement. If you are still in diagnostic mode, or if your first remediation attempt failed, a second perspective will almost always shorten the overall resolution time. Do not let pride or the hope that the next thing you try will work delay escalation past this window.\n  yes -> [SECONDARY]\n  no  -> [SOLO]\n\nQ5: Is the complete outage affecting a significant portion of your customer base, or has it persisted for more than 15 minutes?\n  hint: \"Significant portion\" depends on your business—for a consumer product it might be more than 10 % of active users; for a B2B SaaS it might be any enterprise customer hitting the issue. 
Fifteen minutes is a short window deliberately: full outages have compounding effects (customer trust, SLA penalties, social media amplification) that make early escalation almost always the right call. If you are within the 15-minute window and have high confidence in a fix already being deployed, you may briefly continue solo, but keep your lead informed via Slack.\n  yes -> [PAGE_LEAD]\n  no  -> [SECONDARY]\n\n[SOLO]: Handle Alone\n  color: #16a34a\n  description: You are within your autonomous resolution window—continue investigating and remediating without waking anyone else. Keep a running incident log (even brief notes in your incident management tool or a Slack thread) so that if the situation changes and you need to escalate, others can get up to speed quickly. Set a personal timer for 20–30 minutes: if you have not resolved or clearly identified the root cause by then, move immediately to waking your secondary. After resolution, write a concise incident summary within 24 hours covering timeline, root cause, and any follow-up actions—even for issues you handled solo.\n  code: ENG_ONC_SOLO\n\n[SECONDARY]: Wake Secondary On-Call\n  color: #d97706\n  description: Wake your secondary on-call engineer—this issue has exceeded the threshold for solo handling. Send a clear, structured message: what is broken, what symptoms you are seeing, what you have already tried, and what your current hypothesis is. Do not ask them to take over; you remain the incident commander unless agreed otherwise. A second engineer brings fresh eyes, can parallelize investigation (one person checks logs while the other checks infrastructure), and ensures continuity if the issue extends beyond a single engineer's capacity. 
Update your incident channel immediately so the team has visibility, even if no further escalation is needed.\n  code: ENG_ONC_SECONDARY\n\n[PAGE_LEAD]: Page Engineering Lead\n  color: #ea580c\n  description: Page your engineering lead immediately—this is a significant, prolonged outage that requires leadership awareness, resource coordination, and potentially customer communication. When they join, give a concise 60-second brief: service affected, duration, customer impact, steps taken so far, and current hypothesis. The lead will decide whether to pull in additional engineers, contact a vendor, or trigger further escalation. Your job remains hands-on remediation; the lead takes the coordination burden off you. Post an initial status update to your customer status page within 10 minutes of paging the lead, even if it is just \"We are investigating an issue affecting [service]. Updates to follow.\"\n  code: ENG_ONC_LEAD\n\n[FULL_INCIDENT]: Invoke Full Incident Response (Exec + Comms)\n  color: #dc2626\n  description: Trigger your organisation's full incident response plan immediately—this situation involves potential security compromise, data breach, or major outage that requires executive awareness and coordinated external communications. Page your engineering lead and security lead simultaneously; do not wait to confirm your suspicions before escalating—the cost of a false alarm is negligible compared to a delayed response to a real breach. Preserve all evidence: do not reboot or redeploy affected systems until your security team has confirmed forensic preservation. Your communications team must be looped in within the first 15 minutes to prepare customer notifications, regulator disclosures (if applicable), and internal updates. Maintain a second-by-second log of all actions taken—this record will be essential for the post-incident review and any regulatory investigation.\n  code: ENG_ONC_FULL\n"
}
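
As a rough sketch of how a script might consume this export, the snippet below indexes the `questions` and `outcomes` arrays by ID and checks that `meta.entry` names a real question. It uses only fields visible in the JSON above; `EXPORT` is a trimmed inline copy for illustration, and `index_export` is a hypothetical helper rather than anything the site provides.

```python
import json

# Trimmed inline copy of the export shown above (two questions, two outcomes).
# A real script would load the complete file instead.
EXPORT = """
{
  "meta": {"entry": "Q1"},
  "questions": [
    {"id": "Q1", "text": "Do you suspect a data breach, data loss, or active security compromise?"},
    {"id": "Q4", "text": "Has the issue persisted for more than 30 minutes without a clear resolution path?"}
  ],
  "outcomes": [
    {"id": "SOLO", "label": "Handle Alone"},
    {"id": "SECONDARY", "label": "Wake Secondary On-Call"}
  ]
}
"""


def index_export(raw):
    """Return (entry_id, {question_id: text}, {outcome_id: label})."""
    data = json.loads(raw)
    questions = {q["id"]: q["text"] for q in data["questions"]}
    outcomes = {o["id"]: o["label"] for o in data["outcomes"]}
    entry = data["meta"]["entry"]
    if entry not in questions:
        raise ValueError(f"entry {entry!r} is not a known question ID")
    return entry, questions, outcomes


entry, questions, outcomes = index_export(EXPORT)
print(entry, "->", questions[entry])
print(sorted(outcomes))
```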

DSL Representation

dag: Who should I escalate this on-call incident to?
version: 1.0.0
image: https://images.unsplash.com/photo-1551288049-bebda4e38f71?w=1200&q=80
description: Every minute an on-call engineer spends deciding who to call is a minute not spent fixing the problem—but over-escalating creates alert fatigue and erodes trust in the paging system. This tree gives on-call engineers a fast, consistent framework for deciding whether to handle an issue solo, wake a secondary, page their engineering lead, or trigger a full incident response with executive and communications involvement.
tags: engineering, on-call, incident response, sre, operations
entry: Q1

Q1: Do you suspect a data breach, data loss, or active security compromise?
  hint: Any situation involving unauthorised data access, evidence of intrusion (unexpected privilege escalation, unfamiliar processes, anomalous network egress), or confirmed data loss should be treated as a security incident immediately—regardless of user-facing impact. Security incidents have legal, regulatory, and communications implications that require specialist involvement from the first moment. When in doubt, escalate: the cost of an unnecessary security escalation is far lower than the cost of a delayed response to a real breach.
  yes -> [FULL_INCIDENT]
  no  -> Q2

Q2: Is this issue customer-facing, or is it limited to internal tooling and back-office systems?
  hint: "Customer-facing" means end-users, paying customers, or partner integrations are experiencing degraded or absent service. Internal tooling issues (dashboards, data pipelines, admin panels) are serious but rarely warrant the same escalation urgency. Be precise: a broken internal deployment pipeline that prevents shipping a fix to a customer-facing issue is itself customer-facing by proxy. Consider your SLA commitments—some internal services may have contractual obligations to other teams that mirror external SLAs.
  A: Customer-facing service affected -> Q3
  B: Internal tooling or back-office only -> Q4

Q3: Is the customer-facing service completely down (full outage), or is it degraded (partial functionality or elevated errors)?
  hint: A "complete outage" means customers cannot access core functionality at all—login is broken, checkout is failing, the app returns 5xx on every request. "Degraded" covers a wide range: slower than normal, a subset of features unavailable, elevated error rates below 100%, or impact limited to a subset of customers or regions. Your incident severity runbook should have specific thresholds (e.g., error rate > 5 % = degraded, > 50 % = down); use those if they exist.
  A: Complete outage — service is fully unavailable -> Q5
  B: Degraded — partial functionality loss or elevated errors -> Q4

Q4: Has the issue persisted for more than 30 minutes without a clear resolution path?
  hint: Thirty minutes is a reasonable threshold for an on-call engineer to have formed a hypothesis, attempted a remediation, and assessed whether it is working. If you have a clear fix in progress and confidence it will resolve the issue, use your judgement. If you are still in diagnostic mode, or if your first remediation attempt failed, a second perspective will almost always shorten the overall resolution time. Do not let pride or the hope that the next thing you try will work delay escalation past this window.
  yes -> [SECONDARY]
  no  -> [SOLO]

Q5: Is the complete outage affecting a significant portion of your customer base, or has it persisted for more than 15 minutes?
  hint: "Significant portion" depends on your business—for a consumer product it might be more than 10 % of active users; for a B2B SaaS it might be any enterprise customer hitting the issue. Fifteen minutes is a short window deliberately: full outages have compounding effects (customer trust, SLA penalties, social media amplification) that make early escalation almost always the right call. If you are within the 15-minute window and have high confidence in a fix already being deployed, you may briefly continue solo, but keep your lead informed via Slack.
  yes -> [PAGE_LEAD]
  no  -> [SECONDARY]

[SOLO]: Handle Alone
  color: #16a34a
  description: You are within your autonomous resolution window—continue investigating and remediating without waking anyone else. Keep a running incident log (even brief notes in your incident management tool or a Slack thread) so that if the situation changes and you need to escalate, others can get up to speed quickly. Set a personal timer for 20–30 minutes: if you have not resolved or clearly identified the root cause by then, move immediately to waking your secondary. After resolution, write a concise incident summary within 24 hours covering timeline, root cause, and any follow-up actions—even for issues you handled solo.
  code: ENG_ONC_SOLO

[SECONDARY]: Wake Secondary On-Call
  color: #d97706
  description: Wake your secondary on-call engineer—this issue has exceeded the threshold for solo handling. Send a clear, structured message: what is broken, what symptoms you are seeing, what you have already tried, and what your current hypothesis is. Do not ask them to take over; you remain the incident commander unless agreed otherwise. A second engineer brings fresh eyes, can parallelize investigation (one person checks logs while the other checks infrastructure), and ensures continuity if the issue extends beyond a single engineer's capacity. Update your incident channel immediately so the team has visibility, even if no further escalation is needed.
  code: ENG_ONC_SECONDARY

[PAGE_LEAD]: Page Engineering Lead
  color: #ea580c
  description: Page your engineering lead immediately—this is a significant, prolonged outage that requires leadership awareness, resource coordination, and potentially customer communication. When they join, give a concise 60-second brief: service affected, duration, customer impact, steps taken so far, and current hypothesis. The lead will decide whether to pull in additional engineers, contact a vendor, or trigger further escalation. Your job remains hands-on remediation; the lead takes the coordination burden off you. Post an initial status update to your customer status page within 10 minutes of paging the lead, even if it is just "We are investigating an issue affecting [service]. Updates to follow."
  code: ENG_ONC_LEAD

[FULL_INCIDENT]: Invoke Full Incident Response (Exec + Comms)
  color: #dc2626
  description: Trigger your organisation's full incident response plan immediately—this situation involves potential security compromise, data breach, or major outage that requires executive awareness and coordinated external communications. Page your engineering lead and security lead simultaneously; do not wait to confirm your suspicions before escalating—the cost of a false alarm is negligible compared to a delayed response to a real breach. Preserve all evidence: do not reboot or redeploy affected systems until your security team has confirmed forensic preservation. Your communications team must be looped in within the first 15 minutes to prepare customer notifications, regulator disclosures (if applicable), and internal updates. Maintain a second-by-second log of all actions taken—this record will be essential for the post-incident review and any regulatory investigation.
  code: ENG_ONC_FULL
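
To sanity-check a tree like this one, it can help to pull the edges out of the DSL text and list every route to an outcome. The sketch below is an assumption-laden illustration, not the official parser: it infers the line shapes (`answer -> target`, `A: label -> target`, `[OUTCOME]: label`) only from the listing above, and `META_KEYS`, `PROPERTY_KEYS`, `parse_edges`, and `all_paths` are hypothetical names. The embedded `DSL` string is a trimmed copy with hints, descriptions, and long answer labels shortened.

```python
# Keys inferred from the listing above; the real DSL may define more.
META_KEYS = {"dag", "version", "image", "description", "tags", "entry"}
PROPERTY_KEYS = {"hint", "color", "description", "code"}

# Trimmed copy of the DSL above (hints and descriptions omitted).
DSL = """\
entry: Q1

Q1: Do you suspect a data breach, data loss, or active security compromise?
  yes -> [FULL_INCIDENT]
  no  -> Q2

Q2: Is this issue customer-facing?
  A: Customer-facing service affected -> Q3
  B: Internal tooling or back-office only -> Q4

Q3: Is the service completely down, or degraded?
  A: Complete outage -> Q5
  B: Degraded -> Q4

Q4: Has the issue persisted for more than 30 minutes?
  yes -> [SECONDARY]
  no  -> [SOLO]

Q5: Significant portion of customers, or more than 15 minutes?
  yes -> [PAGE_LEAD]
  no  -> [SECONDARY]

[SOLO]: Handle Alone
  code: ENG_ONC_SOLO
"""


def parse_edges(dsl):
    """Return {question_id: {answer_key: target_id}} from DSL text."""
    edges = {}
    current = None
    for line in dsl.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            key = line.split(":", 1)[0].strip()
            # Outcome headers ("[SOLO]: ...") and metadata have no outgoing edges.
            if key.startswith("[") or key in META_KEYS:
                current = None
            else:
                current = key
                edges[current] = {}
            continue
        body = line.strip()
        if current is None or "->" not in body:
            continue
        if body.split(":", 1)[0] in PROPERTY_KEYS:
            continue
        answer, target = body.rsplit("->", 1)
        answer = answer.split(":", 1)[0].strip()  # "A: label" becomes "A"
        edges[current][answer] = target.strip().strip("[]")
    return edges


def all_paths(edges, node, trail=()):
    """Yield (answers_taken, outcome_id) for every route starting at node."""
    if node not in edges:  # not a question, so it is an outcome
        yield trail, node
        return
    for answer, target in edges[node].items():
        yield from all_paths(edges, target, trail + (f"{node}={answer}",))


edges = parse_edges(DSL)
for trail, outcome in all_paths(edges, "Q1"):
    print(" > ".join(trail), "=>", outcome)
```

Listing the routes this way makes it quick to confirm that every outcome is reachable and that no answer points at an undefined node.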

How to use this decision tree

Click "Open interactive version" to step through the questions. Your answers narrow the tree until a recommended outcome is reached. You can also embed this tree on your own site.

More decision trees by Andrew

Which API design pattern is right for my project?
Determine the right API design style for your integration scenario.
Authentication Method Selection
Authentication is a security-critical, high-friction decision to reverse — migrating users from one auth method to another requires coordinated password resets or credential migration campaigns. This tree eliminates methods that don't match your user type, enterprise requirements, and security posture, giving you a clear shortlist before you write a line of code.
Caching Strategy Selection
Premature or misapplied caching adds complexity — stale data bugs, invalidation logic, and distributed consistency problems — without solving the actual bottleneck. This tree routes you to the caching pattern that matches your data access profile, so you apply the right tool to the right problem rather than defaulting to Redis for everything.
CI/CD Pipeline Tool Selection
Choosing a CI/CD platform is a long-term infrastructure commitment — pipelines accumulate config, custom scripts, and team muscle memory that make switching painful. This tree eliminates tools that don't fit your source control host, infrastructure model, or team scale, leaving only the options genuinely viable for your situation.
Which cloud provider should I use — AWS, Azure, or Google Cloud?
Answer a few questions to identify the most suitable cloud platform for your workload.
Container Orchestration Platform Selection
Container orchestration is foundational infrastructure — the platform you choose shapes how you deploy, scale, network, and operate every service you run. This tree eliminates options that don't match your operational maturity, cloud provider commitment, and workload complexity, so you land on the platform that fits your team today without over-engineering for a scale you haven't reached.