Observability Stack Selection

Observability tooling accumulates dashboards, alert configurations, and on-call runbooks that are expensive to migrate. Choosing the wrong platform means either paying for capabilities you don't need or hitting walls when debugging complex production incidents. This tree eliminates options that don't match your cloud footprint, operational capacity, and primary debugging use case.

Overview

Type: Elimination matrix
Tags: devops, observability, monitoring, sre, infrastructure, operations
Entry: Q1
Questions: 13
Outcomes: 4
Author: Andrew
Last updated: 2026-05-12

Decision Tree

Start: Is your entire production infrastructure on a single cloud provider?

A: AWS exclusively [CLOUDWATCH, DATADOG, NEW-RELIC, PROMETHEUS]

B: GCP exclusively [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]

C: Multi-cloud, Azure, on-premise, or mixed [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]
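
One way to read the bracketed lists: each answer names the outcomes that stay in play, and everything else is eliminated. The Python sketch below follows that reading, assuming elimination is a running set intersection across answers. The `TREE` dictionary and the `eliminate` helper are hypothetical names used for illustration; the answer lists are copied from the four questions defined in the DSL Representation section, and the real DrawDecisionTree engine may traverse the tree differently.

```python
# Answer -> surviving outcomes, copied from the tree's Q1-Q4 definitions.
TREE = {
    "Q1": {
        "A": {"CLOUDWATCH", "DATADOG", "NEW-RELIC", "PROMETHEUS"},
        "B": {"DATADOG", "PROMETHEUS", "HONEYCOMB", "NEW-RELIC"},
        "C": {"DATADOG", "PROMETHEUS", "HONEYCOMB", "NEW-RELIC"},
    },
    "Q2": {
        "A": {"PROMETHEUS", "CLOUDWATCH"},
        "B": {"DATADOG", "HONEYCOMB", "NEW-RELIC", "CLOUDWATCH"},
    },
    "Q3": {
        "A": {"HONEYCOMB", "DATADOG"},
        "B": {"PROMETHEUS", "CLOUDWATCH", "DATADOG", "NEW-RELIC"},
    },
    "Q4": {
        "A": {"PROMETHEUS", "CLOUDWATCH"},
        "B": {"DATADOG", "NEW-RELIC", "HONEYCOMB"},
    },
}

ALL_OUTCOMES = {"DATADOG", "PROMETHEUS", "CLOUDWATCH", "HONEYCOMB", "NEW-RELIC"}


def eliminate(answers):
    """Intersect the surviving outcomes across the given answers.

    `answers` maps a question id to the chosen option, e.g. {"Q1": "A"}.
    ASSUMPTION: plain set intersection; this is not the site's actual engine.
    """
    remaining = set(ALL_OUTCOMES)
    for question_id, option in answers.items():
        remaining &= TREE[question_id][option]
    return remaining


if __name__ == "__main__":
    # AWS-only, SaaS only, metrics-focused, budget not a constraint.
    picks = {"Q1": "A", "Q2": "B", "Q3": "B", "Q4": "B"}
    print(sorted(eliminate(picks)))
```

Under this reading, some answer combinations (for example Q2 option A together with Q3 option A) intersect to nothing, which the sketch returns as an empty set rather than a recommendation.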

Machine-Readable JSON (Canonical Model)

{
  "_meta": {
    "schema": "https://www.drawdecisiontree.com/decision-dag.schema.json",
    "source": "https://www.drawdecisiontree.com",
    "description": "DrawDecisionTree.com is a free tool for building, sharing, and embedding interactive decision trees. This file is the machine-readable export of a published decision tree. The `dsl` field contains the original source in the Decision DAG DSL; the `dag` schema is documented at the URL in `schema` above.",
    "links": {
      "interactive": "https://www.drawdecisiontree.com/t/drawdecisiontree/observability-stack.html",
      "embed": "https://www.drawdecisiontree.com/embed/path/drawdecisiontree/observability-stack",
      "dsl_reference": "https://www.drawdecisiontree.com/decision-tree-dsl-reference.html",
      "guides": "https://www.drawdecisiontree.com/guides",
      "schema_docs": "https://www.drawdecisiontree.com/decision-dag.schema.json",
      "author_trees": "https://www.drawdecisiontree.com/trees/drawdecisiontree"
    },
    "generated_at": "2026-05-29T12:05:39.332Z"
  },
  "author": {
    "handle": "drawdecisiontree",
    "first_name": "Andrew",
    "last_name": null,
    "avatar_url": "1d32d828-b6ca-40ec-bdd7-771fe7b9c36a/avatar-1778531481027.svg",
    "display_name": "Andrew"
  },
  "file": {
    "id": "50679d03-9b0e-4bfb-a5a7-8a10685a8687",
    "name": "Observability Stack Selection",
    "public_slug": "observability-stack",
    "updated_at": "2026-05-12T16:53:43.587978+00:00",
    "url": "https://www.drawdecisiontree.com/t/drawdecisiontree/observability-stack.html",
    "json_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/observability-stack/tree.json",
    "dsl_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/observability-stack/tree.dag"
  },
  "meta": {
    "description": "Observability tooling accumulates dashboards, alert configurations, and on-call runbooks that are expensive to migrate. Choosing the wrong platform means either paying for capabilities you don't need or hitting walls when debugging complex production incidents. This tree eliminates options that don't match your cloud footprint, operational capacity, and primary debugging use case.",
    "mode": "elimination",
    "entry": "Q1",
    "tags": [
      "devops",
      "observability",
      "monitoring",
      "sre",
      "infrastructure",
      "operations"
    ],
    "image": "https://images.unsplash.com/photo-1584949091598-c31daaaa4aa9?w=1200&q=80"
  },
  "questions": [
    {
      "id": "Q1",
      "text": "Is your entire production infrastructure on a single cloud provider?"
    },
    {
      "id": "A",
      "text": "AWS exclusively [CLOUDWATCH, DATADOG, NEW-RELIC, PROMETHEUS]"
    },
    {
      "id": "B",
      "text": "GCP exclusively [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]"
    },
    {
      "id": "C",
      "text": "Multi-cloud, Azure, on-premise, or mixed [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]"
    },
    {
      "id": "Q2",
      "text": "Does your team have the capacity to self-host, operate, and scale the observability infrastructure alongside your product?"
    },
    {
      "id": "A",
      "text": "YES — dedicated platform team, comfortable running our own stack [PROMETHEUS, CLOUDWATCH]"
    },
    {
      "id": "B",
      "text": "NO — fully managed SaaS only [DATADOG, HONEYCOMB, NEW-RELIC, CLOUDWATCH]"
    },
    {
      "id": "Q3",
      "text": "Is your primary observability use case debugging complex distributed system failures — tracing requests across microservices, finding unknown-unknown failures with high-cardinality data?"
    },
    {
      "id": "A",
      "text": "YES — distributed tracing and high-cardinality event exploration are the primary need [HONEYCOMB, DATADOG]"
    },
    {
      "id": "B",
      "text": "NO — infrastructure dashboards, uptime alerting, and standard metrics are the primary need [PROMETHEUS, CLOUDWATCH, DATADOG, NEW-RELIC]"
    },
    {
      "id": "Q4",
      "text": "Is per-host or per-service SaaS pricing a significant budget constraint?"
    },
    {
      "id": "A",
      "text": "YES — budget is a significant constraint [PROMETHEUS, CLOUDWATCH]"
    },
    {
      "id": "B",
      "text": "NO — per-host SaaS pricing is acceptable at our scale [DATADOG, NEW-RELIC, HONEYCOMB]"
    }
  ],
  "outcomes": [
    {
      "id": "DATADOG",
      "label": "Datadog"
    },
    {
      "id": "PROMETHEUS",
      "label": "Prometheus + Grafana"
    },
    {
      "id": "CLOUDWATCH",
      "label": "AWS CloudWatch"
    },
    {
      "id": "HONEYCOMB",
      "label": "Honeycomb"
    }
  ],
  "dsl": "dag: Observability Stack Selection\nversion: 1.0.0\nimage: https://images.unsplash.com/photo-1584949091598-c31daaaa4aa9?w=1200&q=80\ndescription: Observability tooling accumulates dashboards, alert configurations, and on-call runbooks that are expensive to migrate. Choosing the wrong platform means either paying for capabilities you don't need or hitting walls when debugging complex production incidents. This tree eliminates options that don't match your cloud footprint, operational capacity, and primary debugging use case.\ntags: devops, observability, monitoring, sre, infrastructure, operations\nentry: Q1\nmode: elimination\n\nQ1: Is your entire production infrastructure on a single cloud provider?\n  hint: If all your compute, databases, and services run in one cloud, a cloud-native monitoring service (CloudWatch on AWS, Cloud Monitoring on GCP) can provide reasonable coverage with zero setup. Multi-cloud or hybrid setups need a vendor-neutral observability layer that can ingest telemetry from all environments.\n  A: AWS exclusively [CLOUDWATCH, DATADOG, NEW-RELIC, PROMETHEUS]\n  B: GCP exclusively [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]\n  C: Multi-cloud, Azure, on-premise, or mixed [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]\n\nQ2: Does your team have the capacity to self-host, operate, and scale the observability infrastructure alongside your product?\n  hint: Prometheus plus Grafana is powerful and free, but running it at production scale — high-availability Prometheus, long-term storage (Thanos or Cortex), Alertmanager clustering, and Grafana LDAP integration — is a substantial ongoing engineering investment. 
If you don't have a dedicated platform team, a managed SaaS solution will pay for itself quickly.\n  A: YES — dedicated platform team, comfortable running our own stack [PROMETHEUS, CLOUDWATCH]\n  B: NO — fully managed SaaS only [DATADOG, HONEYCOMB, NEW-RELIC, CLOUDWATCH]\n\nQ3: Is your primary observability use case debugging complex distributed system failures — tracing requests across microservices, finding unknown-unknown failures with high-cardinality data?\n  hint: Traditional metrics and dashboards tell you that something is wrong; distributed tracing and high-cardinality event exploration tell you why. If your primary pain is debugging a 500ms latency spike that affects only users in a specific region using a specific client version, you need Honeycomb or Datadog APM — not just a dashboard of average response times.\n  A: YES — distributed tracing and high-cardinality event exploration are the primary need [HONEYCOMB, DATADOG]\n  B: NO — infrastructure dashboards, uptime alerting, and standard metrics are the primary need [PROMETHEUS, CLOUDWATCH, DATADOG, NEW-RELIC]\n\nQ4: Is per-host or per-service SaaS pricing a significant budget constraint?\n  hint: Datadog and New Relic both charge per monitored host, which scales aggressively for large fleets. At 50+ hosts, per-host SaaS pricing commonly exceeds $5,000–$15,000/month. Prometheus is open-source with compute and storage costs only. 
Honeycomb charges per event rather than per host, which may be more or less expensive depending on your event volume and team size.\n  A: YES — budget is a significant constraint [PROMETHEUS, CLOUDWATCH]\n  B: NO — per-host SaaS pricing is acceptable at our scale [DATADOG, NEW-RELIC, HONEYCOMB]\n\n[DATADOG]: Datadog\n  color: #632CA6\n  description: Datadog is the most comprehensive commercial observability platform available — it unifies infrastructure metrics, APM traces, logs, real user monitoring (RUM), synthetic monitoring, database monitoring, and security signals in a single pane of glass with a shared tagging model across all telemetry types. Its agent-based collection, 700+ integrations, and auto-instrumentation libraries cover virtually every technology in a modern stack. Datadog's notebook and dashboard system is excellent, and its alert correlation (using monitors, composite monitors, and event-based correlations) helps reduce alert fatigue. The trade-off is cost: Datadog is among the most expensive observability platforms per host, and costs can grow quickly as you add products. The per-SKU pricing model (APM, logs, RUM, security are all separate line items) means teams commonly spend more than budgeted as they expand coverage. Despite the cost, Datadog is the default choice for engineering organisations that value a single, deeply integrated observability platform and can justify the investment.\n  code: OBS_DATADOG\n\n[PROMETHEUS]: Prometheus + Grafana\n  color: #E6522C\n  description: Prometheus and Grafana are the open-source industry standard for infrastructure and application metrics. Prometheus scrapes metrics from instrumented applications and exporters (node_exporter for Linux, kube-state-metrics for Kubernetes, and hundreds of community exporters for databases, proxies, and services), storing them in a time-series database with a powerful query language (PromQL). Grafana provides dashboards, alerting, and — with Loki — log aggregation. 
The total software cost is zero; you pay only for compute and storage. Prometheus integrates natively with Kubernetes through ServiceMonitor and PodMonitor custom resources (via the Prometheus Operator), making it the default choice for Kubernetes-native observability. For long-term metrics storage beyond Prometheus's local retention, Thanos or Cortex provide highly available, object-storage-backed solutions. The investment required to run this stack reliably at scale is real but well-documented, and the ecosystem is mature enough that most operational patterns have established solutions.\n  code: OBS_PROMETHEUS\n\n[CLOUDWATCH]: AWS CloudWatch\n  color: #FF9900\n  description: AWS CloudWatch is the native monitoring service for AWS infrastructure — it automatically collects metrics from EC2, RDS, Lambda, ELB, ECS, EKS, and virtually every other AWS service with no agent installation or configuration required. CloudWatch Logs centralises log streams from all AWS services and Lambda functions. CloudWatch Alarms integrate directly with SNS for notifications and Auto Scaling for remediation actions. For workloads exclusively on AWS, CloudWatch provides adequate infrastructure visibility with zero operational overhead. Its limitations become apparent for application-level observability: the metrics and dashboards are adequate but not exceptional, distributed tracing requires X-Ray (a separate service with separate pricing), and the log query language (CloudWatch Logs Insights) is less powerful than Loki or Elasticsearch. 
CloudWatch is the right choice for AWS-only shops that need solid infrastructure monitoring without paying for a separate SaaS platform — and a reasonable starting point before deciding whether to add Datadog or Prometheus for deeper observability.\n  code: OBS_CLOUDWATCH\n\n[HONEYCOMB]: Honeycomb\n  color: #F5A623\n  description: Honeycomb is built around a fundamentally different observability philosophy: instead of pre-aggregating metrics into dashboards, it stores every event in full fidelity and allows arbitrary slicing by any combination of fields at query time. This high-cardinality, high-dimensionality approach makes it possible to ask questions like \"show me the p99 latency for requests from users on iOS 17.2 in the EU-WEST-1 region whose request included product ID 'abc123' in the last 15 minutes\" — a query impossible in traditional metrics-based systems. Honeycomb's BubbleUp feature automatically surfaces which dimensions correlate with a latency spike or error rate increase, dramatically reducing the time to diagnose complex production incidents. It integrates with OpenTelemetry, making it straightforward to instrument existing services. Honeycomb is the right choice for teams with complex microservice architectures, where unknown-unknown failures are the hardest debugging challenge and where a metrics dashboard would never surface the root cause. It is not a replacement for infrastructure monitoring — pair it with CloudWatch or node_exporter for host-level metrics.\n  code: OBS_HONEYCOMB\n\n[NEW-RELIC]: New Relic\n  color: #008C99\n  description: New Relic is a full-stack observability platform with a strong heritage in APM (Application Performance Monitoring) — its auto-instrumentation agents for Java, .NET, Python, Ruby, Node.js, PHP, and Go can provide deep application visibility with minimal code changes. 
The New Relic One platform unifies APM, infrastructure monitoring, distributed tracing, browser monitoring, and synthetic checks with a shared entity model and NRQL (New Relic Query Language) for ad-hoc analysis. New Relic shifted to a consumption-based pricing model in 2020 — you pay for data ingest (GB per month) plus seats for full users — which can be more predictable than Datadog's per-host model for large fleets but requires careful management of data volume. New Relic is a strong choice for organisations with polyglot application stacks that need deep language-level APM without manual instrumentation effort, and for teams migrating from legacy APM tools (AppDynamics, Dynatrace) who want a modern platform with a familiar feature set.\n  code: OBS_NEW_RELIC\n"
}
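
The export's `questions` array is flat: each question entry (`Q1`, `Q2`, ...) is followed by its option entries (`A`, `B`, ...), and each option's text ends with a bracketed outcome list. The Python sketch below shows one way a consumer might regroup that array. It assumes that ids of the form `Q<number>` mark questions and that every other entry belongs to the most recent question; `group_questions`, `SAMPLE`, and the `keeps` key are hypothetical names, and the sample is only an excerpt of the export above.

```python
import json
import re

# Excerpt of the export's flat `questions` array (first question only).
SAMPLE = """
{
  "questions": [
    {"id": "Q1", "text": "Is your entire production infrastructure on a single cloud provider?"},
    {"id": "A", "text": "AWS exclusively [CLOUDWATCH, DATADOG, NEW-RELIC, PROMETHEUS]"},
    {"id": "B", "text": "GCP exclusively [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]"}
  ]
}
"""

# Trailing "[ID, ID, ...]" list at the end of an option's text.
BRACKET = re.compile(r"\s*\[([^\]]*)\]\s*$")


def group_questions(entries):
    """Nest option entries under the question entry that precedes them.

    ASSUMPTION: ids of the form Q<number> are questions; anything else
    (A, B, C, ...) is an option of the most recent question.
    """
    grouped = []
    for entry in entries:
        if re.fullmatch(r"Q\d+", entry["id"]):
            grouped.append({"id": entry["id"], "text": entry["text"], "options": []})
            continue
        if not grouped:
            raise ValueError(f"option {entry['id']!r} appears before any question")
        match = BRACKET.search(entry["text"])
        raw = match.group(1).split(",") if match else []
        keeps = [item.strip() for item in raw if item.strip()]
        label = BRACKET.sub("", entry["text"])
        grouped[-1]["options"].append({"id": entry["id"], "label": label, "keeps": keeps})
    return grouped


if __name__ == "__main__":
    tree = group_questions(json.loads(SAMPLE)["questions"])
    for question in tree:
        print(question["id"], question["text"])
        for option in question["options"]:
            print("  ", option["id"], option["label"], "->", option["keeps"])
```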

DSL Representation

dag: Observability Stack Selection
version: 1.0.0
image: https://images.unsplash.com/photo-1584949091598-c31daaaa4aa9?w=1200&q=80
description: Observability tooling accumulates dashboards, alert configurations, and on-call runbooks that are expensive to migrate. Choosing the wrong platform means either paying for capabilities you don't need or hitting walls when debugging complex production incidents. This tree eliminates options that don't match your cloud footprint, operational capacity, and primary debugging use case.
tags: devops, observability, monitoring, sre, infrastructure, operations
entry: Q1
mode: elimination

Q1: Is your entire production infrastructure on a single cloud provider?
  hint: If all your compute, databases, and services run in one cloud, a cloud-native monitoring service (CloudWatch on AWS, Cloud Monitoring on GCP) can provide reasonable coverage with zero setup. Multi-cloud or hybrid setups need a vendor-neutral observability layer that can ingest telemetry from all environments.
  A: AWS exclusively [CLOUDWATCH, DATADOG, NEW-RELIC, PROMETHEUS]
  B: GCP exclusively [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]
  C: Multi-cloud, Azure, on-premise, or mixed [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]

Q2: Does your team have the capacity to self-host, operate, and scale the observability infrastructure alongside your product?
  hint: Prometheus plus Grafana is powerful and free, but running it at production scale — high-availability Prometheus, long-term storage (Thanos or Cortex), Alertmanager clustering, and Grafana LDAP integration — is a substantial ongoing engineering investment. If you don't have a dedicated platform team, a managed SaaS solution will pay for itself quickly.
  A: YES — dedicated platform team, comfortable running our own stack [PROMETHEUS, CLOUDWATCH]
  B: NO — fully managed SaaS only [DATADOG, HONEYCOMB, NEW-RELIC, CLOUDWATCH]

Q3: Is your primary observability use case debugging complex distributed system failures — tracing requests across microservices, finding unknown-unknown failures with high-cardinality data?
  hint: Traditional metrics and dashboards tell you that something is wrong; distributed tracing and high-cardinality event exploration tell you why. If your primary pain is debugging a 500ms latency spike that affects only users in a specific region using a specific client version, you need Honeycomb or Datadog APM — not just a dashboard of average response times.
  A: YES — distributed tracing and high-cardinality event exploration are the primary need [HONEYCOMB, DATADOG]
  B: NO — infrastructure dashboards, uptime alerting, and standard metrics are the primary need [PROMETHEUS, CLOUDWATCH, DATADOG, NEW-RELIC]

Q4: Is per-host or per-service SaaS pricing a significant budget constraint?
  hint: Datadog charges per monitored host, which scales aggressively for large fleets; at 50+ hosts, per-host SaaS pricing commonly reaches $5,000–$15,000/month. New Relic bills on data ingest plus user seats rather than per host, so its cost tracks telemetry volume. Prometheus is open-source with compute and storage costs only. Honeycomb charges per event rather than per host, which may be more or less expensive depending on your event volume and team size.
  A: YES — budget is a significant constraint [PROMETHEUS, CLOUDWATCH]
  B: NO — per-host SaaS pricing is acceptable at our scale [DATADOG, NEW-RELIC, HONEYCOMB]

[DATADOG]: Datadog
  color: #632CA6
  description: Datadog is the most comprehensive commercial observability platform available — it unifies infrastructure metrics, APM traces, logs, real user monitoring (RUM), synthetic monitoring, database monitoring, and security signals in a single pane of glass with a shared tagging model across all telemetry types. Its agent-based collection, 700+ integrations, and auto-instrumentation libraries cover virtually every technology in a modern stack. Datadog's notebook and dashboard system is excellent, and its alert correlation (using monitors, composite monitors, and event-based correlations) helps reduce alert fatigue. The trade-off is cost: Datadog is among the most expensive observability platforms per host, and costs can grow quickly as you add products. The per-SKU pricing model (APM, logs, RUM, security are all separate line items) means teams commonly spend more than budgeted as they expand coverage. Despite the cost, Datadog is the default choice for engineering organisations that value a single, deeply integrated observability platform and can justify the investment.
  code: OBS_DATADOG

[PROMETHEUS]: Prometheus + Grafana
  color: #E6522C
  description: Prometheus and Grafana are the open-source industry standard for infrastructure and application metrics. Prometheus scrapes metrics from instrumented applications and exporters (node_exporter for Linux, kube-state-metrics for Kubernetes, and hundreds of community exporters for databases, proxies, and services), storing them in a time-series database with a powerful query language (PromQL). Grafana provides dashboards, alerting, and — with Loki — log aggregation. The total software cost is zero; you pay only for compute and storage. Prometheus integrates natively with Kubernetes through ServiceMonitor and PodMonitor custom resources (via the Prometheus Operator), making it the default choice for Kubernetes-native observability. For long-term metrics storage beyond Prometheus's local retention, Thanos or Cortex provide highly available, object-storage-backed solutions. The investment required to run this stack reliably at scale is real but well-documented, and the ecosystem is mature enough that most operational patterns have established solutions.
  code: OBS_PROMETHEUS

[CLOUDWATCH]: AWS CloudWatch
  color: #FF9900
  description: AWS CloudWatch is the native monitoring service for AWS infrastructure — it automatically collects metrics from EC2, RDS, Lambda, ELB, ECS, EKS, and virtually every other AWS service with no agent installation or configuration required. CloudWatch Logs centralises log streams from all AWS services and Lambda functions. CloudWatch Alarms integrate directly with SNS for notifications and Auto Scaling for remediation actions. For workloads exclusively on AWS, CloudWatch provides adequate infrastructure visibility with zero operational overhead. Its limitations become apparent for application-level observability: the metrics and dashboards are adequate but not exceptional, distributed tracing requires X-Ray (a separate service with separate pricing), and the log query language (CloudWatch Logs Insights) is less powerful than Loki or Elasticsearch. CloudWatch is the right choice for AWS-only shops that need solid infrastructure monitoring without paying for a separate SaaS platform — and a reasonable starting point before deciding whether to add Datadog or Prometheus for deeper observability.
  code: OBS_CLOUDWATCH

[HONEYCOMB]: Honeycomb
  color: #F5A623
  description: Honeycomb is built around a fundamentally different observability philosophy: instead of pre-aggregating metrics into dashboards, it stores every event in full fidelity and allows arbitrary slicing by any combination of fields at query time. This high-cardinality, high-dimensionality approach makes it possible to ask questions like "show me the p99 latency for requests from users on iOS 17.2 in the EU-WEST-1 region whose request included product ID 'abc123' in the last 15 minutes" — a query impossible in traditional metrics-based systems. Honeycomb's BubbleUp feature automatically surfaces which dimensions correlate with a latency spike or error rate increase, dramatically reducing the time to diagnose complex production incidents. It integrates with OpenTelemetry, making it straightforward to instrument existing services. Honeycomb is the right choice for teams with complex microservice architectures, where unknown-unknown failures are the hardest debugging challenge and where a metrics dashboard would never surface the root cause. It is not a replacement for infrastructure monitoring — pair it with CloudWatch or node_exporter for host-level metrics.
  code: OBS_HONEYCOMB

[NEW-RELIC]: New Relic
  color: #008C99
  description: New Relic is a full-stack observability platform with a strong heritage in APM (Application Performance Monitoring) — its auto-instrumentation agents for Java, .NET, Python, Ruby, Node.js, PHP, and Go can provide deep application visibility with minimal code changes. The New Relic One platform unifies APM, infrastructure monitoring, distributed tracing, browser monitoring, and synthetic checks with a shared entity model and NRQL (New Relic Query Language) for ad-hoc analysis. New Relic shifted to a consumption-based pricing model in 2020 — you pay for data ingest (GB per month) plus seats for full users — which can be more predictable than Datadog's per-host model for large fleets but requires careful management of data volume. New Relic is a strong choice for organisations with polyglot application stacks that need deep language-level APM without manual instrumentation effort, and for teams migrating from legacy APM tools (AppDynamics, Dynatrace) who want a modern platform with a familiar feature set.
  code: OBS_NEW_RELIC
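
The Honeycomb description above contrasts pre-aggregated metrics with raw events that can be sliced by any field at query time. The Python sketch below illustrates that idea only; it is not Honeycomb's API or query language. The event fields (`os_version`, `region`, `product_id`, `duration_ms`), the sample values, and the helper names are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math

# Hypothetical raw request events; every field is kept, nothing is pre-aggregated.
EVENTS = [
    {"os_version": "iOS 17.2", "region": "eu-west-1", "product_id": "abc123", "duration_ms": 120},
    {"os_version": "iOS 17.2", "region": "eu-west-1", "product_id": "abc123", "duration_ms": 480},
    {"os_version": "iOS 17.2", "region": "eu-west-1", "product_id": "abc123", "duration_ms": 95},
    {"os_version": "iOS 17.2", "region": "us-east-1", "product_id": "abc123", "duration_ms": 60},
    {"os_version": "Android 14", "region": "eu-west-1", "product_id": "xyz789", "duration_ms": 75},
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slice_events(events, **filters):
    """Keep events whose fields equal every filter supplied at query time."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]


if __name__ == "__main__":
    matching = slice_events(
        EVENTS, os_version="iOS 17.2", region="eu-west-1", product_id="abc123"
    )
    durations = [e["duration_ms"] for e in matching]
    print(len(matching), percentile(durations, 99))
```

A system that stored only a per-region average would have no way to recover this slice after the fact, which is the distinction the description draws.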

How to use this decision tree

Click "Open interactive version" to step through the questions. Your answers narrow the tree until a recommended outcome is reached. You can also embed this tree on your own site.
