dag: Observability Stack Selection
version: 1.0.0
image: https://images.unsplash.com/photo-1584949091598-c31daaaa4aa9?w=1200&q=80
description: Observability tooling accumulates dashboards, alert configurations, and on-call runbooks that are expensive to migrate. Choosing the wrong platform means either paying for capabilities you don't need or hitting walls when debugging complex production incidents. This tree eliminates options that don't match your cloud footprint, operational capacity, and primary debugging use case.
tags: devops, observability, monitoring, sre, infrastructure, operations
entry: Q1
mode: elimination

Q1: Is your entire production infrastructure on a single cloud provider?
hint: If all your compute, databases, and services run in one cloud, a cloud-native monitoring service (CloudWatch on AWS, Cloud Monitoring on GCP) can provide reasonable coverage with zero setup. Multi-cloud or hybrid setups need a vendor-neutral observability layer that can ingest telemetry from all environments.
A: AWS exclusively [CLOUDWATCH, DATADOG, NEW-RELIC, PROMETHEUS]
B: GCP exclusively [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]
C: Multi-cloud, Azure, on-premise, or mixed [DATADOG, PROMETHEUS, HONEYCOMB, NEW-RELIC]

Q2: Does your team have the capacity to self-host, operate, and scale the observability infrastructure alongside your product?
hint: Prometheus plus Grafana is powerful and free, but running it at production scale — high-availability Prometheus, long-term storage (Thanos or Cortex), Alertmanager clustering, and Grafana LDAP integration — is a substantial ongoing engineering investment. If you don't have a dedicated platform team, a managed SaaS solution will pay for itself quickly.
A: YES — dedicated platform team, comfortable running our own stack [PROMETHEUS, CLOUDWATCH]
B: NO — fully managed SaaS only [DATADOG, HONEYCOMB, NEW-RELIC, CLOUDWATCH]

Q3: Is your primary observability use case debugging complex distributed system failures — tracing requests across microservices, finding unknown-unknown failures with high-cardinality data?
hint: Traditional metrics and dashboards tell you that something is wrong; distributed tracing and high-cardinality event exploration tell you why. If your primary pain is debugging a 500ms latency spike that affects only users in a specific region using a specific client version, you need Honeycomb or Datadog APM — not just a dashboard of average response times.
A: YES — distributed tracing and high-cardinality event exploration are the primary need [HONEYCOMB, DATADOG]
B: NO — infrastructure dashboards, uptime alerting, and standard metrics are the primary need [PROMETHEUS, CLOUDWATCH, DATADOG, NEW-RELIC]

Q4: Is per-host or per-service SaaS pricing a significant budget constraint?
hint: Datadog charges per monitored host, which scales aggressively for large fleets. New Relic charges for data ingest plus user seats, so its cost grows with telemetry volume rather than host count. At 50+ hosts, per-host SaaS pricing can reach $5,000–$15,000/month. Prometheus is open-source with compute and storage costs only. Honeycomb charges per event rather than per host, which may be more or less expensive depending on your event volume and team size.
A: YES — budget is a significant constraint [PROMETHEUS, CLOUDWATCH]
B: NO — per-host SaaS pricing is acceptable at our scale [DATADOG, NEW-RELIC, HONEYCOMB]

[DATADOG]: Datadog
color: #632CA6
description: Datadog is the most comprehensive commercial observability platform available — it unifies infrastructure metrics, APM traces, logs, real user monitoring (RUM), synthetic monitoring, database monitoring, and security signals in a single pane of glass with a shared tagging model across all telemetry types.
Its agent-based collection, 700+ integrations, and auto-instrumentation libraries cover virtually every technology in a modern stack. Datadog's notebook and dashboard system is excellent, and its alert correlation (using monitors, composite monitors, and event-based correlations) helps reduce alert fatigue. The trade-off is cost: Datadog is among the most expensive observability platforms per host, and costs can grow quickly as you add products. The per-SKU pricing model (APM, logs, RUM, security are all separate line items) means teams commonly spend more than budgeted as they expand coverage. Despite the cost, Datadog is the default choice for engineering organisations that value a single, deeply integrated observability platform and can justify the investment.
code: OBS_DATADOG

[PROMETHEUS]: Prometheus + Grafana
color: #E6522C
description: Prometheus and Grafana are the open-source industry standard for infrastructure and application metrics. Prometheus scrapes metrics from instrumented applications and exporters (node_exporter for Linux, kube-state-metrics for Kubernetes, and hundreds of community exporters for databases, proxies, and services), storing them in a time-series database with a powerful query language (PromQL). Grafana provides dashboards, alerting, and — with Loki — log aggregation. The total software cost is zero; you pay only for compute and storage. Prometheus integrates natively with Kubernetes through ServiceMonitor and PodMonitor custom resources (via the Prometheus Operator), making it the default choice for Kubernetes-native observability. For long-term metrics storage beyond Prometheus's local retention, Thanos or Cortex provide highly available, object-storage-backed solutions. The investment required to run this stack reliably at scale is real but well-documented, and the ecosystem is mature enough that most operational patterns have established solutions.
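To make the scrape model above concrete, here is a minimal standard-library sketch of what an instrumented application returns when Prometheus scrapes it: counter samples rendered in the Prometheus text exposition format. The helper names (`escape_label_value`, `render_counter`) and the metric `http_requests_total` are illustrative assumptions; a real service would normally use an official client library rather than hand-rendering this text.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family as text-exposition lines.

    `samples` is a list of (labels_dict, value) pairs. This helper is
    hypothetical and only illustrates the wire format.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is deterministic.
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"method": "GET", "code": "200"}, 3)],
    ))
```

Once series like these are scraped, a PromQL expression such as `sum by (code) (rate(http_requests_total[5m]))` would aggregate the per-second request rate by status code.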
code: OBS_PROMETHEUS

[CLOUDWATCH]: AWS CloudWatch
color: #FF9900
description: AWS CloudWatch is the native monitoring service for AWS infrastructure — it automatically collects basic metrics from EC2, RDS, Lambda, ELB, ECS, EKS, and virtually every other AWS service with no agent installation required (guest-level metrics such as EC2 memory and disk usage need the CloudWatch agent). CloudWatch Logs centralises log streams from AWS services and Lambda functions. CloudWatch Alarms integrate directly with SNS for notifications and Auto Scaling for remediation actions. For workloads exclusively on AWS, CloudWatch provides adequate infrastructure visibility with very little operational overhead. Its limitations become apparent for application-level observability: the metrics and dashboards are adequate but not exceptional, distributed tracing requires X-Ray (a separate service with separate pricing), and the log query language (CloudWatch Logs Insights) is less powerful than the query languages of Loki or Elasticsearch. CloudWatch is the right choice for AWS-only shops that need solid infrastructure monitoring without paying for a separate SaaS platform — and a reasonable starting point before deciding whether to add Datadog or Prometheus for deeper observability.
code: OBS_CLOUDWATCH

[HONEYCOMB]: Honeycomb
color: #F5A623
description: Honeycomb is built around a fundamentally different observability philosophy: instead of pre-aggregating metrics into dashboards, it stores every event in full fidelity and allows arbitrary slicing by any combination of fields at query time. This high-cardinality, high-dimensionality approach makes it possible to ask questions like "show me the p99 latency for requests from users on iOS 17.2 in the EU-WEST-1 region whose request included product ID 'abc123' in the last 15 minutes" — a query that is impractical in traditional metrics-based systems, where every added label multiplies the number of time series. Honeycomb's BubbleUp feature automatically surfaces which dimensions correlate with a latency spike or error rate increase, dramatically reducing the time to diagnose complex production incidents.
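The query-time slicing described above can be sketched in a few lines of plain Python. This is only an illustration of the idea of filtering raw events by arbitrary fields and then computing a percentile; it is not Honeycomb's API or implementation, and the field names (`os`, `region`, `product_id`, `duration_ms`) and function names are assumptions made for the example.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no matching events")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p99_latency(events, **filters):
    """p99 of duration_ms over events whose fields equal every filter."""
    matching = [
        event["duration_ms"]
        for event in events
        if all(event.get(field) == want for field, want in filters.items())
    ]
    return percentile(matching, 99)


if __name__ == "__main__":
    # Hypothetical raw events; nothing is pre-aggregated.
    events = [
        {"os": "iOS 17.2", "region": "eu-west-1",
         "product_id": "abc123", "duration_ms": d}
        for d in range(1, 101)
    ]
    events.append({"os": "Android 14", "region": "us-east-1",
                   "product_id": "xyz789", "duration_ms": 5000})
    print(p99_latency(events, os="iOS 17.2", region="eu-west-1",
                      product_id="abc123"))
```

Because the filter is applied to raw events at query time, adding another field to the question costs nothing up front, whereas a pre-aggregated metrics system would need that field declared as a label in advance.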
Honeycomb integrates with OpenTelemetry, making it straightforward to instrument existing services. Honeycomb is the right choice for teams with complex microservice architectures, where unknown-unknown failures are the hardest debugging challenge and where a metrics dashboard would never surface the root cause. It is not a replacement for infrastructure monitoring — pair it with CloudWatch or node_exporter for host-level metrics.
code: OBS_HONEYCOMB

[NEW-RELIC]: New Relic
color: #008C99
description: New Relic is a full-stack observability platform with a strong heritage in APM (Application Performance Monitoring) — its auto-instrumentation agents for Java, .NET, Python, Ruby, Node.js, PHP, and Go can provide deep application visibility with minimal code changes. The New Relic One platform unifies APM, infrastructure monitoring, distributed tracing, browser monitoring, and synthetic checks with a shared entity model and NRQL (New Relic Query Language) for ad-hoc analysis. New Relic shifted to a consumption-based pricing model in 2020 — you pay for data ingest (GB per month) plus seats for full users — which can be more predictable than Datadog's per-host model for large fleets but requires careful management of data volume. New Relic is a strong choice for organisations with polyglot application stacks that need deep language-level APM without manual instrumentation effort, and for teams migrating from legacy APM tools (AppDynamics, Dynatrace) who want a modern platform with a familiar feature set.
code: OBS_NEW_RELIC