dag: Deployment Strategy Selection
version: 1.0.0
image: https://images.unsplash.com/photo-1558618666-fcd25c85cd64?w=1200&q=80
description: Deployment strategy is one of the highest-leverage engineering decisions for system reliability: the wrong choice turns every release into a stressful event, while the right choice makes deployments routine. This tree routes you to the pattern that matches your infrastructure capabilities, rollback requirements, and confidence level in each release.
tags: devops, deployment, reliability, sre, infrastructure, engineering
entry: Q1

Q1: Can your deployment target run two versions of the application simultaneously with a load balancer routing traffic between them?
hint: Zero-downtime deployments require your infrastructure to support parallel versions: at minimum two instances and a load balancer or service mesh. If you have a single server, mutable state that two versions cannot safely share (local file system, in-process cache), or a database migration that cannot run while either version is serving traffic, answer No.
yes -> Q2
no -> [BIG-BANG]

Q2: Do you need to validate the new version against a fraction of real production traffic before committing to a full rollout?
hint: Traffic-splitting validation is valuable when the code change carries meaningful risk and real-user behaviour is the only reliable signal: new algorithms, pricing changes, significant UX rewrites, or anything where staging environment testing can't reproduce true production load patterns.
yes -> Q3
no -> Q4

Q3: Can your infrastructure split traffic by percentage, routing, say, 5% of requests to the new version while 95% continue to the old?
hint: Weighted routing requires a feature flag system (LaunchDarkly, Unleash), a service mesh (Istio, Linkerd), or a load balancer that supports traffic splitting (AWS ALB weighted target groups, NGINX `split_clients`). If your infrastructure only supports an all-or-nothing cut-over, answer No.
yes -> [CANARY]
no -> [FEATURE-FLAGS]

Q4: Do you need instant rollback, reverting to the previous version in seconds by switching a load balancer, with no redeployment required?
hint: Blue-green provides the fastest rollback but requires double the infrastructure capacity for the switchover window. If infrastructure cost is a constraint, or if your deployment pipeline is fast enough that redeploying the previous image in 2–5 minutes is acceptable, a rolling update is more cost-efficient.
yes -> [BLUE-GREEN]
no -> [ROLLING]

[BLUE-GREEN]: Blue-Green Deployment
color: #0077CC
description: Blue-green deployment maintains two identical production environments, blue (currently live) and green (idle), and releases by switching the load balancer from blue to green. The previous environment stays intact and reachable for an immediate rollback: if problems surface post-switch, a single load balancer change reverts traffic in seconds without any redeployment (a DNS-based switch takes effect only as cached records expire, so it is bounded by the record's TTL). This pattern greatly reduces user-facing deployment risk and makes rollback a trivial operation. The cost is double the infrastructure capacity for the switchover window, and it works cleanly only when database migrations are backwards-compatible: both environments share the same database, so the schema must support both the old and new application version simultaneously. Teams on AWS typically implement this with Elastic Beanstalk environment swaps, ECS blue/green deployments, or ALB weighted target group switching.
code: DEPLOY_BLUE_GREEN

[CANARY]: Canary Release
color: #F5A623
description: A canary release routes a small, controlled percentage of production traffic to the new version while the majority continues hitting the stable version. Metrics (error rate, latency percentiles, business KPIs) are monitored on the canary slice; if they remain within acceptable bounds, the percentage is gradually increased (1% → 5% → 25% → 100%) until full promotion. If metrics degrade, the canary is pulled back with minimal user impact. This is the most rigorous deployment pattern for high-traffic systems where even a brief full-scale degradation would be costly. It is a common approach at companies like Netflix, Google, and Stripe. Canary releases require weighted traffic routing (service mesh, feature flags, or ALB weighted target groups) and work best with automated promotion and rollback triggered by SLO-aligned alerts. The flip side: canary validation extends your deployment window from minutes to hours, or to days for large changes.
code: DEPLOY_CANARY

[ROLLING]: Rolling Update
color: #7B68EE
description: A rolling update replaces application instances one at a time (or in small batches), so that most of the fleet keeps serving traffic throughout the deployment. Kubernetes Deployments use rolling updates by default: the Deployment controller scales up new pods while scaling down old ones, respecting the configured `maxSurge` and `maxUnavailable` budgets. The key advantage over blue-green is that rolling updates require little or no additional infrastructure capacity, because you upgrade in place. The trade-off is a longer window during which old and new code run simultaneously, which requires backwards-compatible APIs and database schemas. Rollback is also slower than blue-green: reverting requires rolling the previous image back out rather than simply rerouting traffic. Rolling updates are the right default for most Kubernetes-hosted services where immediate rollback is not a hard requirement.
code: DEPLOY_ROLLING

[FEATURE-FLAGS]: Feature Flag Release
color: #27AE60
description: Feature flag releases decouple deployment from release: the new code ships to all servers but is hidden behind a flag, giving product and engineering full control over when and to whom the feature becomes visible. Flags can target specific users, organisations, beta cohorts, or percentage rollouts through a feature management platform (LaunchDarkly, Unleash, Flagsmith, or a homegrown system). This pattern is particularly powerful for gradual rollouts, A/B tests, kill switches, and migrations where you want to reach internal users first, then beta customers, then general availability, all without additional deployments. The principal risk is flag debt: flags that are never removed accumulate as conditional code paths that complicate testing and increase cognitive overhead. Establish a convention for the flag lifecycle (create, reach GA, remove within one quarter) before adopting this at scale.
code: DEPLOY_FEATURE_FLAGS

[BIG-BANG]: Big Bang Deployment
color: #E74C3C
description: A big bang deployment stops the old version, deploys the new version, and restarts, resulting in a planned maintenance window with user-visible downtime. While often labelled an anti-pattern, it is the correct and practical choice in several contexts: single-server deployments without a load balancer, database schema changes that are not backwards-compatible and require a coordinated cutover, or systems with in-process state that cannot coexist between versions. The key discipline for big bang deployments is thorough preparation: validate in a staging environment that mirrors production, script and test the rollback procedure in advance, communicate the maintenance window to users, and execute using a single idempotent script rather than a sequence of manual steps under time pressure. Schedule during the lowest-traffic window and define a hard go/no-go decision point before the deployment begins.
code: DEPLOY_BIG_BANG