
EN_IT_Deploy

somaz edited this page Jun 2, 2026 · 1 revision

IT Terminology: Deployment Strategies

23. What are Deployment Strategies?

A deployment strategy is a pattern that defines how a new version of an application replaces the existing version when it is rolled out to production. The choice of strategy directly determines downtime, rollback speed, resource cost, and the level of risk exposed to users. There is no single correct answer; the right strategy depends on the service's SLO, traffic characteristics, and how much infrastructure cost you can afford.

The core consideration is the trade-off between "are users affected during the deployment" and "how quickly can you roll back when something goes wrong." Reducing downtime and speeding up rollback usually requires more resources, because the old and new versions must run side by side.

Recreate

The simplest approach: tear down the entire old version, then bring up the new one.

  • How it works: Terminate all old-version instances, then start the new-version instances.
  • Downtime: Yes (service is unavailable between old shutdown and new startup).
  • Rollback speed: Slow (the new version must be torn down and the old version redeployed, which means a second outage).
  • Resource cost: Low (only one version runs at a time).
  • Risk: High user impact because downtime is unavoidable, but there are no compatibility issues from old and new versions running simultaneously.
  • When to use: Internal tools where downtime is acceptable, batch jobs, or cases where old and new cannot coexist (e.g., incompatible DB schema).

Rolling Update

Instances are replaced gradually, one (or a few) at a time. This is the default strategy for a Kubernetes Deployment.

  • How it works: Bring up some new-version Pods, verify they are healthy, remove some old-version Pods, and repeat until everything is replaced.
  • Downtime: None (a baseline number of instances is always serving traffic).
  • Rollback speed: Medium (a rollback is itself another rolling update back to the previous version, e.g., via kubectl rollout undo, so it is not instant).
  • Resource cost: Low to medium (only maxSurge worth of extra resources is needed).
  • Risk: Old and new versions coexist during the rollout, so backward compatibility is required. If a problem is detected late, a large portion may already be replaced.
  • When to use: Most stateless web services that require zero downtime.

Blue-Green

Fully provision both the old (Blue) and new (Green) environments simultaneously, then switch routing all at once.

  • How it works: Provision the Green (new) environment separately and fully, verify it, switch the load balancer/router to Green, then keep Blue on standby briefly before cleanup.
  • Downtime: None (the cutover is a single routing switch).
  • Rollback speed: Very fast (just point routing back to Blue).
  • Resource cost: High (both environments run at full capacity simultaneously — temporarily double).
  • Risk: Low (Green can be thoroughly verified before cutover, and rollback is instant). Stateful elements such as DB migrations still need separate handling.
  • When to use: Services where fast rollback and pre-cutover verification matter, when you have the resource headroom.
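
The cutover and rollback described above can be sketched as a single pointer swap. This is a minimal illustration, not a real load-balancer API: BlueGreenRouter and its methods are hypothetical names, and the health check is reduced to a boolean supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Hypothetical stand-in for a load balancer that sends all traffic to one environment."""

    active: str = "blue"    # environment currently receiving traffic
    standby: str = "green"  # fully provisioned environment receiving no traffic

    def cut_over(self, standby_healthy: bool) -> bool:
        """Swap active and standby only if the standby passed verification."""
        if not standby_healthy:
            return False  # keep serving from the current environment
        self.active, self.standby = self.standby, self.active
        return True

    def roll_back(self) -> None:
        """Rollback is the same swap in reverse; the old environment is still running."""
        self.active, self.standby = self.standby, self.active


router = BlueGreenRouter()
router.cut_over(standby_healthy=True)  # all traffic now goes to green
after_cutover = router.active
router.roll_back()                     # all traffic returns to blue
after_rollback = router.active
```

Because both environments stay provisioned, rollback costs one routing change rather than a redeploy, which is where the "very fast" rating comes from.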

Canary

Expose the new version to a small subset of users/traffic first, then gradually increase the share if no problems appear. The name comes from the canary in a coal mine (an early warning of danger).

  • How it works: Route 5% of traffic to the new version, confirm metrics are healthy, then expand step by step to 25% -> 50% -> 100%.
  • Downtime: None.
  • Rollback speed: Fast (setting the canary share back to 0% immediately stops the impact).
  • Resource cost: Medium (old and new run in parallel, but the new version is only a fraction).
  • Risk: Very low (even if something breaks, only a small subset of users is affected). It does require traffic-splitting and metric-observation infrastructure.
  • When to use: When you want to validate against real production traffic while minimizing user-impact risk.
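
As a rough sketch of the percentage-based split, the routing decision can be modeled as a weighted random choice per request. The function names and the seeded simulation are assumptions made for illustration; in practice the split is done by a load balancer, Ingress, or service mesh rather than application code.

```python
import random


def pick_version(canary_percent: float, rng: random.Random) -> str:
    """Return "canary" for roughly canary_percent of calls, otherwise "stable"."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


def canary_share(canary_percent: float, requests: int, seed: int = 0) -> float:
    """Simulate `requests` routing decisions and return the observed canary share (0..1)."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    hits = sum(pick_version(canary_percent, rng) == "canary" for _ in range(requests))
    return hits / requests


# Walk through the steps from the text: 5% -> 25% -> 50% -> 100%.
for step in (5, 25, 50, 100):
    print(f"target {step:>3}% -> observed {canary_share(step, 10_000):.1%}")
```

Setting canary_percent to 0 sends every request back to the stable version, which is the rollback path described above.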

A/B Testing

Run two versions (A and B) simultaneously, but the goal is comparing business metrics (conversion rate, click-through rate, etc.) rather than verifying technical stability.

  • How it works: Route to version A or B based on user segments (cookie, region, device, etc.), then compare metrics statistically.
  • Difference from Canary: Canary asks "is the new version safe?" (technical); A/B asks "which version produces better outcomes?" (business). The routing basis also differs — Canary is usually a random percentage, A/B is based on user attributes.
  • Resource cost: Medium (two versions run in parallel).
  • Risk: Low (both are often already-validated versions).
  • When to use: When you want to validate the business impact of a feature or UI change with data.
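
A minimal sketch of the A/B mechanics, assuming assignment by a cookie ID and conversion rate as the business metric. The function names and the event shape are hypothetical, and a real experiment would also run a statistical significance test before declaring a winner, which is left out here.

```python
import hashlib
from collections import Counter


def assign_variant(cookie_id: str) -> str:
    """Sticky assignment: the same cookie always maps to the same variant."""
    digest = hashlib.sha256(cookie_id.encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def conversion_rates(events: list[tuple[str, bool]]) -> dict[str, float]:
    """events holds (cookie_id, converted) pairs; returns conversions / visits per variant."""
    visits: Counter = Counter()
    conversions: Counter = Counter()
    for cookie_id, converted in events:
        variant = assign_variant(cookie_id)
        visits[variant] += 1
        conversions[variant] += int(converted)
    return {variant: conversions[variant] / visits[variant] for variant in visits}
```

Hashing the cookie keeps each user on one variant across visits, so the metrics for A and B are not mixed by users who saw both.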

Strategy Comparison

| Strategy | Downtime | Rollback Speed | Resource Cost | Risk | Typical Use-Case |
|---|---|---|---|---|---|
| Recreate | Yes | Slow | Low | High (outage) | Internal tools, batch, no coexistence |
| Rolling Update | None | Medium | Low to medium | Medium | Zero-downtime stateless web services |
| Blue-Green | None | Very fast | High (2x) | Low | Fast rollback / pre-cutover verification |
| Canary | None | Fast | Medium | Very low | Gradual validation on production traffic |
| A/B Testing | None | Fast | Medium | Low | Comparing business metrics |

Kubernetes Relevance

The default strategy for a Kubernetes Deployment is RollingUpdate. Its maxSurge and maxUnavailable fields (both default to 25%) tune replacement speed and availability.

  • maxSurge: The number (or %) of Pods that can be created above the desired replica count. A percentage is rounded up to a whole Pod. Higher values replace faster but consume more extra resources.
  • maxUnavailable: The number (or %) of Pods that can be unavailable during the update. A percentage is rounded down. Setting it to 0 keeps the desired number of Pods available at all times, which requires maxSurge to be greater than 0.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # 25% of 10 = 2.5, rounded up to 3: at most 13 Pods during the rollout
      maxUnavailable: 0    # never reduce the number of available Pods (zero downtime)
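  selector:
    matchLabels:
      app: web             # assumed label; the selector must match the Pod template labels
  template:
    metadata:
      labels:
        app: web
    spec: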
      containers:
        - name: web
          image: registry.example.com/web:v2

A plain Deployment only supports Recreate and RollingUpdate. Blue-Green and Canary are implemented via Service routing switches, Ingress weighting, or dedicated controllers such as Argo Rollouts and Flagger.
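
To make the maxSurge / maxUnavailable arithmetic from the RollingUpdate manifest above concrete, the helper below computes the upper bound on total Pods and the lower bound on available Pods. It is a sketch: rolling_update_bounds is a hypothetical name, not part of any Kubernetes client. The rounding directions (up for maxSurge, down for maxUnavailable) follow the documented Kubernetes behavior.

```python
def rolling_update_bounds(replicas: int, max_surge: str, max_unavailable: str) -> tuple[int, int]:
    """Return (max total Pods, min available Pods) during a rolling update.

    Each value is either an absolute count ("2") or a percentage ("25%").
    """

    def resolve(value: str, round_up: bool) -> int:
        if value.endswith("%"):
            scaled = replicas * int(value[:-1])
            # Integer arithmetic avoids floating-point rounding surprises.
            return -(-scaled // 100) if round_up else scaled // 100
        return int(value)

    surge = resolve(max_surge, round_up=True)               # maxSurge rounds up
    unavailable = resolve(max_unavailable, round_up=False)  # maxUnavailable rounds down
    return replicas + surge, replicas - unavailable


print(rolling_update_bounds(10, "25%", "0"))    # (13, 10)
print(rolling_update_bounds(10, "25%", "25%"))  # (13, 8)
```

With 10 replicas, 25% surge resolves to 3 extra Pods rather than 2.5, so the rollout peaks at 13 Pods.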


24. What is a Feature Flag?

A feature flag (also called a feature toggle) is a conditional-branch mechanism that lets you turn a specific feature on or off at runtime without changing code. Its core value is decoupling deployment from release. Even after code is deployed to production, if the flag is off, users never see it — so you can decide "when to ship the code" and "when to turn the feature on" independently.

if feature_flags.is_enabled("new_checkout", user=current_user):
    return render_new_checkout()
else:
    return render_legacy_checkout()

Flag Categories

Following the taxonomy in Pete Hodgson's "Feature Toggles" article on martinfowler.com, flags differ in lifespan and change frequency, and therefore in how they should be managed.

| Type | Purpose | Lifespan | Change Frequency |
|---|---|---|---|
| Release Toggle | Hide unfinished/in-validation features in production | Short (removed once the feature stabilizes) | Low |
| Ops Toggle | Operational control such as disabling a heavy feature under load (kill switch) | Long or permanent | As needed |
| Experiment Toggle (A/B) | Branch by user segment for A/B testing | During the experiment | Automatic / frequent |
| Permission Toggle | Expose a feature only to certain users/plans (e.g., premium) | Long-lived / permanent | Per-user |

Runtime Toggling

Flag values are changed in real time from external configuration (a flag-management service or config store) without redeploying code. The ability to instantly turn off a suspect new feature without redeploying (a kill switch) is a major operational advantage.
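
A minimal sketch of runtime toggling, using an in-memory dictionary as a stand-in for the external flag service or config store. FlagStore and checkout are hypothetical names for illustration; a real SDK would fetch and cache values from a remote service.

```python
class FlagStore:
    """Hypothetical in-memory stand-in for a flag-management service or config store."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        """Change a flag at runtime; no code is redeployed."""
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        """Unknown flags default to off so unreleased code stays hidden."""
        return self._flags.get(name, False)


def checkout(flags: FlagStore) -> str:
    """Pick the code path from the current flag value on every call."""
    return "new_checkout" if flags.is_enabled("new_checkout") else "legacy_checkout"


flags = FlagStore()
flags.set("new_checkout", True)    # release the feature
during_release = checkout(flags)
flags.set("new_checkout", False)   # kill switch: back to the legacy path
after_kill = checkout(flags)
```

Defaulting unknown flags to off is a design choice in this sketch: if the flag store is empty or the name is mistyped, users get the legacy path.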

Gradual Rollout

  • Percentage-based (% rollout): Activate progressively for 1% -> 10% -> 50% -> 100% of users while watching metrics.
  • Segment-based (user segment): Expand exposure in stages — internal employees -> beta users -> specific regions/plans.
  • Unlike infrastructure-level Canary, feature-flag-based gradual rollout can control exposure finely at the application level, per user.
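
One common way to implement a percentage rollout per user is to hash the user ID into a stable bucket and compare it with the current percentage. The sketch below assumes that approach; the function names are hypothetical, and flag tools may use a different hashing scheme.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the current percentage."""
    return bucket(flag, user_id) < rollout_percent
```

Because each user's bucket never changes, raising the percentage from 10 to 50 only adds users. Nobody who already had the feature loses it mid-rollout. Including the flag name in the hash keeps the same users from always being first in line for every flag.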

Flag Debt and Cleanup Discipline

Because a flag is fundamentally a conditional branch in code, leaving expired flags around causes code complexity and the number of test cases to explode. This is called flag debt.

  • Discipline: A Release Toggle must be removed once the feature stabilizes (delete both the flag and the dead branch code).
  • Attach metadata such as creation date, owner, and planned expiry to each flag, and clean them up periodically.
  • Use tools or lint rules that automatically detect and warn about stale flags.
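
The metadata-driven cleanup above could be automated along these lines. This is a sketch under assumptions: the FlagMeta fields mirror the metadata listed in the text, and the flag names, owners, and dates in the usage example are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagMeta:
    name: str
    owner: str
    created: date
    expires: date  # planned removal date


def stale_flags(flags: list[FlagMeta], today: date) -> list[str]:
    """Return names of flags whose planned expiry has passed, oldest expiry first."""
    expired = [flag for flag in flags if flag.expires < today]
    return [flag.name for flag in sorted(expired, key=lambda flag: flag.expires)]


# Hypothetical registry entries for illustration only.
registry = [
    FlagMeta("new_checkout", "payments", date(2026, 1, 10), date(2026, 3, 1)),
    FlagMeta("dark_mode", "web", date(2026, 4, 1), date(2026, 7, 1)),
    FlagMeta("old_banner", "web", date(2025, 11, 1), date(2026, 1, 15)),
]
print(stale_flags(registry, today=date(2026, 6, 2)))  # ['old_banner', 'new_checkout']
```

A check like this can run in CI and fail or warn when the list is non-empty, which turns the cleanup discipline into an enforced rule.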

Tools

  • LaunchDarkly: The leading commercial SaaS. Sophisticated targeting, experimentation, and audit features.
  • Unleash: Open source (self-hostable), with a wide range of SDKs.
  • Flagsmith: Open source / SaaS hybrid, centered on a REST API.

25. What is GitOps?

GitOps is an operating model that treats a Git repository as the single source of truth, declaratively managing the desired state of infrastructure and applications, and automatically synchronizing that state into the actual environment. In one line: "the desired state lives in Git, and the cluster converges toward it."

The Four OpenGitOps Principles

The CNCF OpenGitOps project defines GitOps with four principles.

  1. Declarative: The system's desired state is described declaratively rather than as imperative procedures (e.g., Kubernetes manifests, Helm, Kustomize).
  2. Versioned and Immutable: The desired state is stored in version control (Git) so changes are tracked, and each state is preserved immutably.
  3. Pulled Automatically: Software agents automatically pull the desired state (it is not pushed in from the outside).
  4. Continuously Reconciled: Agents continuously observe the actual state and constantly reconcile it to match the desired state.

Git as the Single Source of Truth

Because every change goes through a Git commit (usually a Pull Request), who changed what, when, and why is recorded automatically, which enables review, approval, and audit. Rollback is a git revert of the offending commit, which the agent then syncs like any other change.

Push vs Pull Model

| Aspect | Push Model | Pull Model (GitOps) |
|---|---|---|
| Operation | CI runs kubectl apply (or similar) against the cluster from outside | An in-cluster agent polls/watches Git and applies changes itself |
| Credentials | CI holds cluster admin credentials (exposed externally) | Credentials stay inside the cluster (minimal external exposure) |
| Drift handling | Requires separate detection | The agent always reconciles, so drift is auto-corrected |
| Safety | Relatively lower | Safer (minimal credential exposure + automatic reconciliation) |

The pull model is safer because it keeps powerful cluster credentials inside the cluster rather than exposing them to an external CI system. The agent also continuously reconciles toward the desired state, automatically reverting drift caused by manual changes.

Drift Detection & Auto-Sync

  • Drift: A situation where the actual state diverges from Git's desired state, e.g., someone changing things directly with kubectl edit.
  • A GitOps controller continuously compares the actual state against Git, surfaces the difference, and — when auto-sync is enabled — forcibly restores (self-heals) to the Git state.
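
The compare-and-restore behavior can be sketched as a single reconciliation pass over two dictionaries. This is a simplified model, not how ArgoCD or Flux are implemented: resources are reduced to name/spec strings, the function names are hypothetical, and pruning of resources that exist in the cluster but not in Git is left out.

```python
def detect_drift(desired: dict[str, str], actual: dict[str, str]) -> dict[str, tuple]:
    """Return {resource: (desired_spec, actual_spec)} for each resource that differs from Git."""
    return {
        name: (spec, actual.get(name))
        for name, spec in desired.items()
        if actual.get(name) != spec
    }


def reconcile(desired: dict[str, str], actual: dict[str, str], auto_sync: bool):
    """One reconciliation pass: always report drift, restore it only when auto_sync is on."""
    drift = detect_drift(desired, actual)
    if auto_sync:
        actual = {**actual, **desired}  # self-heal: overwrite drifted resources with Git's version
    return actual, drift


git_state = {"deploy/web": "image=web:v2,replicas=3"}
cluster_state = {"deploy/web": "image=web:v2,replicas=5"}  # someone ran kubectl edit

healed, drift = reconcile(git_state, cluster_state, auto_sync=True)
```

A controller runs this pass in a loop, so a manual change survives only until the next reconciliation when auto-sync is enabled.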

ArgoCD vs Flux

| Item | ArgoCD | Flux |
|---|---|---|
| Origin/Governance | CNCF Graduated (started by Intuit) | CNCF Graduated (started by Weaveworks) |
| UI | Powerful built-in web UI with app graph | No built-in UI (separate, e.g., Weave GitOps) |
| Unit of config | Application CRD | GitRepository + Kustomization/HelmRelease, etc. |
| Multi-tenancy | Based on Projects | Based on namespaces/RBAC |
| Character | Visualization- and ops-friendly, dashboard-centric | Lightweight and modular, composed of GitOps Toolkit components |

Relation to IaC

GitOps inherits the principles of IaC (Infrastructure as Code) — declarative and version-controlled — but goes one step further by adding continuous reconciliation. IaC is usually triggered by a human (e.g., terraform apply), whereas GitOps has an agent continuously watch Git and automatically converge. In short: "Git = the desired state, and the agent = the loop that constantly matches the system to that state."


26. What is Progressive Delivery?

Progressive delivery is an evolved deployment approach that combines Canary deployment with automated metric analysis and automatic rollback. Beyond simply increasing traffic in small steps, it automatically evaluates real-time metrics at each step — advancing to the next step if they pass, and instantly rolling back without human intervention if a regression is detected.

In one line: "Canary + automated analysis + automatic rollback." Progressive delivery takes the Canary method and replaces the human judgment of "should we advance to the next step" with metric-driven automatic judgment.

Automated Metric Analysis

At each canary step, the new version's health is evaluated with quantitative metrics.

  • Success rate: The ratio of healthy responses (2xx/3xx).
  • Latency: p95 / p99 response time.
  • Error rate: The ratio of 5xx responses and exception rates.

These metrics are collected over an analysis window and compared against thresholds. If they pass, the traffic share is increased; if a regression is detected, it rolls back automatically.
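
A sketch of the threshold comparison for one analysis window follows. The threshold values (99% success, 1% errors, 500 ms p99) and all names are assumptions chosen for illustration, not defaults of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    total_requests: int
    ok_responses: int      # 2xx/3xx responses
    server_errors: int     # 5xx responses
    p99_latency_ms: float


def analyze(
    metrics: WindowMetrics,
    min_success_rate: float = 0.99,  # assumed threshold
    max_error_rate: float = 0.01,    # assumed threshold
    max_p99_ms: float = 500.0,       # assumed threshold
) -> str:
    """Return "promote", "rollback", or "inconclusive" for one analysis window."""
    if metrics.total_requests == 0:
        return "inconclusive"  # no traffic reached the canary, so nothing can be judged
    success_rate = metrics.ok_responses / metrics.total_requests
    error_rate = metrics.server_errors / metrics.total_requests
    healthy = (
        success_rate >= min_success_rate
        and error_rate <= max_error_rate
        and metrics.p99_latency_ms <= max_p99_ms
    )
    return "promote" if healthy else "rollback"
```

Treating an empty window as inconclusive rather than healthy avoids promoting a canary that never received traffic.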

How It Builds on Canary

Recreate / Rolling
        v (add gradual exposure)
     Canary  --- human reads metrics and decides manually
        v (add automated analysis + automatic rollback)
Progressive Delivery --- metric-driven automatic promotion/rollback

Tools

  • Argo Rollouts: Provides a Rollout CRD that replaces the Kubernetes Deployment. It declaratively defines Canary/Blue-Green steps and analysis, evaluating metrics (e.g., from Prometheus) via an AnalysisTemplate to promote or roll back automatically.
  • Flagger: A progressive-delivery controller in the Flux ecosystem. It wraps an existing Deployment to run Canary analysis, integrating with service meshes/Ingress (Istio, Linkerd, NGINX, etc.).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  # replicas, selector, and template (or workloadRef) are omitted here for brevity
  strategy:
    canary:
      steps:
        - setWeight: 20            # send 20% of traffic to the new version
        - pause: { duration: 5m }  # observe metrics for 5 minutes
        - analysis:                # must pass automated analysis to proceed
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100

Relation to SLO / Error-Budget Gating

By tying the thresholds of automated metric analysis to an SLO (Service Level Objective), you can automatically block and roll back the moment a deployment threatens the SLO. You can also automate gating so that deployments proceed only while the error budget is sufficient and stop once the budget is exhausted. This makes the balance between deployment speed and reliability a metric-driven automatic decision rather than a matter of human intuition.
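
Error-budget gating reduces to a small calculation: the SLO target defines how many failures are allowed in a window, and the gate compares actual failures against that allowance. The sketch below assumes a request-based SLO; the function names and the example numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative when the budget is overspent."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("slo_target must be below 1.0 and total_requests must be positive")
    return 1.0 - failed_requests / allowed_failures


def deployment_allowed(remaining: float, min_remaining: float = 0.0) -> bool:
    """Gate: deploy only while more than min_remaining of the budget is left."""
    return remaining > min_remaining


# Example: a 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
healthy = error_budget_remaining(0.999, 1_000_000, 250)     # about 75% of the budget left
exhausted = error_budget_remaining(0.999, 1_000_000, 1200)  # budget overspent
```

Raising min_remaining above 0 keeps a reserve, so a deployment is blocked before the budget is fully spent rather than after.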

