Sumit Arora

Full-Stack Architect

Brisbane, Australia
February 2026
20 min read · DevOps · Part 4 of 9

DevOps, Automation, and Production Discipline

How code goes from a developer's laptop to serving real users — and why every step of that journey must be automated, repeatable, and reversible.

The Core Principle

In a production system, there is one rule above all others: if it is not automated, it will break. Manual processes depend on memory, attention, and consistency — three things that humans are terrible at under pressure. Automation does not get tired at 3 AM. It does not skip steps when rushed. It does not forget to check the database connection before deploying.

DevOps is the culture, the tools, and the discipline that makes this automation possible. It is not a job title (although it has become one). It is a set of practices that ensure software can be built, tested, deployed, and operated reliably.

1. The CI/CD Pipeline — From Commit to Production

Every line of code travels this path

CI/CD stands for Continuous Integration and Continuous Delivery (or Deployment). It is the automated assembly line that takes a developer's code change and moves it through testing, building, and deployment stages until it reaches production. Here is what a real pipeline looks like for a microservice in a compliance platform.

Stage 1: Code Push & Pull Request

What happens

Developer pushes code to a feature branch and creates a Pull Request (PR).

Automation

GitHub Actions triggers automatically on PR creation.

Why it matters

No code reaches the main branch without review and automated checks.

Stage 2: Automated Testing

What happens

Lint checks (code style), unit tests, integration tests (using Testcontainers for real database/queue testing), static analysis (SonarQube for code quality and security vulnerabilities).

Automation

All tests run in parallel. If any fail, the PR is blocked from merging.

Why it matters

Catch bugs before they reach any environment. A test that catches a bug here saves a 2 AM incident later.
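A PR-blocking unit test is nothing more than an assertion the pipeline refuses to merge past. A minimal sketch of the shape (the `validate_abn` check is a hypothetical example, not taken from any real platform):

```python
# Sketch: the kind of unit test the pipeline runs on every PR.
# validate_abn is a hypothetical domain check (11-digit Australian Business Number).

def validate_abn(abn: str) -> bool:
    """Return True if the string looks like an 11-digit ABN (format only, no checksum)."""
    digits = abn.replace(" ", "")
    return len(digits) == 11 and digits.isdigit()

# In a real repo these live under tests/ and run via pytest in CI;
# a single failing assert blocks the merge.
def test_validate_abn():
    assert validate_abn("51 824 753 556")
    assert not validate_abn("1234")
    assert not validate_abn("51 824 753 55X")

test_validate_abn()
```

The point is not the check itself but the gate: the merge button stays disabled until every one of these passes.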

Stage 3: Build & Package

What happens

Compile code → Build Docker image → Push to container registry (ECR). Image is tagged with the Git commit hash for traceability.

Automation

Triggered automatically when PR is merged to the main branch.

Why it matters

Every deployment artifact is immutable and traceable to a specific code commit.
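The traceability rule above amounts to one convention: the image tag is the commit. A tiny sketch (registry URL and service name are illustrative placeholders):

```python
# Sketch: tagging a Docker image with the Git commit hash for traceability.
# Registry and service names below are hypothetical.

def image_tag(registry: str, service: str, commit_sha: str) -> str:
    """Build an immutable image reference tied to one specific commit."""
    if len(commit_sha) < 12:
        raise ValueError("expected a full or abbreviated Git SHA")
    # A 12-char short SHA is unique enough in practice and keeps tags readable.
    return f"{registry}/{service}:{commit_sha[:12]}"

tag = image_tag(
    "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com",  # hypothetical ECR registry
    "compliance-api",
    "9f8c2d1e4b7a6c5d3e2f1a0b9c8d7e6f5a4b3c2d",
)
print(tag)  # ends in :9f8c2d1e4b7a
```

Given any running container, you can now answer "exactly which commit is this?" by reading its tag.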

Stage 4: Deploy to Staging

What happens

Deploy the new Docker image to the staging environment. Run end-to-end tests (Cypress for frontend, API integration tests for backend). Staging mirrors production configuration.

Automation

Automatic deployment to staging after successful build.

Why it matters

Staging is your last safety net before production. It must be as close to production as possible.

Stage 5: Deploy to Production

What happens

Manual approval gate → Rolling deployment via ECS → Health check verification → Connection draining from old containers → Old containers terminated.

Automation

The deployment itself is automated. Only the approval is manual.

Why it matters

A human decides WHEN to deploy. The system decides HOW. This separation prevents both "forgot to run tests" and "deployed at the worst possible time".

The Golden Rule of Pipelines

A developer should be able to merge a PR and walk away. The pipeline handles everything else. If the pipeline fails, the deployment stops automatically. If the deployment fails health checks, it rolls back automatically. No human intervention required for the unhappy path.

2. Infrastructure as Code — Everything Is a Git Commit

Terraform, CloudFormation, and why clicking buttons in a console is dangerous

In a production environment, every piece of infrastructure — VPCs, subnets, security groups, load balancers, databases, DNS records, SSL certificates, IAM roles — is defined in code files (typically Terraform HCL or AWS CloudFormation YAML). This code is stored in Git, reviewed in pull requests, and applied through automated pipelines.

Without IaC (Manual)

  • "Who changed the security group last Tuesday?"
  • "The staging environment doesn't match production"
  • "We can't recreate this environment — nobody remembers all the settings"
  • "Someone accidentally deleted the NAT Gateway"
  • Configuration drift: production slowly diverges from what you think it is

With IaC (Automated)

  • Every change is in Git history with who, when, and why
  • "terraform plan" shows exactly what will change before you apply
  • Recreate any environment in minutes from code
  • Deleted something? Revert the Git commit and re-apply
  • Staging and production are provably identical (same code, different variables)

# Example: Terraform defining an RDS database for a compliance platform

resource "aws_db_instance" "compliance_primary" {
  identifier           = "compliance-db-primary"
  engine               = "postgres"
  engine_version       = "15.4"
  instance_class       = "db.r6g.xlarge"
  allocated_storage    = 100
  multi_az             = true          # Automatic failover to standby AZ
  
  db_subnet_group_name = aws_db_subnet_group.data.name
  vpc_security_group_ids = [aws_security_group.rds.id]
  
  backup_retention_period = 14         # 14 days of automated backups
  storage_encrypted       = true       # Encryption at rest
  
  performance_insights_enabled = true  # Query performance monitoring
  
  tags = {
    Environment = var.environment
    Service     = "compliance-platform"
    ManagedBy   = "terraform"
  }
}

The Terraform Workflow

  • terraform plan → shows exactly what will change (create 2 resources, modify 1, destroy 0). You review this like a code diff.
  • terraform apply → executes the changes.
  • terraform state → tracks what Terraform has created so it knows the current state.
  • Remote state with locking → stored in S3 with a DynamoDB lock so two people cannot apply changes simultaneously.
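The remote state setup described above might look like this in HCL (bucket, key, and table names are illustrative, not from a real deployment):

```hcl
terraform {
  backend "s3" {
    bucket         = "compliance-terraform-state"        # hypothetical state bucket
    key            = "platform/production/terraform.tfstate"
    region         = "ap-southeast-2"
    encrypt        = true                                # state can contain sensitive values
    dynamodb_table = "terraform-state-lock"              # lock table blocks concurrent applies
  }
}
```

With this in place, a second `terraform apply` started while one is running fails fast on the lock instead of corrupting state.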

3. Deployment Strategies — How to Ship Without Breaking Things

Rolling, blue-green, canary, and when to use each

Rolling Deployment (Most Common)

Replace containers one at a time. New container starts → health check passes → old container drains connections → old container terminates. Repeat until all containers are running the new version.

Zero downtime · Automatic rollback on health check failure · Used by: ECS Fargate, Kubernetes

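The rolling loop described above reduces to a few lines of control logic. A simplified model (container naming and the injected health check are illustrative, not any orchestrator's real API):

```python
# Sketch: the control loop behind a rolling deployment, reduced to its essence.
# health_check is injected so the logic can be exercised without real containers.

from typing import Callable, List

def rolling_deploy(
    containers: List[str],
    new_version: str,
    health_check: Callable[[str], bool],
) -> List[str]:
    """Replace containers one at a time; stop (leaving old ones serving) on failure."""
    result = list(containers)
    for i, old in enumerate(result):
        candidate = f"{old.split('@')[0]}@{new_version}"
        if not health_check(candidate):
            # Health check failed: halt the rollout, remaining old containers keep serving.
            return result
        # New container is healthy: drain connections from the old one, then replace it.
        result[i] = candidate
    return result

fleet = ["api-1@v1", "api-2@v1", "api-3@v1"]
print(rolling_deploy(fleet, "v2", health_check=lambda c: True))
```

Because only one container is in flight at a time, a bad build takes down at most one slot of capacity before the rollout stops itself.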
Blue-Green Deployment

Run two identical environments: Blue (current production) and Green (new version). Deploy to Green. Test it. Then switch the load balancer to point to Green. If something is wrong, switch back to Blue instantly.

Instant rollback (just switch back) · Costs 2x infrastructure during deployment · Best for: critical services, database migrations

Canary Deployment

Deploy the new version to a small percentage of traffic (e.g., 5%). Monitor error rates and latency. If metrics are healthy, gradually increase to 25%, 50%, 100%. If metrics degrade, automatically roll back.

Lowest risk · Catches issues that only appear under real traffic · Best for: user-facing services with high traffic

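The canary ramp above is essentially a loop over traffic percentages with an error budget. A sketch (the metrics lookup is injected; thresholds and step sizes are illustrative):

```python
# Sketch: canary promotion logic. error_rate_at is injected; in production it
# would query your metrics system. Thresholds and steps are illustrative.

from typing import Callable

TRAFFIC_STEPS = [5, 25, 50, 100]   # percent of traffic on the new version
ERROR_BUDGET = 0.01                # roll back if error rate exceeds 1%

def run_canary(error_rate_at: Callable[[int], float]) -> str:
    for pct in TRAFFIC_STEPS:
        if error_rate_at(pct) > ERROR_BUDGET:
            return f"rolled back at {pct}%"
        # Metrics healthy at this step: promote to the next traffic percentage.
    return "promoted to 100%"

print(run_canary(lambda pct: 0.002))                          # healthy throughout
print(run_canary(lambda pct: 0.05 if pct >= 25 else 0.002))   # degrades at 25%
```

In a real rollout each step would also wait long enough to collect meaningful metrics before deciding.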
4. Secrets Management — The Number One Security Failure

Never hardcode credentials. Never commit secrets to Git. Here is what you do instead.

A "secret" is any credential that grants access to something: database passwords, API keys, JWT signing keys, encryption keys, SSH keys, third-party service tokens. The number one cause of security breaches is secrets being stored where they should not be — in source code, in configuration files committed to Git, in plain text environment variables.

Approach | Security Level | How It Works
Hardcoded in source code | 🔴 Critical vulnerability | Anyone who sees the code sees the password. If the repo is ever public, game over.
Environment variables (plain) | 🟡 Better but risky | Not in code, but visible in container configs, process listings, and crash dumps.
AWS Secrets Manager / Vault | 🟢 Production standard | Secrets stored encrypted, retrieved at runtime via API. Access controlled by IAM. Audit logged. Rotation automated.
Dynamic secrets (Vault) | 🟢 Best practice | Vault generates a temporary database credential that expires in 1 hour. Even if stolen, it is useless after expiry.
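The runtime-retrieval pattern in the table can be sketched as a small provider with an injected fetch function (in production the fetch would call AWS Secrets Manager or Vault; the names and fake backend here are hypothetical):

```python
# Sketch: retrieving a secret at runtime instead of baking it into code or images.
# The fetch function is injected so the pattern is testable; in production it would
# call Secrets Manager or Vault over an IAM-gated API.

from typing import Callable, Dict

class SecretProvider:
    """Fetch-on-demand with an in-process cache; no secret ever lands in Git."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch
        self._cache: Dict[str, str] = {}

    def get(self, name: str) -> str:
        if name not in self._cache:
            self._cache[name] = self._fetch(name)  # network call in real use
        return self._cache[name]

# A fake backend standing in for Secrets Manager in this sketch:
backend = {"compliance/db-password": "s3cr3t-rotated-daily"}
secrets = SecretProvider(fetch=lambda name: backend[name])

db_password = secrets.get("compliance/db-password")
```

Rotation then becomes a backend concern: the secret changes in the vault, and the application picks up the new value on its next fetch without a code change.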

Real-World Consequence

Banking and financial institutions have faced major breaches from hardcoded credentials exposed in repositories. In regulated industries like compliance, a secret leak can result in unauthorised access to sensitive personal data, identity documents, and financial records — triggering mandatory breach notifications, regulatory investigations, and potential criminal liability.

5. Database Migrations — The Scariest Part of Any Deployment

Changing a running database without downtime

Deploying new application code is relatively safe — if it breaks, you roll back the container. But database schema changes are different. You cannot easily "un-add" a column from 2 million rows. Database migrations must be planned as carefully as surgery.

Safe Migrations (Non-Breaking)

Can be deployed via automated pipeline

  • Adding a new nullable column
  • Adding a new index
  • Adding a new table
  • Adding a new enum value

Dangerous Migrations (Breaking)

Require manual planning and a maintenance window

  • Dropping a column that old code still reads
  • Renaming a column
  • Changing a column type
  • Adding a NOT NULL constraint to an existing column

The Expand-Contract Pattern

Need to rename a column from client_name to customer_name? Do it in three deployments: (1) Add the new column, write to both. (2) Migrate existing data, read from new column. (3) Remove the old column. Each step is independently deployable and rollback-safe. It takes longer, but it never breaks production.
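The three steps above can be shown end to end against SQLite. Real migrations would be separate versioned files run by a tool like Flyway or Alembic; this sketch compresses the three deployments into one script purely to show the SQL shape:

```python
# Sketch: the expand-contract rename (client_name -> customer_name) on SQLite.
# In production each phase below would be its own deployment, days apart.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, client_name TEXT)")
db.execute("INSERT INTO clients (client_name) VALUES ('Acme Pty Ltd')")

# Deployment 1 (expand): add the new column; application writes to BOTH columns.
db.execute("ALTER TABLE clients ADD COLUMN customer_name TEXT")

# Deployment 2 (migrate): backfill old rows, then the app reads ONLY the new column.
db.execute("UPDATE clients SET customer_name = client_name WHERE customer_name IS NULL")

# Deployment 3 (contract): once no code touches the old column, drop it.
db.execute("ALTER TABLE clients DROP COLUMN client_name")  # needs SQLite >= 3.35

row = db.execute("SELECT customer_name FROM clients").fetchone()
print(row[0])  # Acme Pty Ltd
```

Every phase is individually rollback-safe: at any point the previous application version still finds the columns it expects.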

6. The Production Readiness Checklist

Before any service goes live, these must be in place

Category | Requirement | Why
Deployment | CI/CD pipeline with automated tests | No manual deployments to production
Deployment | Rollback procedure tested and documented | You will need it at the worst possible moment
Monitoring | Health check endpoints (liveness + readiness) | Load balancer and orchestrator depend on these
Monitoring | Dashboards for key metrics | You need to see system state at a glance
Monitoring | Alerts with severity levels and runbooks | Every alert must be actionable
Logging | Structured JSON logs with correlation IDs | You will need to trace requests across services
Security | No secrets in code or environment variables | Use Secrets Manager or Vault
Security | HTTPS everywhere, encryption at rest | Compliance requirement in regulated industries
Reliability | Multi-AZ deployment | Survives an availability zone outage
Reliability | Automated backups with tested restore | A backup you have never tested is not a backup
Documentation | Runbooks for every alert | On-call engineer needs clear instructions
Documentation | Architecture decision records (ADRs) | Why decisions were made, not just what
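The logging requirement in the checklist is worth a concrete sketch: one JSON object per line, every line carrying the request's correlation ID. Field names here are illustrative; any consistent schema your log pipeline can parse works:

```python
# Sketch: structured JSON log lines carrying a correlation ID across services.
# The ID is generated at the edge (API gateway or first service) and propagated
# downstream in a header such as X-Correlation-Id (header name illustrative).

import datetime
import json
import uuid

def log_event(level: str, message: str, correlation_id: str, **fields) -> str:
    """Emit one JSON log line; the correlation ID ties one request's logs together."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "correlation_id": correlation_id,
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line

cid = str(uuid.uuid4())
log_event("INFO", "verification started", cid,
          service="compliance-api", customer_ref="c-1042")
```

Searching your log aggregator for that one ID then returns the full cross-service story of a single request, which is exactly what you need at 3 AM.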

Automation Is Not Optional — It Is Survival

Every manual step in your deployment process is a future incident waiting to happen. Every secret stored in plain text is a breach waiting to be discovered. Every database migration without a rollback plan is a career-defining moment waiting to arrive at 3 AM.

The best platform engineers are not the ones who can fix production fastest — they are the ones who automate so thoroughly that production rarely breaks in the first place. That is the discipline. That is the craft.

Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.

This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.

We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.

Reach out: sumit@getpostlabs.io