Sumit Arora

Full-Stack Architect

Brisbane, Australia
February 2026
18 min read | Platform Engineering | Part 1 of 9

What Platform Engineering Really Means — And What Nobody Tells You in College

Building software is one thing. Keeping it alive, stable, and serving thousands of users at 3 AM on a Sunday — that is an entirely different discipline. This is what platform engineering is about.

The Gap Nobody Talks About

In university, you learn to write code. You learn algorithms, data structures, maybe some web development. You build a project, demo it, and you are done. But in the real world, shipping code is where the work begins, not where it ends.

Imagine a compliance platform — something like an AML/CTF system that real estate agents, lawyers, and accountants use every day to meet government regulations. It runs microservices, processes sensitive customer data, connects to government verification APIs, and must maintain a 7-year audit trail. Now imagine it goes down on a Tuesday morning when 200 firms are trying to submit compliance reports. That is the problem platform engineering solves.

1

What Is Platform Engineering?

The real definition — not the buzzword version

Platform engineering is the discipline of building and maintaining the foundation that software runs on. It is not about writing the application code itself — it is about ensuring that code can be deployed reliably, scaled when demand increases, recovered when things break, and monitored so you know what is happening at all times.

Think of it like this: if developers are the architects and builders of a house, platform engineers are the people who ensure the electricity works, the plumbing does not burst, the fire alarms are installed, and the foundation can handle an earthquake. Without them, the house might look beautiful but it will not survive its first storm.

A Working Definition

Platform engineering covers infrastructure provisioning, deployment automation, monitoring and observability, incident response, reliability, security hardening, and cost optimisation of the systems that production applications run on.

In a real compliance platform, this means ensuring that the authentication service never goes down (because locked-out users cannot submit regulatory reports), that the audit trail database maintains integrity across millions of records, that message queues processing risk assessments do not silently drop messages, and that a deployment of a bug fix does not take down the entire system.

2

What a Real Production System Looks Like

A compliance platform as a working example

To make this concrete, let us walk through what a production-grade compliance platform looks like from a platform engineering perspective. This is not hypothetical — it reflects the kind of architecture companies actually build for regulated industries.

Frontend Layer

Angular or React SPA served via CDN (CloudFront). Static assets are cached at edge locations globally, so users are served from the nearest edge location rather than from your origin.

API Gateway

All requests pass through an API Gateway that verifies JWTs, enforces rate limits per tenant, validates request schemas, and routes each request to the correct microservice.
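To make per-tenant rate limiting concrete, here is a minimal token-bucket sketch. It is an illustration only, not how any particular gateway product implements limits; the class name, the tenant IDs, and the injectable clock are assumptions made for the example.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: bursts up to `capacity`, refilled at `rate_per_sec`."""

    def __init__(self, capacity: int, rate_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate_per_sec = rate_per_sec
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._buckets = {}  # tenant_id -> (tokens_left, last_seen_time)

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(tenant_id, (float(self.capacity), now))
        # Refill for the time elapsed since the last request, up to capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate_per_sec)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[tenant_id] = (tokens, now)
        return allowed


if __name__ == "__main__":
    limiter = TenantRateLimiter(capacity=5, rate_per_sec=1.0)
    # Hypothetical tenant ID; a real gateway would take it from the verified JWT.
    results = [limiter.allow("firm-a") for _ in range(7)]
    print(results)
```

Keeping one bucket per tenant means a single noisy firm exhausts only its own allowance, not everyone else's.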

Microservices (6-12 services)

Each domain (identity, compliance cases, screening, reporting, audit, notifications) runs as a separate Docker container on ECS Fargate or Kubernetes. Each can scale independently.

Data Layer (4+ databases)

PostgreSQL for structured data, MongoDB for dynamic workflows, Redis for caching and sessions, OpenSearch for audit trail querying. Each has its own backup, scaling, and monitoring strategy.

Message Queues

SQS or RabbitMQ for async operations — screening checks, risk calculations, notification delivery, report generation. Messages must never be silently lost.

Infrastructure

VPC with public/private/data subnets across multiple availability zones. NAT gateways, load balancers, WAF rules, SSL certificates, DNS routing — all managed as code via Terraform.

The Complexity That Is Not Obvious

That is 6-12 microservices, 4 databases, message queues, a CDN, an API gateway, load balancers, container orchestration, and external API integrations — all running simultaneously across multiple availability zones. A single human cannot monitor all of this manually. This is why platform engineering exists as a dedicated discipline.

3

The Five Pillars of Platform Engineering

What you actually do every day

Every platform engineering role, regardless of company size or industry, revolves around these five areas. Job titles vary — you might be called a Platform Engineer, Site Reliability Engineer (SRE), Cloud Engineer, or DevOps Engineer — but the work maps to these same pillars.

Pillar 1: Infrastructure as Code (IaC)

Everything — VPCs, subnets, databases, load balancers, DNS records, firewall rules — is defined in code (Terraform, CloudFormation, Pulumi). Nothing is created by clicking buttons in a console. This means infrastructure is version-controlled, reviewable, repeatable, and recoverable.

Why it matters: If production infrastructure is accidentally deleted, you can recreate the same networks, instances, and configuration from code in minutes, then restore the data itself from backups. If someone changes a firewall rule incorrectly, you can see exactly what changed in the Git history and revert it.
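The idea behind declarative infrastructure can be sketched in a few lines: compare what the code declares against what actually exists, and list the differences. This is a toy illustration of the concept, not how Terraform or any other tool computes its plan; the resource names and attributes are hypothetical.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare declared resources against what exists and list the changes needed."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "delete": sorted(actual.keys() - desired.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
    }


# What the version-controlled code declares (hypothetical resources).
desired = {
    "vpc-main": {"cidr": "10.0.0.0/16"},
    "sg-api": {"ingress_ports": [443]},
    "queue-notifications": {"encrypted": True},
}

# What currently exists in the cloud account.
actual = {
    "vpc-main": {"cidr": "10.0.0.0/16"},
    "sg-api": {"ingress_ports": [443, 22]},  # someone opened SSH by hand
}

print(plan(desired, actual))
# {'create': ['queue-notifications'], 'delete': [], 'update': ['sg-api']}
```

The manual change to the security group shows up as drift, which is the property that makes infrastructure reviewable and revertible.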

Pillar 2: CI/CD and Deployment Automation

Every code change goes through an automated pipeline: lint, test, build, push to container registry, deploy to staging, run integration tests, then deploy to production with a manual approval gate. No one SSHs into a server and runs commands manually.

Why it matters: In a compliance platform, a bad deployment could mean firms cannot submit regulatory reports — which has legal consequences. Automated pipelines with approval gates, canary deployments, and instant rollback are not optional luxuries; they are requirements.
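As a rough sketch of the pipeline shape described above, the function below runs stages in order, stops at the first failure, and only deploys to production once an approval callback says yes. Real pipelines live in tools such as Jenkins or GitHub Actions; the function and the lambda stages here are assumptions for illustration.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage], approve: Callable[[], bool]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Production is only deployed if every stage passed and `approve()`
    returns True, which stands in for the manual approval gate.
    """
    log = []
    for name, step in stages:
        if not step():
            log.append(f"{name}: FAILED")
            return log  # later stages never run
        log.append(f"{name}: ok")
    if not approve():
        log.append("production: awaiting approval")
        return log
    log.append("production: deployed")
    return log


# Placeholder stages that always pass; real ones would call out to tooling.
stages = [
    ("lint", lambda: True),
    ("test", lambda: True),
    ("build", lambda: True),
    ("push to registry", lambda: True),
    ("deploy to staging", lambda: True),
    ("integration tests", lambda: True),
]

for line in run_pipeline(stages, approve=lambda: True):
    print(line)
```

The important design property is the early return: a failed test means nothing downstream of it can reach production.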

Pillar 3: Monitoring and Observability

You cannot fix what you cannot see. Every service emits metrics (CPU, memory, request latency, error rates), logs (structured JSON with correlation IDs), and traces (distributed tracing across microservices). Dashboards show real-time system health. Alerts fire when thresholds are breached.

Why it matters: When a message queue starts backing up because the screening service is slow, you need to know about it before users start complaining. Observability means you can trace a single user request across 6 microservices and identify exactly where it got stuck.
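Structured logs with correlation IDs are easier to picture with an example. The sketch below uses only Python's standard `logging` and `json` modules; the field names and the service name are assumptions, and a production setup would usually ship these lines to a log aggregator rather than print them.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None if the caller did not supply one.
            "correlation_id": getattr(record, "correlation_id", None),
        })


def make_logger(service: str, stream=None) -> logging.Logger:
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# One ID is generated at the edge and passed along with the request,
# so every service logs the same value for the same user action.
correlation_id = str(uuid.uuid4())
log = make_logger("screening-service")
log.info("screening started", extra={"correlation_id": correlation_id})
```

Searching the aggregated logs for a single correlation ID then returns the full path of one request across every service it touched.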

Pillar 4: Reliability and Fault Tolerance

Things will break. The database will have a connection spike. An external API will time out. A container will run out of memory. Platform engineering means designing for failure: multi-AZ deployments, health checks, circuit breakers, retry logic with exponential backoff, dead letter queues for failed messages, and automated failover.

Why it matters: In a compliance system, if the government's identity verification service (DVS) goes down, cases must be queued and flagged — never silently skipped. A platform engineer designs the system so that external failures do not cascade into internal failures.
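A minimal sketch of "retry, then queue and flag" might look like the following. It assumes the external call raises `ConnectionError` on failure and uses a plain list as a stand-in for a dead letter queue; the function name, the status value, and the backoff parameters are all hypothetical.

```python
import random
import time


def verify_with_retry(case_id, call_dvs, dead_letter, max_attempts=4,
                      base=0.5, cap=30.0, sleep=time.sleep):
    """Call the external check with exponential backoff and jitter.

    If every attempt fails, the case is parked in `dead_letter` and flagged
    for review instead of being silently dropped.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call_dvs(case_id)
        except ConnectionError as exc:
            last_error = str(exc)
            if attempt < max_attempts - 1:
                # Full jitter: wait a random time up to the capped exponential,
                # so many clients do not retry in lockstep.
                sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    dead_letter.append({
        "case_id": case_id,
        "status": "FLAGGED_FOR_REVIEW",
        "error": last_error,
    })
    return None


if __name__ == "__main__":
    def dvs_down(case_id):
        raise ConnectionError("DVS unavailable")

    parked = []
    # sleep is replaced with a no-op so the demo finishes instantly.
    verify_with_retry("case-001", dvs_down, parked, sleep=lambda s: None)
    print(parked)
```

The caller gets `None` rather than an exception, and the failed case is still visible in the dead letter store, which is what keeps an external outage from turning into lost work.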

Pillar 5: Security and Compliance

Defence in depth: WAF at the edge, API gateway authentication, service-to-service authentication, method-level role-based access control, row-level security in databases, encryption at rest and in transit, secrets management (Vault or AWS Secrets Manager), and regular security scanning of container images.

Why it matters: A compliance platform stores sensitive personal information and financial data. A single breach does not just cost money — it can result in regulatory action, loss of licences, and criminal liability. Security is not a feature; it is a foundational layer.
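Method-level role-based access control is one of the easier layers to illustrate. The decorator below is a small sketch, assuming the user is a dictionary carrying a list of roles; the role names and the `approve_case` function are hypothetical, and a real service would take roles from a verified token rather than a plain dict.

```python
import functools


class PermissionDenied(Exception):
    """Raised when the caller holds none of the roles a method requires."""


def require_role(*allowed_roles):
    """Reject the call unless the user holds at least one allowed role."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user, *args, **kwargs):
            if not set(user.get("roles", [])) & set(allowed_roles):
                raise PermissionDenied(
                    f"{user.get('id')} may not call {func.__name__}"
                )
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@require_role("compliance_officer", "admin")
def approve_case(user, case_id):
    return f"case {case_id} approved by {user['id']}"


officer = {"id": "u1", "roles": ["compliance_officer"]}
print(approve_case(officer, 42))  # case 42 approved by u1
```

This check is only one layer; in a defence-in-depth design the gateway and the database would enforce their own rules independently.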

4

What Companies Actually Want

Decoded from real job postings

We analysed dozens of real job postings for Cloud Engineer, DevOps Engineer, Platform Engineer, and SRE roles from companies like Google, Citi, Barclays, GE HealthCare, Workday, and others. Here is what consistently appears — and what it actually means in practice.

| What the Job Posting Says | What It Actually Means | Which Pillar |
| --- | --- | --- |
| Experience with IaC tools (Terraform, CloudFormation) | You can define entire cloud environments in code and deploy them reproducibly | Infrastructure as Code |
| CI/CD pipeline experience (Jenkins, GitHub Actions) | You can build automated build-test-deploy workflows end to end | Deployment Automation |
| Monitoring tools (Prometheus, Grafana, Datadog, CloudWatch) | You can set up dashboards, alerts, and trace requests across services | Observability |
| Container orchestration (Docker, Kubernetes, EKS) | You can package, deploy, scale, and troubleshoot containerised applications | Infrastructure + Reliability |
| Incident response and root cause analysis | You can diagnose production problems under pressure and prevent recurrence | Reliability |
| Strong scripting (Python, Bash) | You can automate operational tasks, not just write application code | Automation |
| AWS services (EC2, VPC, IAM, RDS, S3, Lambda) | You understand how cloud services fit together to form a production system | Infrastructure |
| Security best practices (IAM, RBAC, secrets management) | You can implement least-privilege access, manage secrets safely, enforce encryption | Security |
| High-availability and fault-tolerant systems | You design systems that keep working when components fail | Reliability |
| ITIL processes (incident, change, release management) | You follow structured processes for handling production changes and incidents | Operations Discipline |

The Pattern

Notice how these requirements map back to the five pillars, with scripting and process discipline cutting across all of them. Companies are not asking for random skills; they are asking for people who can keep production systems alive. The tools change (Terraform today, something else tomorrow), but the pillars remain constant.

5

Same Work, Different Titles

Understanding the job market

The industry has not standardised job titles. These roles overlap significantly, and what matters is the work, not the title on your business card.

Cloud Engineer

Focus: AWS/Azure/GCP infrastructure design and management

Salary range (India): ₹6-35 LPA depending on experience

DevOps Engineer

Focus: CI/CD pipelines, automation, bridging dev and ops

Most common title in job postings globally

Platform Engineer

Focus: Internal developer platforms, golden paths, self-service infrastructure

Growing fastest — the evolution of DevOps

Site Reliability Engineer (SRE)

Focus: System reliability, SLOs/SLAs, incident management, error budgets

Coined by Google. Highest emphasis on reliability metrics
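Error budgets are simple arithmetic once an SLO is fixed. The sketch below assumes an availability SLO measured over a rolling window of days; the function names and the 30-day default are assumptions for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget to spend")
    return (budget - downtime_minutes) / budget


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(99.9), 1))

# After a 20-minute outage, a little over half the budget is left.
print(round(budget_remaining(99.9, downtime_minutes=20), 2))
```

Teams typically use the remaining fraction as a decision input: plenty of budget left means releases continue, while a nearly spent budget shifts effort towards reliability work.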

6

A Day in the Life

What you actually do — not what LinkedIn says

9:00 AM

Check overnight alerts. A CloudWatch alarm fired at 2:17 AM — RDS connection pool hit 85% capacity. It auto-resolved but you investigate why.

9:30 AM

Review a pull request on a Terraform change — a teammate is adding a new SQS queue for the notification service. You check IAM permissions, dead letter queue config, and encryption settings.

10:30 AM

A developer reports that their service is getting 502 errors in staging. You check the ALB target group health checks and find that the new container is failing its health check because it depends on a config value that was not set in staging.

11:30 AM

Sprint planning. The team wants to migrate from self-managed OpenSearch to OpenSearch Serverless to reduce costs. You estimate the work and identify risks (query compatibility, ingestion patterns).

1:00 PM

Write a runbook for the new deployment pipeline. When this service is deployed, what should the on-call engineer check? What are the rollback steps? What dashboards should they watch?

2:30 PM

Production deployment. You watch the rolling update in ECS — new containers come up, health checks pass, old containers drain connections and terminate. Zero downtime. You verify key metrics for 15 minutes post-deploy.

3:30 PM

Work on automating SSL certificate rotation. Currently it is manual every 90 days — you are building a Lambda function that handles it automatically and alerts if renewal fails.

4:30 PM

Update Grafana dashboards. The new screening service needs panels for queue depth, processing latency, and external API response times. You add alert rules: if queue depth exceeds 1000 for 5 minutes, page the on-call engineer.
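The alert rule in the last entry ("queue depth above 1000 for 5 minutes") can be sketched as a small function. This is an approximation for illustration, not Grafana's actual evaluation logic; the function name and the sample format are assumptions.

```python
from typing import Iterable, Tuple


def should_page(samples: Iterable[Tuple[float, int]], threshold: int = 1000,
                duration_s: float = 300.0) -> bool:
    """True if depth stayed above `threshold` for at least `duration_s` seconds.

    `samples` are (timestamp_seconds, queue_depth) pairs in time order.
    """
    breach_started = None
    for ts, depth in samples:
        if depth > threshold:
            if breach_started is None:
                breach_started = ts
            if ts - breach_started >= duration_s:
                return True
        else:
            breach_started = None  # any recovery resets the timer
    return False


# One sample per minute: sustained breach versus a breach with a brief recovery.
sustained = [(t, 1200) for t in range(0, 301, 60)]
recovered = [(0, 1200), (60, 1200), (120, 900), (180, 1200), (240, 1200), (300, 1200)]

print(should_page(sustained))  # True
print(should_page(recovered))  # False
```

Requiring the condition to hold for a duration, rather than firing on a single sample, is what keeps short spikes from paging the on-call engineer.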

Notice what is missing?

Not a single task involves writing application features. This is infrastructure work, operational work, reliability work. It requires deep technical knowledge but applied to a fundamentally different problem: keeping things running, not building new things.

7

This Series

What we cover in Parts 2-9

This article is Part 1 of a 9-part series on platform engineering and cybersecurity. Each article goes deep into a specific pillar, with real-world examples and practical guidance.

The Bottom Line

Platform engineering is not glamorous. Nobody writes blog posts about the deployment that went perfectly or the alert that fired and was resolved before users noticed. But every successful software product you use — every banking app, every streaming service, every compliance platform — has platform engineers working behind the scenes to keep it alive.

If you are the kind of person who wants to understand how things work at a systems level, who gets satisfaction from building things that are resilient and reliable, and who can stay calm when production is on fire — this might be your career.

Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.

This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.

We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.

Reach out: sumit@getpostlabs.io