Sumit Arora

Full-Stack Architect

Brisbane, Australia
February 2026
25 min read · Platform Engineering · Part 5 of 9

What a Platform Engineer, SRE, or Cloud Engineer Actually Knows

This is not about getting hired. This is about what the craft looks like from the inside — the knowledge that builds over years, the depth that separates someone who manages infrastructure from someone who truly understands it, and how the role evolves as experience accumulates.

Three Titles, One Craft

Platform Engineer, Site Reliability Engineer (SRE), Cloud Engineer — these titles appear on different job postings but describe overlapping skill sets. The boundaries are blurry and vary by company. What they share is this: they are all responsible for the systems that production applications run on — the infrastructure, the reliability, the automation, the security, and the operational discipline that keeps everything alive.

What follows is our perspective on what this role actually looks like — the layers of knowledge that build over time, the depth at each stage, and what changes as you grow from someone learning the basics to someone designing systems that thousands of people depend on.

1

The Knowledge Layers — What You Actually Need to Understand

Ranked by how frequently they appear in real production work

When you study how companies build and operate production platforms, from compliance systems to financial services to SaaS products, certain skills appear consistently. Not because someone wrote them in a job description, but because production systems demand them.

Cloud Infrastructure (AWS, Azure, or GCP) · 95% of production environments

Understanding how cloud services fit together. On AWS, for example, that means VPC, EC2/ECS, RDS, S3, IAM, Lambda, and CloudWatch. The point is not just knowing what each service does, but understanding why you would choose one over another and how they interact in a real architecture.

Infrastructure as Code (Terraform, CloudFormation) · 85% of production environments

Defining infrastructure in version-controlled code. Modules, state management, remote backends, plan/apply workflows. The ability to look at a Terraform file and understand the entire infrastructure it creates.
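
One way to make the plan/apply workflow concrete is to inspect a plan before applying it. The sketch below assumes the JSON that `terraform show -json <planfile>` produces, where each entry in `resource_changes` carries an `address` and a `change.actions` list. The function name `destructive_changes` and the sample plan are our own illustrations, not part of Terraform.

```python
import json


def destructive_changes(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would delete or replace.

    Assumes the JSON plan representation, where a replacement shows up as
    both "delete" and "create" in the same actions list.
    """
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged


# Hypothetical, heavily trimmed plan used only for illustration.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_iam_role.app", "change": {"actions": ["no-op"]}},
        {"address": "aws_instance.old_worker", "change": {"actions": ["delete"]}},
    ]
})

for address in destructive_changes(SAMPLE_PLAN):
    print(f"REVIEW BEFORE APPLY: {address}")
```

A check like this could run as a pipeline gate, so that a plan which replaces a database needs a human to approve it.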

Containers and Orchestration (Docker, Kubernetes, ECS) · 80% of production environments

Building container images, writing Dockerfiles, understanding orchestration — services, deployments, health probes, resource limits. Knowing why a container keeps restarting and how to diagnose it.
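
When a container keeps restarting, the exit code is often the first clue. The sketch below relies on the common convention that a process killed by signal N reports exit code 128 + N, so 137 points to SIGKILL (frequently the kernel OOM killer or a forced stop) and 143 points to SIGTERM. The function name and the hint wording are ours, and the signal table is deliberately partial.

```python
# Partial table of Linux signal numbers that commonly show up in container exits.
SIGNALS = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

# Illustrative first places to look; not an exhaustive diagnosis.
HINTS = {
    "SIGKILL": "check memory limits and OOM events",
    "SIGSEGV": "the process crashed; look for native-code faults",
    "SIGTERM": "the runtime asked it to stop; check probes and rollouts",
}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short, human-readable lead."""
    if code == 0:
        return "exited cleanly; check why the main process finishes"
    if code > 128:
        signum = code - 128
        name = SIGNALS.get(signum, f"signal {signum}")
        hint = HINTS.get(name)
        return f"killed by {name}" + (f"; {hint}" if hint else "")
    return f"application error (exit code {code}); read the container logs"


for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```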

CI/CD Pipelines · 80% of production environments

Designing build-test-deploy pipelines. Understanding each stage, automated testing gates, deployment strategies (rolling, blue-green, canary), and what rollback actually means in practice.
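
A canary deployment ultimately comes down to a decision: promote, roll back, or keep waiting. The sketch below is one possible gate, assuming you can count requests and errors for the baseline and the canary over the same window. The thresholds (`min_requests`, `max_ratio`, `floor`) are hypothetical defaults chosen for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts observed over one comparison window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Window, canary: Window,
                    min_requests: int = 500,
                    max_ratio: float = 2.0,
                    floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" for the canary.

    The canary may run up to max_ratio times the baseline error rate.
    The floor keeps a near-zero baseline from making the gate impossible to pass.
    """
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


baseline = Window(requests=10_000, errors=10)
print(canary_decision(baseline, Window(requests=100, errors=0)))
print(canary_decision(baseline, Window(requests=1_000, errors=1)))
print(canary_decision(baseline, Window(requests=1_000, errors=50)))
```

Returning "rollback" is the easy part. What rollback means in practice depends on whether the release also changed schemas, queues, or caches that the previous version must still understand.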

Scripting and Automation (Python, Bash) · 75% of production environments

Not software engineering — operational automation. Writing scripts that rotate secrets, clean up stale resources, generate reports, and automate repetitive infrastructure tasks.
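
As a small illustration of operational automation, here is a sketch of a stale-resource cleanup that defaults to a dry run. The inventory format (dicts with `id`, `last_used`, and optional `tags`), the `keep` tag, and the 30-day threshold are all assumptions made for the example. In a real script the inventory and the `delete` callable would come from your cloud provider's SDK.

```python
from datetime import datetime, timedelta, timezone


def find_stale(resources, now, max_age_days=30, protect_tag="keep"):
    """Return IDs of resources unused for longer than max_age_days.

    Anything carrying protect_tag is skipped, whatever its age.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        r["id"]
        for r in resources
        if r["last_used"] < cutoff and protect_tag not in r.get("tags", [])
    ]


def cleanup(resources, delete, now=None, dry_run=True):
    """Report stale resources; call delete(id) only when dry_run is False."""
    now = now or datetime.now(timezone.utc)
    stale = find_stale(resources, now)
    for resource_id in stale:
        print(("would delete " if dry_run else "deleting ") + resource_id)
        if not dry_run:
            delete(resource_id)
    return stale


# Hypothetical inventory for illustration.
now = datetime(2026, 2, 1, tzinfo=timezone.utc)
inventory = [
    {"id": "vol-old", "last_used": now - timedelta(days=90), "tags": []},
    {"id": "vol-kept", "last_used": now - timedelta(days=90), "tags": ["keep"]},
    {"id": "vol-new", "last_used": now - timedelta(days=2)},
]
cleanup(inventory, delete=lambda resource_id: None, now=now)
```

Making the dry run the default is a deliberate choice: the destructive path has to be requested explicitly.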

Monitoring and Observability · 70% of production environments

Metrics, logs, traces. Designing dashboards that show system health at a glance. Writing alert rules that wake you up when it matters — and not when it does not. Understanding SLIs, SLOs, and error budgets.
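
The arithmetic behind an error budget is simple enough to sketch. A 99.9% availability SLO over 30 days leaves roughly 43.2 minutes of full downtime, and a request-based SLO leaves a fixed number of failed requests. The function names below are ours, and the 30-day window is an assumption.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    A negative result means the budget is overspent.
    """
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO or zero traffic leaves no budget to spend
    return 1 - failed_requests / allowed_failures


print(f"{error_budget_minutes(0.999):.1f} minutes of downtime allowed")
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget remaining")
```

Here the SLI is the measured ratio of good requests, the SLO is the target for that ratio, and the error budget is the gap between the target and 100%.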

Security and Access Control · 65% of production environments

IAM policies, secrets management, encryption at rest and in transit, security groups, WAF configuration, vulnerability scanning. Defence in depth — not as a buzzword, but as a design principle.
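
Least privilege can be checked mechanically, at least in part. The sketch below scans an IAM policy document for Allow statements that combine a wildcard action with `"Resource": "*"`. It is a simplified illustration: it ignores `NotAction`, `NotResource`, and conditions, and the function names and example policy are our own.

```python
def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def overly_broad_statements(policy: dict) -> list[dict]:
    """Return Allow statements that pair a wildcard action with Resource "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        if wildcard_action and "*" in resources:
            flagged.append(statement)
    return flagged


# Hypothetical policy for illustration; the bucket name is made up.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::example-reports/*"},
        {"Effect": "Deny", "Action": "*", "Resource": "*"},
    ],
}

for statement in overly_broad_statements(EXAMPLE_POLICY):
    print("too broad:", statement)
```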

Incident Management and Reliability · 55% of production environments

On-call practices, runbooks, post-incident reviews. Designing systems that degrade gracefully under failure. Circuit breakers, retry logic, dead letter queues. Understanding that failure is not a bug — it is a certainty to design for.
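
Retry logic is the smallest of these patterns to show in code. The sketch below retries a callable with exponential backoff and full jitter, and gives up after a fixed number of attempts so a failing dependency is not hammered indefinitely. The defaults and the choice of which exceptions count as transient are assumptions for the example.

```python
import random
import time


def retry(fn, attempts=4, base_delay=0.2, max_delay=5.0,
          sleep=time.sleep, retry_on=(ConnectionError, TimeoutError)):
    """Call fn until it succeeds or attempts run out.

    The delay cap doubles after each failure (0.2s, 0.4s, 0.8s, ...), and
    the actual sleep is a random value below the cap ("full jitter") so
    many clients do not retry in lockstep.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry(flaky, sleep=lambda seconds: None))
```

Retries only help with transient faults. Against a dependency that is down for minutes, a circuit breaker that stops calling altogether is usually the better fit.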

Networking Fundamentals · 50% of production environments

TCP/IP, DNS, VPC design, subnets, routing, security groups, NACLs. When traffic is not flowing, networking knowledge is what separates a 5-minute fix from a 5-hour investigation.
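
Subnet planning is a good place to check your mental arithmetic against code. The sketch below uses Python's standard `ipaddress` module to carve a VPC CIDR into equal subnets and to test two ranges for overlap. The helper names and the example ranges are ours.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr: str, new_prefix: int, count: int) -> list[str]:
    """Return the first `count` subnets of size /new_prefix inside vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [str(s) for s in islice(vpc.subnets(new_prefix=new_prefix), count)]
    if len(subnets) < count:
        raise ValueError(f"{vpc_cidr} cannot hold {count} /{new_prefix} subnets")
    return subnets


def overlaps(cidr_a: str, cidr_b: str) -> bool:
    """True if the two ranges share any address (a common peering blocker)."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


print(plan_subnets("10.0.0.0/16", 20, 3))
print(overlaps("10.0.0.0/16", "10.0.32.0/20"))
print(overlaps("10.0.0.0/16", "10.1.0.0/16"))
```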

Database Operations · 50% of production environments

Connection pooling, replication, failover, backup/restore, migration strategies. Not writing SQL — understanding how databases behave under load and what happens when the primary goes down.
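
To show what connection pooling means mechanically, here is a minimal fixed-size pool built on the standard library's `queue.Queue`. It is a teaching sketch, not a replacement for the pool in your database driver or a proxy such as PgBouncer: it does no health checking or reconnection, and `factory` stands in for whatever opens a real connection.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: callers borrow a connection and always give it back."""

    def __init__(self, factory, size, timeout=1.0):
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # open every connection up front

    @contextmanager
    def connection(self):
        try:
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            # Under load this is the failure you see: every connection is
            # busy and callers queue up until they time out.
            raise TimeoutError("connection pool exhausted") from None
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it even if the caller raised


# Hypothetical factory: plain objects stand in for database connections.
pool = ConnectionPool(factory=object, size=2, timeout=0.05)

with pool.connection() as first, pool.connection() as second:
    try:
        with pool.connection():
            pass
    except TimeoutError as exc:
        print("third caller:", exc)
```

The pool size is the real design decision. It caps the load each application instance can place on the database, which matters most when many instances scale out at once.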

2

What Changes With Experience — The Same Problem, Different Depth

How the same skill evolves as you grow

A junior platform engineer and a senior one both "know Terraform." But what that means in practice is wildly different. Here is what depth looks like across experience levels — not measured by years, but by the complexity of problems you can solve.

| Domain | Early (Learning the Tools) | Mid (Solving Real Problems) | Senior (Designing Systems) |
| --- | --- | --- | --- |
| Infrastructure | Can provision an EC2 instance. Understands what a VPC is. Follows tutorials. | Designs multi-AZ VPCs from scratch. Writes Terraform modules. Understands cost implications. | Defines infrastructure standards for the organisation. Reviews architecture proposals. Makes build-vs-buy decisions. |
| Deployment | Can trigger a CI/CD pipeline. Understands what "deploy" means. | Designs the pipeline. Implements rolling deployments with health checks and automatic rollback. | Defines deployment strategy across the org. Chooses between rolling, blue-green, canary based on risk profile. |
| Monitoring | Can read a Grafana dashboard. Understands what a metric is. | Designs dashboards. Writes meaningful alert rules. Implements SLOs. | Defines the observability strategy. Sets error budgets that balance reliability with velocity. |
| Incidents | Follows a runbook step by step. Can escalate. | Leads investigations. Writes runbooks. Conducts post-incident reviews. | Designs incident management culture. Builds systems that are resilient by design. |
| Security | Uses secrets manager instead of hardcoding credentials. | Designs IAM policies with least privilege. Configures WAF. Runs vulnerability scans. | Defines security architecture. Reviews for Essential Eight, ISO 27001, SOC 2 compliance. |
| Automation | Writes bash scripts that automate a single task. | Builds reusable automation across teams. Automates cert rotation, cleanup, reporting. | Designs the automation strategy. Builds internal developer platforms. |

The Pattern

Early career: you learn what the tools do. Mid career: you learn when to use them and why. Senior: you learn when not to use them and design the systems that make the right choice obvious for everyone else.

3

Platform Engineer vs SRE vs Cloud Engineer — What Is Actually Different?

Same family, different emphasis

Platform Engineer

Builds and maintains the internal platform that developers use to deploy and run their applications. Focuses on developer experience, self-service tooling, and infrastructure abstraction.

Thinks about: "How do I make it easy and safe for 50 developers to deploy without needing to understand VPC routing?"

Site Reliability Engineer (SRE)

Applies software engineering practices to operations problems. Defines SLOs, manages error budgets, designs for reliability, leads incident response. The term originated at Google.

Thinks about: "What is the acceptable failure rate for this service, and what happens when we exceed it?"

Cloud Engineer

Designs, builds, and manages cloud infrastructure. Deep expertise in one or more cloud providers. Focuses on architecture, cost optimisation, security, and cloud-native patterns.

Thinks about: "Should this be ECS Fargate or EKS? What are the cost, operational, and security trade-offs?"

The Reality

In most organisations — especially startups and mid-size companies — one person does all three. The titles differ, but the person writing Terraform on Monday is the same person debugging a production incident on Tuesday and setting up Grafana dashboards on Wednesday.

4

Certifications as Knowledge Markers

Not proof you can do the job — proof you understand the domain

Certifications do not make you an engineer. But they force you to learn the breadth of a domain systematically, which is valuable when building foundational knowledge. Here is a progression that mirrors how understanding deepens.

Month 1-2

Foundations

AWS Cloud Practitioner — Understand how cloud services fit together

AWS Solutions Architect Associate — Design production architectures

Month 3

Infrastructure

HashiCorp Terraform Associate — Infrastructure as Code, state, modules

Month 4-5

Security & Operations

HashiCorp Vault Associate — Secrets management, encryption, dynamic secrets

AWS SysOps Administrator or DevOps Engineer — Operational depth

Month 6+

Specialisation

CKA — Container orchestration depth

AWS Security Specialty — Cloud security architecture

Or CISSP / CompTIA Security+ if leaning into security

5

What the Progression Actually Looks Like

Not a career ladder — a widening of responsibility and judgement

Year 0-2: Learning the Craft

You are building foundational knowledge. You follow runbooks, you ask questions, you make mistakes in staging. You learn that production is not like your laptop. You discover that a 5-second database query in dev takes 45 seconds under real load.

What you know: How to provision infrastructure. How to read logs. How to follow a runbook. How to deploy code safely.

Typical compensation: ₹4-10 LPA (India) · $65-90K (AU) · $70-100K (US)

Year 2-5: Owning Problems

You stop following and start designing. You write the runbooks others follow. You lead incident investigations. When something breaks, people come to you — not because you know every answer, but because you know how to find it systematically.

What you know: Why this architecture was chosen over alternatives. How to debug across layers. How to design systems that fail gracefully.

Typical compensation: ₹15-35 LPA (India) · $110-160K (AU) · $130-180K (US)

Year 5-8: Shaping Systems

Your impact extends beyond your own work. You define standards that other teams follow. You review architecture proposals and spot the failure modes nobody else sees. You make decisions about which problems to solve with technology and which to solve with process.

What you know: When to build vs when to buy. How to balance reliability with development velocity. How to communicate technical risk to non-technical stakeholders.

Typical compensation: ₹35-60 LPA (India) · $160-220K (AU) · $180-280K (US)

Year 8+: Defining Culture

The systems you design outlast your tenure. You shape how the organisation thinks about reliability, security, and operational discipline. The most senior platform engineers do not write the most code — they create the environment where everyone else can do their best work safely.

What you know: How to build a culture of reliability. How to manage risk across an organisation. When to break the rules you wrote yourself.

Typical compensation: ₹60 LPA+ (India) · $220K+ (AU) · $250K+ (US)

6

What AI Changes — And What It Does Not

The craft is evolving, but the fundamentals are not going away

AI tools can write Terraform files, generate Dockerfiles, and suggest monitoring configurations. This changes the speed at which you can produce infrastructure code. It does not change the judgement required to decide what to build, or the discipline required to operate it safely.

AI Makes This Faster

  • Generating boilerplate Terraform, Dockerfiles, pipeline configs
  • Writing monitoring queries and alert rules
  • Drafting runbooks and documentation
  • Analysing logs for patterns
  • Suggesting security configurations

AI Does Not Replace This

  • Understanding why a system is designed a certain way
  • Debugging a production incident at 2 AM under pressure
  • Making risk trade-off decisions with incomplete information
  • Designing for failure modes nobody has seen yet
  • Building trust with teams that depend on your platform
  • Knowing when the AI-generated config is subtly wrong

This Is a Craft Built on Trust

When companies give you access to their production infrastructure, they are trusting you with their business. Customer data, revenue systems, compliance platforms, financial transactions — all of it runs on the infrastructure you manage.

The engineers who grow the most in this field are not just technically skilled — they are reliable, disciplined, and communicate clearly under pressure. They automate what can be automated, document what they learn, and turn every failure into an improvement.

Whether you are just starting or have been in the field for years, the craft keeps evolving. That is what makes it worth pursuing.

Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.

This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.

We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.

Reach out: sumit@getpostlabs.io