What a Platform Engineer, SRE, or Cloud Engineer Actually Knows
This is not about getting hired. This is about what the craft looks like from the inside — the knowledge that builds over years, the depth that separates someone who manages infrastructure from someone who truly understands it, and how the role evolves as experience accumulates.
Three Titles, One Craft
Platform Engineer, Site Reliability Engineer (SRE), Cloud Engineer — these titles appear on different job postings but describe overlapping skill sets. The boundaries are blurry and vary by company. What they share is this: they are all responsible for the systems that production applications run on — the infrastructure, the reliability, the automation, the security, and the operational discipline that keeps everything alive.
What follows is our perspective on what this role actually looks like — the layers of knowledge that build over time, the depth at each stage, and what changes as you grow from someone learning the basics to someone designing systems that thousands of people depend on.
The Knowledge Layers — What You Actually Need to Understand
Ranked by how frequently they appear in real production work
After studying how companies build and operate production platforms — from compliance systems to financial services to SaaS products — certain skills appear consistently. Not because someone wrote them in a job description, but because production systems demand them.
Understanding how cloud services fit together — VPC, EC2/ECS, RDS, S3, IAM, Lambda, CloudWatch. Not just knowing what each service does, but understanding why you would choose one over another and how they interact in a real architecture.
Defining infrastructure in version-controlled code. Modules, state management, remote backends, plan/apply workflows. The ability to look at a Terraform file and understand the entire infrastructure it creates.
Building container images, writing Dockerfiles, understanding orchestration — services, deployments, health probes, resource limits. Knowing why a container keeps restarting and how to diagnose it.
Designing build-test-deploy pipelines. Understanding each stage, automated testing gates, deployment strategies (rolling, blue-green, canary), and what rollback actually means in practice.
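An automated gate in a canary rollout can be as simple as comparing the canary's error rate with the baseline's. The sketch below is one hedged way to express that decision. The function name, the `max_ratio` and `min_requests` thresholds, and the 0.1% floor are illustrative assumptions, not values from any particular tool.

```python
def canary_decision(baseline_errors, baseline_total,
                    canary_errors, canary_total,
                    max_ratio=2.0, min_requests=100):
    """Return 'wait', 'promote', or 'rollback' for a canary release.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    # Too little canary traffic to judge: keep observing.
    if canary_total < min_requests:
        return "wait"

    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total

    # Tolerate up to max_ratio times the baseline error rate, with a small
    # absolute floor so a zero-error baseline does not fail every canary.
    allowed_rate = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed_rate else "rollback"
```

In practice, "rollback" here means shifting traffic back to the previous version, which is only safe if that version can still run against the current data and schema.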
Not software engineering — operational automation. Writing scripts that rotate secrets, clean up stale resources, generate reports, and automate repetitive infrastructure tasks.
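As an illustration of this kind of operational script, here is a minimal sketch that selects stale resources from an inventory. It assumes the inventory has already been fetched as a list of dictionaries. The field names (`id`, `last_used`, `tags`) and the `keep` tag are hypothetical, and the function only reports candidates rather than deleting anything.

```python
from datetime import datetime, timedelta, timezone


def find_stale(resources, now, max_age_days=30, keep_tag="keep"):
    """Return ids of resources unused for longer than max_age_days.

    Resources carrying the (hypothetical) keep_tag are always skipped.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        resource["id"]
        for resource in resources
        if resource["last_used"] < cutoff
        and not resource.get("tags", {}).get(keep_tag)
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    inventory = [
        {"id": "vol-old", "last_used": now - timedelta(days=90)},
        {"id": "vol-pinned", "last_used": now - timedelta(days=90),
         "tags": {"keep": "true"}},
    ]
    print(find_stale(inventory, now))
```

Keeping selection separate from deletion makes a dry run the default, which is usually the safer design for cleanup automation.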
Metrics, logs, traces. Designing dashboards that show system health at a glance. Writing alert rules that wake you up when it matters — and not when it does not. Understanding SLIs, SLOs, and error budgets.
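The relationship between an SLI, an SLO, and an error budget is plain arithmetic, and a small sketch can make it concrete. The function names below are illustrative, and the example assumes a request-based availability SLO.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise error-budget use for a request-based SLO.

    slo_target is a fraction, for example 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    sli = 1 - failed_requests / total_requests          # measured success ratio
    allowed_failures = (1 - slo_target) * total_requests  # the error budget
    consumed = failed_requests / allowed_failures
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime a time-based SLO permits over the window, in minutes."""
    return (1 - slo_target) * window_days * 24 * 60
```

With a 99.9% target over one million requests, about 1,000 failures are allowed, so 250 failures consume roughly a quarter of the budget. The same target over 30 days permits about 43 minutes of downtime.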
IAM policies, secrets management, encryption at rest and in transit, security groups, WAF configuration, vulnerability scanning. Defence in depth — not as a buzzword, but as a design principle.
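Least privilege is easier to enforce when it is checked mechanically. As a hedged sketch, the function below scans an IAM-style policy document (the standard `Statement` / `Effect` / `Action` / `Resource` JSON shape) and flags `Allow` statements that use wildcards. The function name and the flagging rule are assumptions made for illustration. Real policy analysis also has to consider conditions, `NotAction`, and resource ARNs containing partial wildcards.

```python
def broad_statements(policy):
    """Return indexes of Allow statements with wildcard actions or resources."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    flagged = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        resources = statement.get("Resource", [])
        # Action and Resource may each be a string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources

        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = "*" in resources
        if wildcard_action or wildcard_resource:
            flagged.append(index)
    return flagged
```

A check like this fits naturally into a CI pipeline as a review gate before a policy change is applied.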
On-call practices, runbooks, post-incident reviews. Designing systems that degrade gracefully under failure. Circuit breakers, retry logic, dead letter queues. Understanding that failure is not a bug — it is a certainty to design for.
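Retry logic is the simplest of these patterns to show in code. The following is a minimal sketch of retries with capped exponential backoff and full jitter. The function name and default values are illustrative, and `sleep` and `rng` are injectable only so the behaviour can be tested without waiting.

```python
import random
import time


def retry_with_backoff(operation, attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on any exception with jittered backoff.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            # The delay doubles each attempt, capped at max_delay. Sleeping a
            # random fraction of it ("full jitter") spreads out retries from
            # many clients so they do not all hit the service at once.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * delay)
```

Retries are only safe for idempotent operations. For anything else, a dead letter queue or a circuit breaker is usually the better tool, because blind retries against a struggling dependency can make an outage worse.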
TCP/IP, DNS, VPC design, subnets, routing, security groups, NACLs. When traffic is not flowing, networking knowledge is what separates a 5-minute fix from a 5-hour investigation.
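Subnet planning is one place where a few lines of code replace error-prone mental arithmetic. This sketch uses Python's standard `ipaddress` module to carve a VPC CIDR block into equally sized public and private subnets per availability zone. The tier names, zone labels, and the `/24` default are assumptions for illustration.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones, new_prefix=24):
    """Assign one public and one private subnet per zone from vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of equal-size blocks
    plan = {}
    for tier in ("public", "private"):
        for zone in zones:
            plan[f"{tier}-{zone}"] = str(next(blocks))
    return plan


def in_subnet(address, cidr):
    """True if the IP address falls inside the given CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    for name, cidr in plan_subnets("10.0.0.0/16", ["a", "b"]).items():
        print(name, cidr)
```

The `in_subnet` helper answers a common first debugging question when traffic is not flowing: which subnet, and therefore which route table and NACL, an address actually belongs to.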
Connection pooling, replication, failover, backup/restore, migration strategies. Not writing SQL — understanding how databases behave under load and what happens when the primary goes down.
What Changes With Experience — The Same Problem, Different Depth
How the same skill evolves as you grow
A junior platform engineer and a senior one both "know Terraform." But what that means in practice is wildly different. Here is what depth looks like across experience levels — not measured by years, but by the complexity of problems you can solve.
| Domain | Early (Learning the Tools) | Mid (Solving Real Problems) | Senior (Designing Systems) |
|---|---|---|---|
| Infrastructure | Can provision an EC2 instance. Understands what a VPC is. Follows tutorials. | Designs multi-AZ VPCs from scratch. Writes Terraform modules. Understands cost implications. | Defines infrastructure standards for the organisation. Reviews architecture proposals. Makes build-vs-buy decisions. |
| Deployment | Can trigger a CI/CD pipeline. Understands what "deploy" means. | Designs the pipeline. Implements rolling deployments with health checks and automatic rollback. | Defines deployment strategy across the org. Chooses between rolling, blue-green, canary based on risk profile. |
| Monitoring | Can read a Grafana dashboard. Understands what a metric is. | Designs dashboards. Writes meaningful alert rules. Implements SLOs. | Defines the observability strategy. Sets error budgets that balance reliability with velocity. |
| Incidents | Follows a runbook step by step. Can escalate. | Leads investigations. Writes runbooks. Conducts post-incident reviews. | Designs incident management culture. Builds systems that are resilient by design. |
| Security | Uses a secrets manager instead of hardcoding credentials. | Designs IAM policies with least privilege. Configures WAF. Runs vulnerability scans. | Defines security architecture. Reviews for Essential Eight, ISO 27001, and SOC 2 compliance. |
| Automation | Writes bash scripts that automate a single task. | Builds reusable automation across teams. Automates cert rotation, cleanup, reporting. | Designs the automation strategy. Builds internal developer platforms. |
The Pattern
Early career: you learn what the tools do. Mid career: you learn when to use them and why. Senior: you learn when not to use them and design the systems that make the right choice obvious for everyone else.
Platform Engineer vs SRE vs Cloud Engineer — What Is Actually Different?
Same family, different emphasis
Platform Engineer
Builds and maintains the internal platform that developers use to deploy and run their applications. Focuses on developer experience, self-service tooling, and infrastructure abstraction.
Thinks about: "How do I make it easy and safe for 50 developers to deploy without needing to understand VPC routing?"
Site Reliability Engineer (SRE)
Applies software engineering practices to operations problems. Defines SLOs, manages error budgets, designs for reliability, leads incident response. The role and the term originated at Google.
Thinks about: "What is the acceptable failure rate for this service, and what happens when we exceed it?"
Cloud Engineer
Designs, builds, and manages cloud infrastructure. Deep expertise in one or more cloud providers. Focuses on architecture, cost optimisation, security, and cloud-native patterns.
Thinks about: "Should this be ECS Fargate or EKS? What are the cost, operational, and security trade-offs?"
The Reality
In most organisations — especially startups and mid-size companies — one person does all three. The titles differ, but the person writing Terraform on Monday is the same person debugging a production incident on Tuesday and setting up Grafana dashboards on Wednesday.
Certifications as Knowledge Markers
Not proof you can do the job — proof you understand the domain
Certifications do not make you an engineer. But they force you to learn the breadth of a domain systematically, which is valuable when building foundational knowledge. Here is a progression that mirrors how understanding deepens.
Foundations
AWS Cloud Practitioner — Understand how cloud services fit together
AWS Solutions Architect Associate — Design production architectures
Infrastructure
HashiCorp Terraform Associate — Infrastructure as Code, state, modules
Security & Operations
HashiCorp Vault Associate — Secrets management, encryption, dynamic secrets
AWS SysOps Administrator or DevOps Engineer — Operational depth
Specialisation
CKA — Container orchestration depth
AWS Security Specialty — Cloud security architecture
Or CISSP / CompTIA Security+ if leaning into security
What the Progression Actually Looks Like
Not a career ladder — a widening of responsibility and judgement
Year 0-2: Learning the Craft
You are building foundational knowledge. You follow runbooks, you ask questions, you make mistakes in staging. You learn that production is not like your laptop. You discover that a 5-second database query in dev takes 45 seconds under real load.
What you know: How to provision infrastructure. How to read logs. How to follow a runbook. How to deploy code safely.
Typical compensation: ₹4-10 LPA (India) · $65-90K (AU) · $70-100K (US)
Year 2-5: Owning Problems
You stop following and start designing. You write the runbooks others follow. You lead incident investigations. When something breaks, people come to you — not because you know every answer, but because you know how to find it systematically.
What you know: Why this architecture was chosen over alternatives. How to debug across layers. How to design systems that fail gracefully.
Typical compensation: ₹15-35 LPA (India) · $110-160K (AU) · $130-180K (US)
Year 5-8: Shaping Systems
Your impact extends beyond your own work. You define standards that other teams follow. You review architecture proposals and spot the failure modes nobody else sees. You make decisions about which problems to solve with technology and which to solve with process.
What you know: When to build vs when to buy. How to balance reliability with development velocity. How to communicate technical risk to non-technical stakeholders.
Typical compensation: ₹35-60 LPA (India) · $160-220K (AU) · $180-280K (US)
Year 8+: Defining Culture
The systems you design outlast your tenure. You shape how the organisation thinks about reliability, security, and operational discipline. The most senior platform engineers do not write the most code — they create the environment where everyone else can do their best work safely.
What you know: How to build a culture of reliability. How to manage risk across an organisation. When to break the rules you wrote yourself.
Typical compensation: ₹60 LPA+ (India) · $220K+ (AU) · $250K+ (US)
What AI Changes — And What It Does Not
The craft is evolving, but the fundamentals are not going away
AI tools can write Terraform files, generate Dockerfiles, and suggest monitoring configurations. This changes the speed at which you can produce infrastructure code. It does not change the judgement required to decide what to build, or the discipline required to operate it safely.
AI Makes This Faster
- Generating boilerplate Terraform, Dockerfiles, pipeline configs
- Writing monitoring queries and alert rules
- Drafting runbooks and documentation
- Analysing logs for patterns
- Suggesting security configurations
AI Does Not Replace This
- Understanding why a system is designed a certain way
- Debugging a production incident at 2 AM under pressure
- Making risk trade-off decisions with incomplete information
- Designing for failure modes nobody has seen yet
- Building trust with teams that depend on your platform
- Knowing when the AI-generated config is subtly wrong
This Is a Craft Built on Trust
When companies give you access to their production infrastructure, they are trusting you with their business. Customer data, revenue systems, compliance platforms, financial transactions — all of it runs on the infrastructure you manage.
The engineers who grow the most in this field are not just technically skilled — they are reliable, disciplined, and communicate clearly under pressure. They automate what can be automated, document what they learn, and turn every failure into an improvement.
Whether you are just starting or have been in the field for years, the craft keeps evolving. That is what makes it worth pursuing.
This is Part 5 of 9 in the Platform Engineering series.
Part 5: What a Platform Engineer, SRE, or Cloud Engineer Actually Knows ← You are here
Part 6: Networking Fundamentals for Platform Engineers
Part 7: Cybersecurity Fundamentals — What It Means and Why It Matters
Part 8: Cybersecurity in Practice — How Production Platforms Are Protected
Part 9: Cybersecurity Careers — What the Industry Actually Does
Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.
This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.
We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.
Reach out: sumit@getpostlabs.io