Fault Tolerance and Incident Management — Designing for Failure
Things will break. The database will spike. The external API will time out. A container will run out of memory. Platform engineering is not about preventing failure — it is about designing systems that survive failure gracefully.
Why This Matters
Consider a compliance platform where hundreds of firms rely on your system to submit government-mandated reports. The government identity verification API (DVS) goes down for 20 minutes. What happens? If your system was not designed for failure, every compliance case in progress fails. Users see errors. Reports cannot be submitted. Firms miss deadlines.
If the system was designed for failure, the screening service detects the timeout, queues the verification request, flags the case as "pending external verification", notifies the user, and retries automatically when DVS comes back. No data lost. No silent failures. No missed deadlines. That is fault tolerance.
The Seven Patterns of Fault Tolerance
Every production system uses some combination of these.
These are not theoretical concepts — they are actual code patterns and infrastructure configurations that platform engineers implement every day. If you understand these seven patterns, you understand 90% of how production systems survive failures.
Pattern 1: Retry with Exponential Backoff
When a request fails, you do not just try again immediately. You wait, then try again. If it fails again, you wait longer. Then longer. This prevents overwhelming a service that is already struggling.
Attempt 1: Fails → Wait 1 second
Attempt 2: Fails → Wait 2 seconds
Attempt 3: Fails → Wait 4 seconds
Attempt 4: Fails → Wait 8 seconds + random jitter (0-1s)
Max retries reached → Send to Dead Letter Queue
The jitter is critical: If 1,000 requests all fail at the same time and all retry exactly 1 second later, you get a "thundering herd" — 1,000 simultaneous retries that overwhelm the recovering service. Random jitter spreads retries across the wait window.
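The retry ladder above can be sketched in a few lines. This is a minimal illustration, not a production implementation — the function name and defaults are ours, and real systems usually reach for a library such as tenacity rather than hand-rolling this:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, max_jitter=1.0):
    """Retry `operation`, doubling the wait after each failure and adding
    random jitter so simultaneous failures do not retry in lockstep."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # max retries reached: caller routes the request to the DLQ
            # attempt 1 -> ~1s, attempt 2 -> ~2s, attempt 3 -> ~4s, plus jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, max_jitter)
            time.sleep(delay)
```

Note that the jitter is added on every wait, not just the last one — that is what spreads a thundering herd across the window.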
Pattern 2: Circuit Breaker
Named after the electrical circuit breaker in your house. When a downstream service is failing, stop sending requests to it. This prevents cascading failures where one broken service brings down everything that depends on it.
CLOSED
Normal operation. Requests pass through. Failures are counted.
OPEN
Too many failures. Requests are rejected immediately (fast-fail). No load on the broken service.
HALF-OPEN
After a timeout, allow one test request through. If it succeeds → CLOSED. If it fails → OPEN again.
Real example: The screening service calls the government DVS API. If DVS returns errors 5 times in 30 seconds, the circuit breaker opens. For the next 60 seconds, all DVS calls immediately return a "service unavailable" response, and the compliance case is flagged for manual review instead of timing out and wasting resources.
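The three states can be captured in a small class. This sketch simplifies the example above by counting consecutive failures rather than failures per time window ("5 errors in 30 seconds"); production code would use a library with windowed counters, but the state machine is the same:

```python
import time

class CircuitBreaker:
    """Minimal sketch: opens after `failure_threshold` consecutive failures,
    fast-fails while OPEN, and allows one test call after `reset_timeout`."""

    def __init__(self, failure_threshold=5, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # OPEN: reject immediately, no load on the broken service
                raise RuntimeError("circuit OPEN: service unavailable")
            # HALF-OPEN: fall through and let one test request pass
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

In the DVS example, the `RuntimeError` branch is where the compliance case would be flagged for manual review instead of timing out.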
Pattern 3: Dead Letter Queue (DLQ)
When a message from a queue cannot be processed after multiple retries, it is moved to a special "dead letter queue" instead of being discarded. This ensures no message is silently lost, and engineers can investigate and reprocess failed messages later.
→ Main Queue: "Calculate risk score for case-4521"
→ Processing attempt 1: FAIL (screening service timeout)
→ Processing attempt 2: FAIL (screening service timeout)
→ Processing attempt 3: FAIL (screening service timeout)
→ Moved to Dead Letter Queue + Alert fired to on-call engineer
In compliance systems, this is non-negotiable
A silently dropped message could mean a sanctions screening check was never performed, or a suspicious matter report was never filed. Regulatory consequences can be severe. Dead letter queues ensure nothing disappears silently.
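The flow above can be sketched with an in-memory list standing in for the DLQ (in AWS SQS you would configure this declaratively with a redrive policy and `maxReceiveCount` instead of writing it by hand):

```python
def process_with_dlq(message, handler, max_attempts=3, dead_letters=None):
    """Try `handler` up to `max_attempts` times; on final failure, park the
    message in the dead letter queue rather than discarding it."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception as exc:
            if attempt == max_attempts:
                dead_letters.append({"message": message, "error": str(exc)})
                # in production: fire an alert to the on-call engineer here
                return None
```

The key property is that the failure path always ends in a durable record plus an alert — never in a silent drop.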
Pattern 4: Bulkheads
Named after the watertight compartments in a ship. If one compartment floods, the others stay dry. In software: isolate different parts of the system so that a failure in one does not bring down the others.
Examples: Separate thread pools per external API call (DVS pool, PEP screening pool, DFAT pool). If the DVS thread pool is exhausted, PEP screening still works. Separate database connection pools per service. Per-tenant rate limiting so one tenant's heavy usage does not affect others.
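The per-dependency thread pool example can be sketched like this (the pool names and sizes are illustrative, matching the DVS/PEP/DFAT example above):

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per external dependency. Exhausting the DVS pool
# cannot starve PEP screening or DFAT checks: each has its own bulkhead.
POOLS = {
    "dvs": ThreadPoolExecutor(max_workers=4, thread_name_prefix="dvs"),
    "pep": ThreadPoolExecutor(max_workers=4, thread_name_prefix="pep"),
    "dfat": ThreadPoolExecutor(max_workers=2, thread_name_prefix="dfat"),
}

def call_external(pool_name, operation, *args):
    """Submit work to the dependency's own bulkhead pool; returns a Future."""
    return POOLS[pool_name].submit(operation, *args)
```

A slow DVS outage fills only the `dvs` pool's queue; callers of the other pools are unaffected.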
Pattern 5: Multi-AZ Failover
Database primary in AZ-a, standby in AZ-b. If AZ-a goes down, automatic failover to AZ-b. Applications reconnect within seconds. This is how AWS RDS Multi-AZ and DocumentDB replica sets work.
Pattern 6: Idempotency
Every operation can be safely retried without side effects. Creating the same compliance case twice with the same idempotency key results in one case, not two. Essential when retries and message redelivery are involved.
Pattern 7: Graceful Degradation
When a non-critical service is down, the core functionality still works. If the AI review service is unavailable, manual compliance reviews still function. If notifications fail, cases still get processed. Degrade, do not crash.
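The shape of graceful degradation in code is simple: wrap each optional dependency so its failure changes the result, never aborts it. A sketch using the AI review and notification examples above (function names are illustrative):

```python
def submit_case(case, ai_review, notify):
    """Core processing succeeds even when optional services are down."""
    result = {"case": case, "ai_review": None, "notified": False}
    try:
        result["ai_review"] = ai_review(case)   # optional: AI review service
    except Exception:
        result["needs_manual_review"] = True    # degrade to manual review
    try:
        notify(case)                            # optional: notifications
        result["notified"] = True
    except Exception:
        pass                                    # case is still processed
    return result
```

The core path (the case itself) has no failure branch that discards it; only the optional extras can be lost.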
Incident Management — What Happens When Things Break
A structured approach to chaos
It is 2:17 AM. Your phone buzzes. PagerDuty alert: "P1 — CDD Service Error Rate Exceeds 15%." You are the on-call engineer this week. What do you do? Without a structured incident response process, the answer is "panic." With one, it is a well-rehearsed sequence of steps.
Step 1: Acknowledge and Assess
0-5 minutes: Acknowledge the alert (so others know someone is on it). Open the monitoring dashboard. Check: Is this affecting users? What is the blast radius? Is it one service or multiple? Is it getting worse or stable? Check the error rate, latency, and queue depth dashboards.
Step 2: Triage and Classify
5-10 minutes: Classify severity. P1: Multiple services down, data at risk, users completely blocked. P2: One service degraded, users partially affected. P3: Performance issue, no data risk. Communicate severity to the team and stakeholders. If P1, escalate immediately — do not try to be a hero alone.
Step 3: Investigate and Contain
10-30 minutes: Follow the investigation path: Was there a recent deployment? (Check CI/CD pipeline history). Check logs for the failing service (filter by ERROR level, last 30 minutes). Check distributed traces for failing requests. Check database metrics (connections, slow queries). Check external API health (DVS, screening services). Check infrastructure (ECS task health, ALB targets).
Step 4: Mitigate
30-60 minutes: Fix the immediate problem. If it was a bad deployment → roll back to the last known good version. If a database query is causing load → kill the query and flag it for follow-up investigation. If an external API is down → activate the circuit breaker and queue affected requests. If a container is OOM → increase memory limits and restart. The goal is to stop the bleeding, not to find the root cause.
Step 5: Verify Recovery
60-90 minutes: Confirm that error rates are back to normal. Check that queued messages are being processed (DLQ is not growing). Verify that the rollback or fix is stable for at least 15 minutes. Update the incident channel: "Service restored at 3:42 AM. Monitoring for stability."
Step 6: Post-Incident Review (Blameless)
Next business day: Write a post-incident report (also called a "post-mortem"). What happened? What was the timeline? What was the root cause? What mitigation worked? What can we do to prevent this from happening again? This must be blameless — the goal is learning, not blame. "The deployment process allowed untested code to reach production" is useful. "John broke production" is not.
What a Post-Incident Report Looks Like
A real example from a hypothetical compliance platform incident
Incident Report: CDD Service Degradation — 20 Feb 2026
Severity: P2 — Service Degraded
Duration: 47 minutes (02:17 — 03:04 AEST)
Impact: 23% of CDD case submissions returned 504 timeout errors. Approximately 180 users affected across 12 tenants.
Timeline
02:17 — CloudWatch alarm: CDD Service error rate > 10%
02:19 — On-call engineer acknowledges alert
02:22 — Dashboard confirms: RDS active connections at 48/50
02:25 — Root cause identified: A new batch reporting query (deployed 18:00 previous day) was holding long-running connections
02:32 — Mitigation: Killed the batch query. Identified it was missing a WHERE clause on tenant_id
02:38 — Connection pool recovered to 12/50
03:04 — All metrics confirmed normal for 25 minutes. Incident closed.
Root Cause
A batch reporting query deployed in the afternoon release was missing a tenant_id filter, causing it to scan the entire cases table (2.1M rows) instead of a single tenant's data (~4,000 rows). This held 36 database connections open for 8+ minutes each, exhausting the connection pool and causing new requests to queue and timeout.
Action Items
1. Add query execution time limits (statement_timeout = 30s) to RDS configuration — Owner: Platform team, Due: 22 Feb
2. Add mandatory tenant_id filter check to the code review checklist — Owner: Tech lead, Due: 21 Feb
3. Add CloudWatch alarm for "active connections > 40" with P2 severity — Owner: Platform team, Due: 21 Feb
4. Run the batch report on the read replica, not the primary database — Owner: Backend team, Due: 28 Feb
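Action item 1 can be sketched as a one-liner; note that the database name here is hypothetical, and on AWS RDS this setting is normally applied through a parameter group rather than `ALTER DATABASE`:

```sql
-- Cap any single statement at 30 seconds so one runaway query
-- cannot hold connections open indefinitely (PostgreSQL).
ALTER DATABASE compliance_prod SET statement_timeout = '30s';
```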
Why Post-Incident Reports Matter for Your Career
Writing clear post-incident reports is one of the most valued skills in platform engineering. It demonstrates that you can think systematically under pressure, communicate technical issues to non-technical stakeholders, and turn failures into improvements. If you are building a portfolio, write practice post-incident reports for hypothetical scenarios. It shows maturity that raw coding skills do not.
Runbooks — Your 2 AM Survival Guide
Pre-written instructions for when your brain is at 50% capacity
A runbook is a step-by-step document that tells the on-call engineer exactly what to do when a specific alert fires. At 2 AM, you are tired, stressed, and possibly unfamiliar with the service that is failing. A good runbook turns a 60-minute investigation into a 10-minute resolution.
RUNBOOK: CDD Service — High Error Rate Alert
=============================================
ALERT: cdd-service-error-rate-high
TRIGGER: Error rate > 5% for 5 consecutive minutes
SEVERITY: P2
STEP 1: Check if this correlates with a recent deployment
$ aws ecs describe-services --cluster prod --services cdd-service
→ Look at "deployments" — was there a rollout in the last 2 hours?
→ If yes: ROLLBACK first, investigate later
$ aws ecs update-service --cluster prod --service cdd-service \
--task-definition cdd-service:PREVIOUS_VERSION
STEP 2: Check database connections
→ CloudWatch → RDS → DatabaseConnections
→ If connections > 40/50: likely connection pool exhaustion
→ Check for long-running queries:
SELECT pid, now()-pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity WHERE state != 'idle'
ORDER BY duration DESC LIMIT 10;
STEP 3: Check external API health
→ CloudWatch → Custom Metrics → dvs-api-response-time
→ If DVS latency > 5s: circuit breaker should be open
→ Verify: check cdd-service logs for "CircuitBreaker OPEN"
STEP 4: Check message queue
→ SQS Console → cdd-risk-assessment queue
→ If ApproximateNumberOfMessages > 500: consumer is behind
→ Check DLQ: cdd-risk-assessment-dlq for failed messages
ESCALATION: If not resolved in 30 minutes, page the backend lead
On-Call Culture — What Nobody Tells You
The human side of incident management
Being on-call is a defining aspect of platform engineering. Here is what it actually looks like and what healthy on-call practices involve.
Healthy On-Call Practices
• Rotations: 1 week on, 3-4 weeks off minimum
• Compensation: extra pay or time off for on-call shifts
• Runbooks for every alert — nobody should face an unknown alarm
• Blameless post-incident reviews focused on system improvement
• If you get paged more than 2-3 times per week, the system needs fixing, not more engineers
• Clear escalation paths — you are never truly alone
Red Flags to Watch For
• Same person always on-call (no rotation)
• No runbooks — "just figure it out"
• Alerts firing constantly with no action taken to reduce them
• Blame culture after incidents
• No compensation for after-hours work
• "Hero culture" that rewards firefighting over prevention
The Mindset Shift
In college, failure means your code does not compile or your test does not pass. In production, failure means real people cannot do their jobs, businesses lose money, and regulatory deadlines are missed. Platform engineering is about accepting that failure is inevitable and building systems — and processes — that handle it gracefully.
The engineers who thrive in this field are not the ones who never make mistakes. They are the ones who build systems that survive mistakes, respond calmly under pressure, and turn every incident into an improvement.
Part 3: Fault Tolerance and Incident Management ← You are here
Part 4: DevOps, Automation, and Production Discipline
Part 5: What a Platform Engineer, SRE, or Cloud Engineer Actually Knows
Part 6: Networking Fundamentals
Part 7: Cybersecurity Fundamentals
Part 8: Cybersecurity in Practice
Part 9: Cybersecurity Careers
Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.
This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.
We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.
Reach out: sumit@getpostlabs.io