Fault Tolerance and Incident Management — Designing for Failure
Things will break. The database will spike. The external API will time out. A container will run out of memory. Platform engineering is not about preventing failure — it is about designing systems that survive failure gracefully.
Why This Matters
Consider a compliance platform where hundreds of firms rely on your system to submit government-mandated reports. The government identity verification API (DVS) goes down for 20 minutes. What happens? If your system was not designed for failure, every compliance case in progress fails. Users see errors. Reports cannot be submitted. Firms miss deadlines.
If the system was designed for failure, the screening service detects the timeout, queues the verification request, flags the case as "pending external verification", notifies the user, and retries automatically when DVS comes back. No data lost. No silent failures. No missed deadlines. That is fault tolerance.
The Seven Patterns of Fault Tolerance
Every production system uses some combination of these.
These are not theoretical concepts — they are actual code patterns and infrastructure configurations that platform engineers implement every day. If you understand these seven patterns, you understand 90% of how production systems survive failures.
Pattern 1: Retry with Exponential Backoff
When a request fails, you do not just try again immediately. You wait, then try again. If it fails again, you wait longer. Then longer. This prevents overwhelming a service that is already struggling.
Attempt 1: Fails → Wait 1 second
Attempt 2: Fails → Wait 2 seconds
Attempt 3: Fails → Wait 4 seconds
Attempt 4: Fails → Wait 8 seconds + random jitter (0-1s)
Max retries reached → Send to Dead Letter Queue
The jitter is critical: If 1,000 requests all fail at the same time and all retry exactly 1 second later, you get a "thundering herd" — 1,000 simultaneous retries that overwhelm the recovering service. Random jitter spreads retries across the wait window.
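The retry ladder above can be sketched in a few lines. This is a minimal illustration, not a production implementation — the function name and defaults are ours, and real systems usually reach for a library such as tenacity rather than hand-rolling this:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, max_jitter=1.0):
    """Retry `operation`, doubling the wait after each failure and adding
    random jitter so simultaneous failures do not retry in lockstep."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # max retries reached: caller routes the request to the DLQ
            # attempt 1 -> ~1s, attempt 2 -> ~2s, attempt 3 -> ~4s, plus jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, max_jitter)
            time.sleep(delay)
```

Note that the jitter is added on every wait, not just the last one — that is what spreads a thundering herd across the window.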
Pattern 2: Circuit Breaker
Named after the electrical circuit breaker in your house. When a downstream service is failing, stop sending requests to it. This prevents cascading failures where one broken service brings down everything that depends on it.
CLOSED
Normal operation. Requests pass through. Failures are counted.
OPEN
Too many failures. Requests are rejected immediately (fast-fail). No load on the broken service.
HALF-OPEN
After a timeout, allow one test request through. If it succeeds → CLOSED. If it fails → OPEN again.
Real example: The screening service calls the government DVS API. If DVS returns errors 5 times in 30 seconds, the circuit breaker opens. For the next 60 seconds, all DVS calls immediately return a "service unavailable" response, and the compliance case is flagged for manual review instead of timing out and wasting resources.
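The three states can be captured in a small class. This sketch simplifies the example above by counting consecutive failures rather than failures per time window ("5 errors in 30 seconds"); production code would use a library with windowed counters, but the state machine is the same:

```python
import time

class CircuitBreaker:
    """Minimal sketch: opens after `failure_threshold` consecutive failures,
    fast-fails while OPEN, and allows one test call after `reset_timeout`."""

    def __init__(self, failure_threshold=5, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # OPEN: reject immediately, no load on the broken service
                raise RuntimeError("circuit OPEN: service unavailable")
            # HALF-OPEN: fall through and let one test request pass
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

In the DVS example, the `RuntimeError` branch is where the compliance case would be flagged for manual review instead of timing out.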
Pattern 3: Dead Letter Queue (DLQ)
When a message from a queue cannot be processed after multiple retries, it is moved to a special "dead letter queue" instead of being discarded. This ensures no message is silently lost, and engineers can investigate and reprocess failed messages later.
→ Main Queue: "Calculate risk score for case-4521"
→ Processing attempt 1: FAIL (screening service timeout)
→ Processing attempt 2: FAIL (screening service timeout)
→ Processing attempt 3: FAIL (screening service timeout)
→ Moved to Dead Letter Queue + Alert fired to on-call engineer
In compliance systems, this is non-negotiable
A silently dropped message could mean a sanctions screening check was never performed, or a suspicious matter report was never filed. Regulatory consequences can be severe. Dead letter queues ensure nothing disappears silently.
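The flow above can be sketched with an in-memory list standing in for the DLQ (in AWS SQS you would configure this declaratively with a redrive policy and `maxReceiveCount` instead of writing it by hand):

```python
def process_with_dlq(message, handler, max_attempts=3, dead_letters=None):
    """Try `handler` up to `max_attempts` times; on final failure, park the
    message in the dead letter queue rather than discarding it."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(message)
        except Exception as exc:
            if attempt == max_attempts:
                dead_letters.append({"message": message, "error": str(exc)})
                # in production: fire an alert to the on-call engineer here
                return None
```

The key property is that the failure path always ends in a durable record plus an alert — never in a silent drop.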
Pattern 4: Bulkheads
Named after the watertight compartments in a ship. If one compartment floods, the others stay dry. In software: isolate different parts of the system so that a failure in one does not bring down the others.
Examples: Separate thread pools per external API call (DVS pool, PEP screening pool, DFAT pool). If the DVS thread pool is exhausted, PEP screening still works. Separate database connection pools per service. Per-tenant rate limiting so one tenant's heavy usage does not affect others.
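The per-dependency thread pool example can be sketched like this (the pool names and sizes are illustrative, matching the DVS/PEP/DFAT example above):

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per external dependency. Exhausting the DVS pool
# cannot starve PEP screening or DFAT checks: each has its own bulkhead.
POOLS = {
    "dvs": ThreadPoolExecutor(max_workers=4, thread_name_prefix="dvs"),
    "pep": ThreadPoolExecutor(max_workers=4, thread_name_prefix="pep"),
    "dfat": ThreadPoolExecutor(max_workers=2, thread_name_prefix="dfat"),
}

def call_external(pool_name, operation, *args):
    """Submit work to the dependency's own bulkhead pool; returns a Future."""
    return POOLS[pool_name].submit(operation, *args)
```

A slow DVS outage fills only the `dvs` pool's queue; callers of the other pools are unaffected.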
Pattern 5: Multi-AZ Failover
Database primary in AZ-a, standby in AZ-b. If AZ-a goes down, automatic failover to AZ-b. Applications reconnect within seconds. This is how AWS RDS Multi-AZ and DocumentDB replica sets work.
Pattern 6: Idempotency
Every operation can be safely retried without side effects. Creating the same compliance case twice with the same idempotency key results in one case, not two. Essential when retries and message redelivery are involved.
Pattern 7: Graceful Degradation
When a non-critical service is down, the core functionality still works. If the AI review service is unavailable, manual compliance reviews still function. If notifications fail, cases still get processed. Degrade, do not crash.
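The shape of graceful degradation in code is simple: wrap each optional dependency so its failure changes the result, never aborts it. A sketch using the AI review and notification examples above (function names are illustrative):

```python
def submit_case(case, ai_review, notify):
    """Core processing succeeds even when optional services are down."""
    result = {"case": case, "ai_review": None, "notified": False}
    try:
        result["ai_review"] = ai_review(case)   # optional: AI review service
    except Exception:
        result["needs_manual_review"] = True    # degrade to manual review
    try:
        notify(case)                            # optional: notifications
        result["notified"] = True
    except Exception:
        pass                                    # case is still processed
    return result
```

The core path (the case itself) has no failure branch that discards it; only the optional extras can be lost.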
Incident Management — What Happens When Things Break
A structured approach to chaos
It is 2:17 AM. Your phone buzzes. PagerDuty alert: "P1 — CDD Service Error Rate Exceeds 15%." You are the on-call engineer this week. What do you do? Without a structured incident response process, the answer is "panic." With one, it is a well-rehearsed sequence of steps.
Step 1: Acknowledge and Assess
0-5 minutes: Acknowledge the alert (so others know someone is on it). Open the monitoring dashboard. Check: Is this affecting users? What is the blast radius? Is it one service or multiple? Is it getting worse or stable? Check the error rate, latency, and queue depth dashboards.
Step 2: Triage and Classify
5-10 minutes: Classify severity. P1: Multiple services down, data at risk, users completely blocked. P2: One service degraded, users partially affected. P3: Performance issue, no data risk. Communicate severity to the team and stakeholders. If P1, escalate immediately — do not try to be a hero alone.
Step 3: Investigate and Contain
10-30 minutes: Follow the investigation path: Was there a recent deployment? (Check CI/CD pipeline history). Check logs for the failing service (filter by ERROR level, last 30 minutes). Check distributed traces for failing requests. Check database metrics (connections, slow queries). Check external API health (DVS, screening services). Check infrastructure (ECS task health, ALB targets).
Step 4: Mitigate
30-60 minutes: Fix the immediate problem. If it was a bad deployment → roll back to the last known good version. If a database query is causing load → kill the query and flag it for follow-up investigation. If an external API is down → activate the circuit breaker and queue affected requests. If a container is OOM → increase memory limits and restart. The goal is to stop the bleeding, not to find the root cause.
Step 5: Verify Recovery
60-90 minutes: Confirm that error rates are back to normal. Check that queued messages are being processed (DLQ is not growing). Verify that the rollback or fix is stable for at least 15 minutes. Update the incident channel: "Service restored at 3:42 AM. Monitoring for stability."
Step 6: Post-Incident Review (Blameless)
Next business day: Write a post-incident report (also called a "post-mortem"). What happened? What was the timeline? What was the root cause? What mitigation worked? What can we do to prevent this from happening again? This must be blameless — the goal is learning, not blame. "The deployment process allowed untested code to reach production" is useful. "John broke production" is not.
What a Post-Incident Report Looks Like
A real example from a hypothetical compliance platform incident
Incident Report: CDD Service Degradation — 20 Feb 2026
Severity: P2 — Service Degraded
Duration: 47 minutes (02:17 — 03:04 AEST)
Impact: 23% of CDD case submissions returned 504 timeout errors. Approximately 180 users affected across 12 tenants.
Timeline
02:17 — CloudWatch alarm: CDD Service error rate > 10%
02:19 — On-call engineer acknowledges alert
02:22 — Dashboard confirms: RDS active connections at 48/50
02:25 — Root cause identified: A new batch reporting query (deployed 18:00 previous day) was holding long-running connections
02:32 — Mitigation: Killed the batch query. Identified it was missing a WHERE clause on tenant_id
02:38 — Connection pool recovered to 12/50
03:04 — All metrics confirmed normal for 25 minutes. Incident closed.
Root Cause
A batch reporting query deployed in the afternoon release was missing a tenant_id filter, causing it to scan the entire cases table (2.1M rows) instead of a single tenant's data (~4,000 rows). This held 36 database connections open for 8+ minutes each, exhausting the connection pool and causing new requests to queue and timeout.
Action Items
1. Add query execution time limits (statement_timeout = 30s) to RDS configuration — Owner: Platform team, Due: 22 Feb
2. Add mandatory tenant_id filter check to the code review checklist — Owner: Tech lead, Due: 21 Feb
3. Add CloudWatch alarm for "active connections > 40" with P2 severity — Owner: Platform team, Due: 21 Feb
4. Run the batch report on the read replica, not the primary database — Owner: Backend team, Due: 28 Feb
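Action item 1 can be sketched as a one-liner; note that the database name here is hypothetical, and on AWS RDS this setting is normally applied through a parameter group rather than `ALTER DATABASE`:

```sql
-- Cap any single statement at 30 seconds so one runaway query
-- cannot hold connections open indefinitely (PostgreSQL).
ALTER DATABASE compliance_prod SET statement_timeout = '30s';
```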
Why Post-Incident Reports Matter for Your Career
Writing clear post-incident reports is one of the most valued skills in platform engineering. It demonstrates that you can think systematically under pressure, communicate technical issues to non-technical stakeholders, and turn failures into improvements. If you are building a portfolio, write practice post-incident reports for hypothetical scenarios. It shows maturity that raw coding skills do not.
Runbooks — Your 2 AM Survival Guide
Pre-written instructions for when your brain is at 50% capacity
A runbook is a step-by-step document that tells the on-call engineer exactly what to do when a specific alert fires. At 2 AM, you are tired, stressed, and possibly unfamiliar with the service that is failing. A good runbook turns a 60-minute investigation into a 10-minute resolution.
RUNBOOK: CDD Service — High Error Rate Alert
=============================================
ALERT: cdd-service-error-rate-high
TRIGGER: Error rate > 5% for 5 consecutive minutes
SEVERITY: P2
STEP 1: Check if this correlates with a recent deployment
$ aws ecs describe-services --cluster prod --services cdd-service
→ Look at "deployments" — was there a rollout in the last 2 hours?
→ If yes: ROLLBACK first, investigate later
$ aws ecs update-service --cluster prod --service cdd-service \
--task-definition cdd-service:PREVIOUS_VERSION
STEP 2: Check database connections
→ CloudWatch → RDS → DatabaseConnections
→ If connections > 40/50: likely connection pool exhaustion
→ Check for long-running queries:
SELECT pid, now()-pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity WHERE state != 'idle'
ORDER BY duration DESC LIMIT 10;
STEP 3: Check external API health
→ CloudWatch → Custom Metrics → dvs-api-response-time
→ If DVS latency > 5s: circuit breaker should be open
→ Verify: check cdd-service logs for "CircuitBreaker OPEN"
STEP 4: Check message queue
→ SQS Console → cdd-risk-assessment queue
→ If ApproximateNumberOfMessages > 500: consumer is behind
→ Check DLQ: cdd-risk-assessment-dlq for failed messages
ESCALATION: If not resolved in 30 minutes, page the backend lead
On-Call Culture — What Nobody Tells You
The human side of incident management
Being on-call is a defining aspect of platform engineering. Here is what it actually looks like and what healthy on-call practices involve.
Healthy On-Call Practices
• Rotations: 1 week on, 3-4 weeks off minimum
• Compensation: extra pay or time off for on-call shifts
• Runbooks for every alert — nobody should face an unknown alarm
• Blameless post-incident reviews focused on system improvement
• If you get paged more than 2-3 times per week, the system needs fixing, not more engineers
• Clear escalation paths — you are never truly alone
Red Flags to Watch For
• Same person always on-call (no rotation)
• No runbooks — "just figure it out"
• Alerts firing constantly with no action taken to reduce them
• Blame culture after incidents
• No compensation for after-hours work
• "Hero culture" that rewards firefighting over prevention
The Mindset Shift
In college, failure means your code does not compile or your test does not pass. In production, failure means real people cannot do their jobs, businesses lose money, and regulatory deadlines are missed. Platform engineering is about accepting that failure is inevitable and building systems — and processes — that handle it gracefully.
The engineers who thrive in this field are not the ones who never make mistakes. They are the ones who build systems that survive mistakes, respond calmly under pressure, and turn every incident into an improvement.
Part 3: Fault Tolerance and Incident Management ← You are here
Part 4: DevOps, Automation, and Production Discipline
Part 5: What a Platform Engineer, SRE, or Cloud Engineer Actually Knows
Part 6: Networking Fundamentals
Part 7: Cybersecurity Fundamentals
Part 8: Cybersecurity in Practice
Part 9: Cybersecurity Careers
Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.
This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.
We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.
Reach out: sumit@getpostlabs.io