Monitoring, Observability, and Why You Cannot Fix What You Cannot See
The three pillars of observability — metrics, logs, and traces — and how they work together to keep production systems alive. A practical guide for engineers who have never set up monitoring before.
The Scenario
It is Monday morning. A compliance officer at a law firm tries to submit a Suspicious Matter Report. The page loads slowly, then times out. She tries again. Same result. She calls support. Support checks the application — it seems fine from the frontend. But somewhere between the user clicking "Submit" and the report being saved to the database, something is broken.
Without observability, finding the problem is like debugging a car engine in a dark room with no torch. With proper observability, you can see every component, every connection, every request in real time. This article explains how to build that visibility from scratch.
The Three Pillars of Observability
Metrics, Logs, and Traces — each tells a different story
Observability is not the same as monitoring. Monitoring tells you what is broken (CPU is at 95%). Observability tells you why it is broken (the CDD service is running a poorly-optimised query that scans 2 million rows because the index is missing). To achieve observability, you need all three pillars working together.
Pillar 1: Metrics
Numbers over time
Metrics are numerical measurements collected at regular intervals. They answer the question: "How is the system performing right now, and how does that compare to 5 minutes ago, 1 hour ago, or last week?"
| Metric Type | Example | What It Tells You |
|---|---|---|
| CPU Utilisation | CDD Service at 78% | Service is under heavy load — may need to scale |
| Memory Usage | Audit Service at 1.2GB / 2GB | Approaching memory limit — risk of OOM kill |
| Request Latency (p95) | Screening API: 2.3 seconds | 5% of requests are unacceptably slow |
| Error Rate | 4.2% of requests returning 5xx | Something is failing inside the service |
| Queue Depth | SQS risk-assessment queue: 847 messages | Consumer cannot keep up with producer |
| Database Connections | RDS active connections: 42/50 | Connection pool almost exhausted |
| Cache Hit Rate | Redis workflow template cache: 94% | Most requests served from cache — good |
| Disk IOPS | RDS read IOPS: 3,200/sec | Heavy read load — check for missing indexes |
Tools: Prometheus (collection + storage), Grafana (dashboards), AWS CloudWatch (native AWS metrics), Datadog (managed platform).
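The p95 latency row in the table above is worth unpacking, because percentile metrics catch problems that averages hide. A minimal sketch in plain Python (the sample values are illustrative):

```python
import statistics

def p95_latency(samples_ms):
    """Return the 95th-percentile latency from raw per-request samples.

    Percentiles matter because averages hide tail latency: a handful of
    multi-second requests disappears into a healthy-looking mean.
    """
    # quantiles(n=20) splits the data into 20 buckets; the 19th cut point
    # (index 18) is the 95th percentile.
    return statistics.quantiles(samples_ms, n=20)[18]

# 95 fast requests and 5 slow ones: the mean looks fine, the p95 does not.
samples = [120] * 95 + [2300] * 5
print(f"mean = {statistics.mean(samples):.0f}ms, p95 = {p95_latency(samples):.0f}ms")
```

With this data the mean sits around 229ms while the p95 exceeds 2 seconds, which is exactly why dashboards track p95/p99 rather than averages.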
Pillar 2: Logs
Events with context
Logs are timestamped records of events that happen inside your system. Every request, error, warning, and significant state change produces a log entry. In production systems, logs must be structured (JSON format) and include a correlation ID so you can trace a single request across multiple services.
```
// Bad log (unstructured — impossible to search or parse)
"Error processing request for user 12345"

// Good log (structured JSON — searchable, parseable, traceable)
{
  "timestamp": "2026-02-20T09:14:32.847Z",
  "level": "ERROR",
  "service": "cdd-service",
  "traceId": "abc-123-def-456",
  "tenantId": "pacific-coast",
  "userId": "user-789",
  "action": "calculateRiskScore",
  "entityId": "case-4521",
  "error": "Connection timeout to screening-service",
  "durationMs": 5003,
  "retryAttempt": 3
}
```

Why structured logs matter: When you have 12 microservices each producing thousands of log lines per minute, you need to search logs efficiently. "Show me all ERROR logs from the CDD service for tenant pacific-coast in the last hour where durationMs exceeds 3000" — this query is only possible with structured logs.
Tools: ELK Stack (Elasticsearch + Logstash + Kibana), AWS CloudWatch Logs, Splunk, Loki + Grafana, OpenSearch.
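To make the idea concrete, here is a minimal sketch of a structured JSON formatter using only the Python standard library. The field names mirror the example log above; adapt them to whatever your log pipeline expects:

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one structured JSON line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Context fields passed via `extra=` land on the record object.
        for field in ("traceId", "tenantId", "durationMs"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

logger = logging.getLogger("cdd-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# The correlation ID is generated once at the edge (e.g. the API gateway)
# and passed with every downstream call, so all services log the same one.
trace_id = str(uuid.uuid4())
logger.error("Connection timeout to screening-service",
             extra={"traceId": trace_id, "tenantId": "pacific-coast",
                    "durationMs": 5003})
```

In practice you would attach the correlation ID from an incoming HTTP header rather than generating it in the service itself.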
Pillar 3: Distributed Traces
The request journey
A single user action — "Submit CDD Case" — might touch 6 different services: API Gateway → CDD Service → Screening Service → Risk Engine → Audit Service → Notification Service. A distributed trace follows that single request across all 6 services, showing you exactly how long each step took, where it slowed down, and where it failed.
Example Trace: Submit CDD Case (Total: 4,247ms)
→ Bottleneck identified: Screening Service (DVS external API call) took 3.2 seconds — 75% of total request time.
Tools: OpenTelemetry (standard), Jaeger, Zipkin, AWS X-Ray, Datadog APM.
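A toy sketch of the core idea in plain Python — timed, named spans belonging to one trace. Real systems use OpenTelemetry, which also propagates the trace context across service boundaries via HTTP headers; this only illustrates the mechanics:

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_ms) pairs for this trace

@contextmanager
def span(name):
    """Record how long the wrapped block of work took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

with span("submit-cdd-case"):            # root span for the whole request
    with span("screening-service"):      # child span: the slow external call
        time.sleep(0.05)                 # stand-in for the DVS API round trip
    with span("risk-engine"):
        time.sleep(0.01)

# Sort spans by duration to find the bottleneck, as a trace UI does visually.
for name, ms in sorted(spans, key=lambda s: -s[1]):
    print(f"{name:>20}: {ms:6.1f}ms")
```

Sorting by duration immediately surfaces the screening call as the bottleneck, which is precisely what the flame-graph view in Jaeger or X-Ray gives you at a glance.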
Health Checks — The Heartbeat of Your System
Liveness, readiness, and why the load balancer needs both
Every microservice exposes health check endpoints. The load balancer and container orchestrator use these to decide whether a container is alive and whether it should receive traffic. Getting these wrong is one of the most common causes of production incidents.
Liveness Check (/health/live)
"Is the process running and not deadlocked?"
Returns 200 OK if the application process is alive. Does not check dependencies (database, cache, queues).
If it fails: The orchestrator kills and restarts the container.
Readiness Check (/health/ready)
"Can this instance handle traffic right now?"
Checks database connection, cache connection, message queue connectivity. Returns 200 only if all dependencies are reachable.
If it fails: The load balancer stops sending traffic to this instance (but does not kill it).
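The distinction can be sketched in plain Python. The dependency-check functions here are placeholders; in a real service they would ping the database pool, Redis client, and queue connection, and the two functions would back the /health/live and /health/ready routes in your web framework:

```python
# Hypothetical dependency checks — replace with real connectivity probes.
def check_db():    return True
def check_cache(): return True
def check_queue(): return True

def liveness():
    """Answers only: is the process up and responsive?

    Deliberately checks no external dependencies, so a database hiccup
    can never trigger a mass container restart.
    """
    return 200, {"status": "alive"}

def readiness(checks=(check_db, check_cache, check_queue)):
    """Answers: can this instance serve traffic right now?

    Any unreachable dependency takes the instance out of rotation
    (the load balancer stops routing to it) without killing the process.
    """
    failures = [c.__name__ for c in checks if not c()]
    if failures:
        return 503, {"status": "not ready", "failing": failures}
    return 200, {"status": "ready"}
```

The key design point is the asymmetry: readiness may return 503 freely and recover on its own, while liveness returning non-200 is a death sentence for the container.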
Common Mistake That Causes Cascading Failures
If your liveness check depends on the database, and the database has a brief hiccup, the orchestrator will kill all your containers simultaneously (because they all fail the liveness check at the same time). Then they all try to restart and reconnect to the database simultaneously, creating a thundering herd that makes the database problem worse. Liveness checks should only verify the process itself, never external dependencies.
Alerting — Signal vs Noise
The difference between useful alerts and alert fatigue
Bad alerting is worse than no alerting. If your on-call engineer gets 50 alerts a night, they start ignoring all of them — including the one that actually matters. Good alerting follows a principle: every alert must be actionable. If an alert fires and the response is "do nothing and wait", it should not be an alert.
| Severity | Condition | Response | Channel |
|---|---|---|---|
| P1 — Critical | Service completely down, data loss risk, or compliance reporting blocked | Wake the on-call engineer immediately | PagerDuty / Phone call |
| P2 — High | Service degraded (>10% error rate), database approaching limits | Respond within 30 minutes | Slack alert + PagerDuty |
| P3 — Medium | Queue depth growing, cache hit rate dropping, latency increasing | Investigate during business hours | Slack channel |
| P4 — Low | Certificate expiring in 14 days, disk usage at 70% | Create ticket, fix this week | Email / Dashboard |
Golden Rule of Alerting
Alert on symptoms (users are affected), not causes (CPU is high). CPU at 90% is not necessarily a problem if response times are fine. But response times at 5 seconds even with low CPU — that is a real problem. Focus alerts on what the user experiences: error rates, latency, availability.
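As a sketch of what a symptom-based alert looks like in practice, here is a Prometheus alerting rule for the P2 condition from the table above (error rate above 10%). The metric name `http_requests_total` is illustrative; match it to your own instrumentation:

```yaml
groups:
  - name: symptom-alerts
    rules:
      # Alert on what users experience (error rate), not on causes (CPU).
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.10
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "More than 10% of requests are failing"
```

The `for: 5m` clause is what keeps this actionable: a 10-second error spike resolves itself before anyone is paged.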
What to Monitor in a Real System
A layer-by-layer monitoring checklist
- Load Balancer (ALB)
- Application Services
- Message Queues (SQS)
- Databases (RDS / DocumentDB)
- Cache (Redis)
- External APIs
SLIs, SLOs, and Error Budgets
How Google thinks about reliability — and how you should too
Google's SRE book introduced a framework that has become the industry standard for measuring reliability. It gives you a shared language between engineering and business for discussing "how reliable is reliable enough."
SLI — Service Level Indicator
A measurable metric of service quality. Example: "Percentage of API requests that complete in under 500ms."
Think of it as: the measurement.
SLO — Service Level Objective
A target value for an SLI. Example: "99.9% of API requests should complete in under 500ms over a 30-day window."
Think of it as: the target.
Error Budget
100% minus the SLO = the amount of failure you can tolerate. If your SLO is 99.9%, your error budget is 0.1% — that is about 43 minutes of downtime per month.
Think of it as: how much room you have to take risks (deployments, experiments) before reliability is threatened.
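The 43-minute figure comes straight from the arithmetic. A quick sketch:

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Downtime allowed per window for a given availability SLO."""
    budget_fraction = (100 - slo_percent) / 100
    return budget_fraction * window_days * 24 * 60

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.1f} minutes of downtime per 30 days")
```

A 99.9% SLO yields 43.2 minutes per 30-day window; each extra nine divides the budget by ten, which is why "five nines" is so expensive to operate.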
Why Error Budgets Change Behaviour
If you have error budget remaining, you can deploy aggressively, run experiments, and take risks. If the error budget is spent, you freeze deployments and focus entirely on reliability. This creates a natural balance between velocity (new features) and stability (keeping things running). It is one of the most powerful concepts in platform engineering.
Building Your Observability Stack
You do not need to implement everything on day one. Start with health checks and basic metrics (CPU, memory, error rate). Add structured logging with correlation IDs. Then add distributed tracing. Then build dashboards. Then configure alerts. Each layer adds visibility.
The goal is not to collect data — it is to reduce the time between "something is wrong" and "we know exactly what is wrong and how to fix it." In production systems, that time difference can mean the difference between a minor incident and a major outage.
Part 2: Monitoring & Observability ← You are here
Part 3: Fault Tolerance and Incident Management
Part 4: DevOps, Automation, and Production Discipline
Part 5: What a Platform Engineer, SRE, or Cloud Engineer Actually Knows
Part 6: Networking Fundamentals
Part 7: Cybersecurity Fundamentals
Part 8: Cybersecurity in Practice
Part 9: Cybersecurity Careers

Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.
This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.
We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.
Reach out: sumit@getpostlabs.io