Monitoring, Observability, and Why You Cannot Fix What You Cannot See
The three pillars of observability — metrics, logs, and traces — and how they work together to keep production systems alive. A practical guide for engineers who have never set up monitoring before.
The Scenario
It is Monday morning. A compliance officer at a law firm tries to submit a Suspicious Matter Report. The page loads slowly, then times out. She tries again. Same result. She calls support. Support checks the application — it seems fine from the frontend. But somewhere between the user clicking "Submit" and the report being saved to the database, something is broken.
Without observability, finding the problem is like debugging a car engine in a dark room with no torch. With proper observability, you can see every component, every connection, every request in real time. This article explains how to build that visibility from scratch.
The Three Pillars of Observability
Metrics, Logs, and Traces — each tells a different story
Observability is not the same as monitoring. Monitoring tells you what is broken (CPU is at 95%). Observability tells you why it is broken (the CDD service is running a poorly-optimised query that scans 2 million rows because the index is missing). To achieve observability, you need all three pillars working together.
Pillar 1: Metrics
Numbers over time
Metrics are numerical measurements collected at regular intervals. They answer the question: "How is the system performing right now, and how does that compare to 5 minutes ago, 1 hour ago, or last week?"
| Metric Type | Example | What It Tells You |
|---|---|---|
| CPU Utilisation | CDD Service at 78% | Service is under heavy load — may need to scale |
| Memory Usage | Audit Service at 1.2GB / 2GB | Approaching memory limit — risk of OOM kill |
| Request Latency (p95) | Screening API: 2.3 seconds | 5% of requests are unacceptably slow |
| Error Rate | 4.2% of requests returning 5xx | Something is failing inside the service |
| Queue Depth | SQS risk-assessment queue: 847 messages | Consumer cannot keep up with producer |
| Database Connections | RDS active connections: 42/50 | Connection pool almost exhausted |
| Cache Hit Rate | Redis workflow template cache: 94% | Most requests served from cache — good |
| Disk IOPS | RDS read IOPS: 3,200/sec | Heavy read load — check for missing indexes |
Tools: Prometheus (collection + storage), Grafana (dashboards), AWS CloudWatch (native AWS metrics), Datadog (managed platform).
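The p95 latency row in the table above is worth unpacking, because percentile metrics catch problems that averages hide. A minimal sketch in plain Python (the sample values are illustrative):

```python
import statistics

def p95_latency(samples_ms):
    """Return the 95th-percentile latency from raw per-request samples.

    Percentiles matter because averages hide tail latency: a handful of
    multi-second requests disappears into a healthy-looking mean.
    """
    # quantiles(n=20) splits the data into 20 buckets; the 19th cut point
    # (index 18) is the 95th percentile.
    return statistics.quantiles(samples_ms, n=20)[18]

# 95 fast requests and 5 slow ones: the mean looks fine, the p95 does not.
samples = [120] * 95 + [2300] * 5
print(f"mean = {statistics.mean(samples):.0f}ms, p95 = {p95_latency(samples):.0f}ms")
```

With this data the mean sits around 229ms while the p95 exceeds 2 seconds, which is exactly why dashboards track p95/p99 rather than averages.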
Pillar 2: Logs
Events with context
Logs are timestamped records of events that happen inside your system. Every request, error, warning, and significant state change produces a log entry. In production systems, logs must be structured (JSON format) and include a correlation ID so you can trace a single request across multiple services.
```
// Bad log (unstructured — impossible to search or parse)
"Error processing request for user 12345"

// Good log (structured JSON — searchable, parseable, traceable)
{
  "timestamp": "2026-02-20T09:14:32.847Z",
  "level": "ERROR",
  "service": "cdd-service",
  "traceId": "abc-123-def-456",
  "tenantId": "pacific-coast",
  "userId": "user-789",
  "action": "calculateRiskScore",
  "entityId": "case-4521",
  "error": "Connection timeout to screening-service",
  "durationMs": 5003,
  "retryAttempt": 3
}
```

Why structured logs matter: When you have 12 microservices each producing thousands of log lines per minute, you need to search logs efficiently. "Show me all ERROR logs from the CDD service for tenant pacific-coast in the last hour where durationMs exceeds 3000" — this query is only possible with structured logs.
Tools: ELK Stack (Elasticsearch + Logstash + Kibana), AWS CloudWatch Logs, Splunk, Loki + Grafana, OpenSearch.
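To make the idea concrete, here is a minimal sketch of a structured JSON formatter using only the Python standard library. The field names mirror the example log above; adapt them to whatever your log pipeline expects:

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one structured JSON line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Context fields passed via `extra=` land on the record object.
        for field in ("traceId", "tenantId", "durationMs"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

logger = logging.getLogger("cdd-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# The correlation ID is generated once at the edge (e.g. the API gateway)
# and passed with every downstream call, so all services log the same one.
trace_id = str(uuid.uuid4())
logger.error("Connection timeout to screening-service",
             extra={"traceId": trace_id, "tenantId": "pacific-coast",
                    "durationMs": 5003})
```

In practice you would attach the correlation ID from an incoming HTTP header rather than generating it in the service itself.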
Pillar 3: Distributed Traces
The request journey
A single user action — "Submit CDD Case" — might touch 6 different services: API Gateway → CDD Service → Screening Service → Risk Engine → Audit Service → Notification Service. A distributed trace follows that single request across all 6 services, showing you exactly how long each step took, where it slowed down, and where it failed.
Example Trace: Submit CDD Case (Total: 4,247ms)
→ Bottleneck identified: Screening Service (DVS external API call) took 3.2 seconds — 75% of total request time.
Tools: OpenTelemetry (standard), Jaeger, Zipkin, AWS X-Ray, Datadog APM.
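A toy sketch of the core idea in plain Python — timed, named spans belonging to one trace. Real systems use OpenTelemetry, which also propagates the trace context across service boundaries via HTTP headers; this only illustrates the mechanics:

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_ms) pairs for this trace

@contextmanager
def span(name):
    """Record how long the wrapped block of work took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

with span("submit-cdd-case"):            # root span for the whole request
    with span("screening-service"):      # child span: the slow external call
        time.sleep(0.05)                 # stand-in for the DVS API round trip
    with span("risk-engine"):
        time.sleep(0.01)

# Sort spans by duration to find the bottleneck, as a trace UI does visually.
for name, ms in sorted(spans, key=lambda s: -s[1]):
    print(f"{name:>20}: {ms:6.1f}ms")
```

Sorting by duration immediately surfaces the screening call as the bottleneck, which is precisely what the flame-graph view in Jaeger or X-Ray gives you at a glance.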
Health Checks — The Heartbeat of Your System
Liveness, readiness, and why the load balancer needs both
Every microservice exposes health check endpoints. The load balancer and container orchestrator use these to decide whether a container is alive and whether it should receive traffic. Getting these wrong is one of the most common causes of production incidents.
Liveness Check (/health/live)
"Is the process running and not deadlocked?"
Returns 200 OK if the application process is alive. Does not check dependencies (database, cache, queues).
If it fails: The orchestrator kills and restarts the container.
Readiness Check (/health/ready)
"Can this instance handle traffic right now?"
Checks database connection, cache connection, message queue connectivity. Returns 200 only if all dependencies are reachable.
If it fails: The load balancer stops sending traffic to this instance (but does not kill it).
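The distinction can be sketched in plain Python. The dependency-check functions here are placeholders; in a real service they would ping the database pool, Redis client, and queue connection, and the two functions would back the /health/live and /health/ready routes in your web framework:

```python
# Hypothetical dependency checks — replace with real connectivity probes.
def check_db():    return True
def check_cache(): return True
def check_queue(): return True

def liveness():
    """Answers only: is the process up and responsive?

    Deliberately checks no external dependencies, so a database hiccup
    can never trigger a mass container restart.
    """
    return 200, {"status": "alive"}

def readiness(checks=(check_db, check_cache, check_queue)):
    """Answers: can this instance serve traffic right now?

    Any unreachable dependency takes the instance out of rotation
    (the load balancer stops routing to it) without killing the process.
    """
    failures = [c.__name__ for c in checks if not c()]
    if failures:
        return 503, {"status": "not ready", "failing": failures}
    return 200, {"status": "ready"}
```

The key design point is the asymmetry: readiness may return 503 freely and recover on its own, while liveness returning non-200 is a death sentence for the container.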
Common Mistake That Causes Cascading Failures
If your liveness check depends on the database, and the database has a brief hiccup, the orchestrator will kill all your containers simultaneously (because they all fail the liveness check at the same time). Then they all try to restart and reconnect to the database simultaneously, creating a thundering herd that makes the database problem worse. Liveness checks should only verify the process itself, never external dependencies.
Alerting — Signal vs Noise
The difference between useful alerts and alert fatigue
Bad alerting is worse than no alerting. If your on-call engineer gets 50 alerts a night, they start ignoring all of them — including the one that actually matters. Good alerting follows a principle: every alert must be actionable. If an alert fires and the response is "do nothing and wait", it should not be an alert.
| Severity | Condition | Response | Channel |
|---|---|---|---|
| P1 — Critical | Service completely down, data loss risk, or compliance reporting blocked | Wake the on-call engineer immediately | PagerDuty / Phone call |
| P2 — High | Service degraded (>10% error rate), database approaching limits | Respond within 30 minutes | Slack alert + PagerDuty |
| P3 — Medium | Queue depth growing, cache hit rate dropping, latency increasing | Investigate during business hours | Slack channel |
| P4 — Low | Certificate expiring in 14 days, disk usage at 70% | Create ticket, fix this week | Email / Dashboard |
Golden Rule of Alerting
Alert on symptoms (users are affected), not causes (CPU is high). CPU at 90% is not necessarily a problem if response times are fine. But response times at 5 seconds even with low CPU — that is a real problem. Focus alerts on what the user experiences: error rates, latency, availability.
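As a sketch of what a symptom-based alert looks like in practice, here is a Prometheus alerting rule for the P2 condition from the table above (error rate above 10%). The metric name `http_requests_total` is illustrative; match it to your own instrumentation:

```yaml
groups:
  - name: symptom-alerts
    rules:
      # Alert on what users experience (error rate), not on causes (CPU).
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.10
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "More than 10% of requests are failing"
```

The `for: 5m` clause is what keeps this actionable: a 10-second error spike resolves itself before anyone is paged.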
What to Monitor in a Real System
A layer-by-layer monitoring checklist
- Load Balancer (ALB)
- Application Services
- Message Queues (SQS)
- Databases (RDS / DocumentDB)
- Cache (Redis)
- External APIs
SLIs, SLOs, and Error Budgets
How Google thinks about reliability — and how you should too
Google's SRE book introduced a framework that has become the industry standard for measuring reliability. It gives you a shared language between engineering and business for discussing "how reliable is reliable enough."
SLI — Service Level Indicator
A measurable metric of service quality. Example: "Percentage of API requests that complete in under 500ms."
Think of it as: the measurement.
SLO — Service Level Objective
A target value for an SLI. Example: "99.9% of API requests should complete in under 500ms over a 30-day window."
Think of it as: the target.
Error Budget
100% minus the SLO = the amount of failure you can tolerate. If your SLO is 99.9%, your error budget is 0.1% — that is about 43 minutes of downtime per month.
Think of it as: how much room you have to take risks (deployments, experiments) before reliability is threatened.
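The 43-minute figure comes straight from the arithmetic. A quick sketch:

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Downtime allowed per window for a given availability SLO."""
    budget_fraction = (100 - slo_percent) / 100
    return budget_fraction * window_days * 24 * 60

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.1f} minutes of downtime per 30 days")
```

A 99.9% SLO yields 43.2 minutes per 30-day window; each extra nine divides the budget by ten, which is why "five nines" is so expensive to operate.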
Why Error Budgets Change Behaviour
If you have error budget remaining, you can deploy aggressively, run experiments, and take risks. If the error budget is spent, you freeze deployments and focus entirely on reliability. This creates a natural balance between velocity (new features) and stability (keeping things running). It is one of the most powerful concepts in platform engineering.
Building Your Observability Stack
You do not need to implement everything on day one. Start with health checks and basic metrics (CPU, memory, error rate). Add structured logging with correlation IDs. Then add distributed tracing. Then build dashboards. Then configure alerts. Each layer adds visibility.
The goal is not to collect data — it is to reduce the time between "something is wrong" and "we know exactly what is wrong and how to fix it." In production systems, that time difference can mean the difference between a minor incident and a major outage.
Part 2: Monitoring & Observability ← You are here
Part 3: Fault Tolerance and Incident Management
Part 4: DevOps, Automation, and Production Discipline
Part 5: What a Platform Engineer, SRE, or Cloud Engineer Actually Knows
Part 6: Networking Fundamentals
Part 7: Cybersecurity Fundamentals
Part 8: Cybersecurity in Practice
Part 9: Cybersecurity Careers

Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.
This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.
We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.
Reach out: sumit@getpostlabs.io