Networking Fundamentals for Platform Engineers
Every request that hits your application travels through layers of networking. If you do not understand how TCP/IP works, how DNS resolves, or what happens inside a VPC, you are debugging blind when things break.
Why Platform Engineers Need Networking Knowledge
When a compliance officer submits a customer due diligence case and gets a timeout error, the problem could be anywhere in the network stack: DNS resolution failed, a security group is blocking traffic, a NAT gateway is saturated, the load balancer health check is misconfigured, or a TLS certificate expired. If you only understand application code, you will stare at perfectly working code while the real problem is three layers below.
You do not need to become a network engineer. But you need to understand enough to diagnose problems, design infrastructure, and have productive conversations with networking teams.
The Network Stack — Layers That Matter
The OSI model, simplified for platform engineering
Networking is organised in layers. Each layer has a specific job and talks to the layer above and below it. You will hear references to "Layer 4" or "Layer 7" constantly in platform engineering — here is what each one means and why you care about it.
| Layer | Name | Key Protocols | What It Does | Platform Engineer Cares Because |
|---|---|---|---|---|
| 7 | Application | HTTP, HTTPS, DNS, FTP, SMTP, XMPP, SIP | Where your application communicates | API calls, REST endpoints, WebSocket connections, DNS resolution — this is where most debugging starts |
| 6 | Presentation | TLS/SSL, JSON, XML | Data formatting and encryption | TLS certificate issues, HTTPS handshake failures, certificate expiry alerts |
| 5 | Session | TLS sessions, WebSocket | Maintains connections between devices | Connection pooling, session timeouts, keep-alive settings |
| 4 | Transport | TCP, UDP | Reliable (TCP) or fast (UDP) data delivery | Port numbers, connection limits, load balancer types (L4 vs L7), health checks |
| 3 | Network | IPv4, IPv6, ICMP, IGMP | Routing packets between networks | VPC design, subnets, route tables, NAT gateways, security groups, CIDR blocks |
| 2 | Data Link | ARP, Ethernet, VLAN | Frame delivery within a local network | Rarely touched directly in cloud, but ARP resolution matters for on-premise / hybrid |
| 1 | Physical | Ethernet cables, Wi-Fi, Bluetooth | Raw electrical/optical signals | Cloud abstracts this away — but matters for edge computing and IoT |
The Practical Truth
As a platform engineer working in cloud (AWS, Azure, GCP), you will spend 90% of your networking time on Layers 3, 4, and 7. Layer 3 is VPC/subnet/routing design. Layer 4 is TCP connections, port management, and NLB (Network Load Balancer). Layer 7 is HTTP routing, ALB (Application Load Balancer), API Gateway, and DNS.
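To make the Layer 4 versus Layer 7 distinction concrete, here is a minimal sketch of the information each kind of load balancer has available when it picks a target. The pool and service names are hypothetical, and real balancers such as NLB and ALB do far more than this.

```python
def l4_decision(dst_port: int) -> str:
    """A Layer 4 balancer only sees addresses and ports, never the payload."""
    pools = {443: "web-pool", 5432: "db-pool"}  # hypothetical pool names
    return pools.get(dst_port, "drop")


def l7_decision(raw_request: bytes) -> str:
    """A Layer 7 balancer parses the HTTP request before choosing a target."""
    request_line = raw_request.split(b"\r\n", 1)[0].decode("ascii")
    _method, path, _version = request_line.split(" ")
    if path.startswith("/api/"):
        return "api-service"  # hypothetical service name
    return "web-frontend"     # hypothetical service name


if __name__ == "__main__":
    print(l4_decision(443))
    print(l7_decision(b"GET /api/cases HTTP/1.1\r\nHost: example.com\r\n\r\n"))
```

The Layer 4 function cannot tell two HTTP paths apart because it never reads the payload, which is why path-based routing needs a Layer 7 balancer.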
TCP vs UDP — The Two Ways Data Travels
Understanding when reliability matters and when speed matters
TCP (Transmission Control Protocol)
Reliable, ordered delivery. Every packet is acknowledged. If a packet is lost, it is retransmitted.
Three-way handshake: SYN → SYN-ACK → ACK. This establishes a connection before any data is sent.
Used by: HTTP/HTTPS (most web traffic; HTTP/3 is the exception and runs over QUIC on UDP), database connections (PostgreSQL, MongoDB), SSH, SMTP (email), FTP
Platform context: Every API call between your microservices uses TCP. Database connection pooling is about managing TCP connections efficiently.
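The connection-first behaviour of TCP can be observed on the loopback interface. This is a minimal sketch, not production code: the function names are made up for illustration, and the operating system performs the handshake for you inside `create_connection()`.

```python
import socket
import threading


def run_echo_server(server_sock: socket.socket) -> None:
    """Accept one connection and echo back whatever arrives."""
    conn, _addr = server_sock.accept()  # returns a connection whose handshake is complete
    with conn:
        conn.sendall(conn.recv(1024))


def tcp_round_trip(payload: bytes) -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]
        worker = threading.Thread(target=run_echo_server, args=(server_sock,))
        worker.start()
        # create_connection() triggers the SYN / SYN-ACK / ACK exchange.
        with socket.create_connection(("127.0.0.1", port), timeout=2) as client:
            client.sendall(payload)
            reply = client.recv(1024)
        worker.join()
        return reply


if __name__ == "__main__":
    print(tcp_round_trip(b"ping"))
```

Running `tcpdump -i lo` alongside this script should show the three handshake packets before any payload is sent.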
UDP (User Datagram Protocol)
Fast, no guarantees. Packets are sent without acknowledgement. If a packet is lost, it is gone unless the application itself retransmits.
No handshake: Just send the data. Lower latency than TCP because there is no connection setup.
Used by: DNS queries, STUN/TURN (WebRTC), video streaming, VoIP (SIP/SDP), SNMP (monitoring), game servers
Platform context: DNS resolution uses UDP by default for speed and falls back to TCP for large responses. Some health checks use UDP probes, although ping itself is ICMP, not UDP. Log shipping (syslog) often uses UDP.
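For contrast with TCP, a UDP exchange needs no connection setup at all. The sketch below sends one datagram across the loopback interface; the function name is illustrative. On a real network the datagram could be lost, in which case `recvfrom()` would raise a timeout.

```python
import socket


def udp_round_trip(payload: bytes) -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2)           # give up instead of waiting forever
        # No connect() and no handshake: the datagram is simply sent.
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(1024)
        return data


if __name__ == "__main__":
    print(udp_round_trip(b"hello"))
```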
Protocols You Will Actually Encounter
A practical reference for the protocols that appear in production
Data Communication
ARP
Address Resolution Protocol: maps IP addresses to MAC addresses on a local network. If ARP resolution fails, devices on the same subnet cannot talk to each other.
IPv4 / IPv6
The addressing system of the internet. IPv4 uses 32-bit addresses (e.g., 10.0.1.50). IPv6 uses 128-bit addresses. VPCs use IPv4 CIDR blocks (e.g., 10.0.0.0/16) for subnet design.
ICMP
Internet Control Message Protocol: the protocol behind "ping" and "traceroute". Used to test if a host is reachable and to diagnose routing issues.
IGMP
Internet Group Management Protocol: manages multicast group memberships. Less common in cloud, but relevant for streaming and media applications.
Signalling & Real-Time
SIP / SDP
Session Initiation Protocol / Session Description Protocol: used in VoIP and video conferencing to set up, manage, and tear down real-time communication sessions.
XMPP
Extensible Messaging and Presence Protocol: used for real-time messaging, presence information, and contact lists. Think chat applications and IoT device communication.
Application Protocols
DNS
Domain Name System: translates human-readable names (api.example.com) to IP addresses. If DNS fails, everything fails. Route 53 is AWS's DNS service.
HTTP / HTTPS
The foundation of web communication. HTTPS adds TLS encryption. Every API call, every web page, every webhook uses HTTP(S). Status codes (200, 404, 500, 503) are your diagnostic language.
FTP / TFTP / SFTP
File Transfer Protocol. FTP is authenticated; TFTP (Trivial FTP) is unauthenticated and uses UDP. SFTP (SSH-based) is the modern secure alternative.
STUN / TURN
Used in WebRTC for NAT traversal: helping peer-to-peer connections work when devices are behind firewalls and NAT gateways.
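The HTTP status codes mentioned above carry a first diagnostic hint in their leading digit. As a small sketch of that idea, the hypothetical `triage()` helper below uses the standard library's `HTTPStatus` enum to turn a code into its reason phrase and a rough pointer to where to look. The hints are illustrative, not a complete runbook.

```python
from http import HTTPStatus


def triage(status_code: int) -> str:
    """Map a status code to its reason phrase and a rough first hint."""
    phrase = HTTPStatus(status_code).phrase  # raises ValueError for unknown codes
    if status_code >= 500:
        hint = "server side: check the service, its dependencies, and load balancer targets"
    elif status_code >= 400:
        hint = "client side: check the request path, payload, and credentials"
    else:
        hint = "not an error"
    return f"{status_code} {phrase}: {hint}"


if __name__ == "__main__":
    for code in (200, 404, 500, 503):
        print(triage(code))
```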
Network Management
SNMP
Simple Network Management Protocol: used to monitor and manage network devices (routers, switches, servers). SNMP traps are how network hardware reports problems. CloudWatch and Datadog often replace SNMP in cloud environments.
Wireless
Bluetooth
Short-range wireless for IoT devices, sensors, peripherals. Bluetooth Low Energy (BLE) is critical for IoT platforms and edge computing.
Cloud Networking — VPC Design in Practice
How all these protocols come together in a real AWS environment
A Virtual Private Cloud (VPC) is your own isolated network inside AWS. Every production system runs inside a VPC. Understanding VPC design is where networking theory meets platform engineering practice.
# Typical VPC architecture for a compliance platform
VPC: 10.0.0.0/16 (65,536 IP addresses)
│
├── Public Subnets (internet-facing)
│   ├── AZ-a: 10.0.1.0/24 → ALB, NAT Gateway, Bastion Host
│   └── AZ-b: 10.0.2.0/24 → ALB (redundant), NAT Gateway
│
├── Private Subnets (application layer, NO direct internet access)
│   ├── AZ-a: 10.0.10.0/24 → ECS Fargate tasks (CDD, Screening, Identity)
│   └── AZ-b: 10.0.11.0/24 → ECS Fargate tasks (redundant)
│
├── Data Subnets (database layer, most restricted)
│   ├── AZ-a: 10.0.20.0/24 → RDS Primary, ElastiCache, OpenSearch
│   └── AZ-b: 10.0.21.0/24 → RDS Standby (Multi-AZ failover)
│
Traffic flow:
Internet → CloudFront (CDN) → WAF → ALB (public subnet) → ECS tasks (private subnet) → RDS/Redis (data subnet)
ECS tasks → internet (for external APIs like DVS):
Private subnet → NAT Gateway (public subnet) → Internet
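A CIDR plan like the one above is easy to sanity-check before anything is provisioned. The sketch below uses Python's standard `ipaddress` module to confirm that every subnet sits inside the VPC range and that no two subnets overlap. The subnet names and the `check_plan()` helper are made up for illustration.

```python
import ipaddress
from itertools import combinations

VPC_CIDR = "10.0.0.0/16"
SUBNETS = {  # names are illustrative; CIDRs are taken from the diagram
    "public-a": "10.0.1.0/24",
    "public-b": "10.0.2.0/24",
    "private-a": "10.0.10.0/24",
    "private-b": "10.0.11.0/24",
    "data-a": "10.0.20.0/24",
    "data-b": "10.0.21.0/24",
}


def check_plan(vpc_cidr: str, subnets: dict) -> list:
    """Return a list of human-readable problems; empty means the plan is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems


if __name__ == "__main__":
    print(ipaddress.ip_network(VPC_CIDR).num_addresses)  # size of the /16
    print(check_plan(VPC_CIDR, SUBNETS))
```

Each /24 holds 256 addresses, but AWS reserves five addresses in every subnet, so plan for 251 usable addresses per subnet.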
Security Groups
Virtual firewalls attached to each resource. Stateful: if you allow inbound traffic, the response is automatically allowed. Define rules like "Allow TCP port 5432 from private subnets only" for database access.
NACLs
Network Access Control Lists: subnet-level firewalls. Stateless: you must explicitly allow both inbound and outbound traffic, including the return traffic for each connection. They act as a second layer of defence alongside security groups.
Route Tables
Define where traffic goes. Public subnet route: 0.0.0.0/0 → Internet Gateway. Private subnet route: 0.0.0.0/0 → NAT Gateway. Data subnet: no route to internet at all.
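Route tables are evaluated by longest-prefix match: the most specific route that contains the destination wins. The sketch below models the three kinds of route table described above. The target labels and the `next_hop()` helper are illustrative, and the "local" route stands for the VPC-internal route that every AWS route table includes.

```python
import ipaddress

# Simplified route tables: (destination CIDR, target). Target names are labels only.
PUBLIC_ROUTES = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "internet-gateway")]
PRIVATE_ROUTES = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")]
DATA_ROUTES = [("10.0.0.0/16", "local")]  # no default route: no path to the internet


def next_hop(routes: list, destination: str):
    """Return the target of the most specific matching route, or None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


if __name__ == "__main__":
    print(next_hop(PRIVATE_ROUTES, "10.0.20.10"))  # database inside the VPC
    print(next_hop(PRIVATE_ROUTES, "8.8.8.8"))     # external address
    print(next_hop(DATA_ROUTES, "8.8.8.8"))        # no route at all
```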
Diagnostic Tools — Your Networking Toolkit
Commands you will use when debugging connectivity problems
| Tool | Command Example | What It Tests | When to Use It |
|---|---|---|---|
| ping | ping 10.0.10.50 | ICMP reachability — is the host alive? | "Is this server even reachable from here?" |
| traceroute | traceroute api.example.com | Shows every network hop between you and the destination | "Where is the traffic being dropped?" |
| nslookup / dig | dig api.example.com | DNS resolution — does the name resolve to the right IP? | "Is DNS returning the correct address?" |
| curl | curl -v https://api.example.com/health | HTTP connectivity including TLS handshake and response | "Can I reach the API? What status code do I get?" |
| telnet / nc | nc -zv 10.0.20.10 5432 | TCP port connectivity — is the port open and accepting connections? | "Can my app reach the database port?" |
| ss / netstat | ss -tlnp | Lists listening TCP sockets and the processes that own them (ss -tnp shows established connections instead) | "What ports is this container listening on?" |
| tcpdump | tcpdump -i eth0 port 443 | Captures raw network packets for deep inspection | "I need to see exactly what traffic is flowing" |
| mtr | mtr api.example.com | Combines ping + traceroute with continuous monitoring | "Is there intermittent packet loss on a specific hop?" |
| openssl | openssl s_client -connect api.example.com:443 | Tests TLS/SSL connection and certificate details | "Is the certificate valid? Is the TLS version correct?" |
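Minimal container images often ship without nc or telnet. If Python is available, a rough equivalent of the `nc -zv` port check from the table can be written with the standard `socket` module. The `port_is_open()` helper is a sketch: it only checks that a TCP connection can be established, not that the service behind the port is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex() returns 0 on success and an errno value on failure.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Address and port taken from the table's nc example.
    print(port_is_open("10.0.20.10", 5432))
```

A quick refusal usually means nothing is listening on that port. A timeout more often points to a security group, NACL, or routing problem silently dropping the packets.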
DNS — The Internet's Phone Book (And Why It Breaks Everything)
The most underestimated piece of infrastructure
When you type "app.example.com" in a browser, DNS translates that name into an IP address (like 13.55.123.45) so your computer knows where to send the request. DNS failures are one of the most common causes of "everything is down" incidents that are not actually application failures.
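The same lookup a browser performs can be reproduced from code, which is handy when dig is not installed. This sketch asks the operating system's resolver for the addresses behind a name. The `resolve()` helper is illustrative, and the result depends on the resolver configuration of the machine it runs on.

```python
import socket


def resolve(hostname: str) -> list:
    """Return the sorted, de-duplicated IP addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(resolve("localhost"))
```

If the name does not resolve, `getaddrinfo()` raises `socket.gaierror`, which is the programmatic form of the "DNS resolution failed" case.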
DNS Record Types Platform Engineers Manage
The TTL Trap
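If you change a DNS record but the TTL was set to 86400 (24 hours), some users will still see the old IP address for up to 24 hours, because resolvers keep serving the cached answer until it expires. Before any DNS migration, lower the TTL to 300 seconds (5 minutes) at least one full old TTL (24 hours here) in advance, so that every cache has picked up the short TTL before the record changes. Skipping this step is one of the most common mistakes in production DNS changes.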
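The timing rule above is simple arithmetic, sketched here with a hypothetical `ttl_migration_plan()` helper. It assumes resolvers honour TTLs, which some do not, so treat the results as the earliest safe times rather than guarantees.

```python
from datetime import datetime, timedelta


def ttl_migration_plan(cutover: datetime, old_ttl_seconds: int,
                       new_ttl_seconds: int = 300) -> dict:
    """Work out when to lower the TTL and when old answers should have expired."""
    return {
        # Lower the TTL at least one full old TTL before the change.
        "lower_ttl_by": cutover - timedelta(seconds=old_ttl_seconds),
        # After cutover, caches may serve the old IP for up to the new TTL.
        "old_ip_expired_by": cutover + timedelta(seconds=new_ttl_seconds),
    }


if __name__ == "__main__":
    plan = ttl_migration_plan(datetime(2024, 1, 10, 12, 0), old_ttl_seconds=86400)
    for step, when in plan.items():
        print(step, when.isoformat())
```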
Platform Engineering Series
This article is Part 6 of a 9-part series.
Note: The architecture examples in this series reference LexAML, a real-world AML/CTF compliance platform. The diagrams shown are high-level representations shared for educational purposes.
This content is compiled from various industry sources, official documentation, and practical experience gained across production environments. Your experience may differ based on your organisation, tech stack, and industry context.
We are continuously developing and fine-tuning this content. If something differs from your understanding, or if you have suggestions for improvement, we would genuinely appreciate hearing from you.
Reach out: sumit@getpostlabs.io