Observability
Knowing what your system is actually doing — before users tell you.
AI can write the code; it can't tell you that checkout is silently failing for 3% of users on one browser. The durable skill is knowing what to measure and reading the signal under pressure. The middle of an incident is the worst time to learn it.
What you own
- Deciding what's worth measuring (the targets/SLOs)
- Judging signal vs noise during an incident
- The root-cause call
- What is actually user-facing
Hand to AI
- Adding structured logging + trace context to handlers
- Writing alert rules from a described target
- Drafting dashboards
- Summarising a noisy log dump
What to learn (the durable stuff)
The three pillars
logs (what happened), metrics (how much / how often), traces (the path of one request).
Symptoms vs causes
an error spike is the symptom; the trace tells you the cause.
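One way to read a trace for the cause is to look at each span's self time (its own duration minus the time spent in its children). This is a minimal sketch, not a real tracing API: the `Span` shape, the span names, and the durations are all made up for illustration, and it assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    duration_ms: float


def self_times(spans):
    """Return {span_id: duration minus the summed duration of direct children}."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace of one slow checkout request.
trace = [
    Span("a", None, "POST /checkout", 2000.0),
    Span("b", "a", "load cart", 50.0),
    Span("c", "a", "charge card", 1900.0),
    Span("d", "c", "SELECT orders", 1800.0),
]

self_ms = self_times(trace)
slowest = max(trace, key=lambda s: self_ms[s.span_id])
print(slowest.name, self_ms[slowest.span_id])
```

The root span is the slowest by total duration, but that is just the symptom restated. Self time points at the span where the time is actually spent.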
Percentiles over averages
p95/p99 latency (the latency that the slowest 5%/1% of requests reach or exceed) is what users feel; the average hides it.
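A small worked example of why the average hides the tail. The latency numbers are invented, and `percentile` here is a simple nearest-rank calculation written for this sketch; real metrics systems usually estimate percentiles from histograms instead.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 90 fast requests and 10 very slow ones.
latencies_ms = [100] * 90 + [4000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

With these made-up numbers the mean is 490 ms, which describes nobody: most requests take 100 ms and one in ten takes 4 seconds. The p95 and p99 show the 4-second experience directly.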
What to alert on
alert on user-facing symptoms (error rate, latency), not every internal blip. Alert fatigue kills response.
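A symptom-based paging rule might look roughly like this. It is a sketch under stated assumptions: the `Window` fields, the thresholds (1% errors, 800 ms p95), and the minimum-traffic guard are illustrative values, not recommendations, and a real rule would live in your alerting system rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Aggregated stats for one evaluation window (hypothetical shape)."""
    requests: int
    errors: int
    p95_latency_ms: float


def should_page(window, max_error_rate=0.01, max_p95_ms=800.0, min_requests=100):
    """Page only on user-facing symptoms, and only with enough traffic to trust the ratio."""
    if window.requests < min_requests:
        # Too few requests: a single failure would swing the error rate wildly.
        return False
    error_rate = window.errors / window.requests
    return error_rate > max_error_rate or window.p95_latency_ms > max_p95_ms


print(should_page(Window(requests=1000, errors=30, p95_latency_ms=200.0)))
print(should_page(Window(requests=1000, errors=2, p95_latency_ms=200.0)))
```

Note what is absent: nothing here pages on CPU, queue depth, or a single retry. Those belong on a dashboard, where they help explain a page rather than cause one.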
Correlation IDs and structured logs
put a correlation/trace ID on every structured log line, so you can follow one request across services.
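A minimal sketch of the idea using only the Python standard library. The logger name `checkout`, the field names, and the `handle_request` function are assumptions for illustration; in a real service the ID would usually arrive in a request header and be passed on to downstream calls.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID of the request currently being handled in this context.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the current request ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        })


def build_logger(stream):
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_request_id=None):
    # Reuse the caller's ID if one was sent; otherwise start a new one.
    request_id = incoming_request_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("checkout started")
        logger.info("payment authorised")
    finally:
        request_id_var.reset(token)
    return request_id


stream = io.StringIO()
logger = build_logger(stream)
handle_request(logger, incoming_request_id="req-123")
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(lines)
```

Because every line is JSON with the same `request_id`, one filter in your log tool pulls up the whole story of a single request, across every service that logs the same ID.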
Sampling and cost
tracing everything is expensive; sample smartly, for example by keeping a small share of routine traces but every error.
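One common approach is to decide from a hash of the trace ID, so every service makes the same keep/drop decision for the same request. The sketch below is an illustration, not any particular tracing library's sampler: `keep_trace` and its parameters are hypothetical, and always keeping errors assumes the decision is made after the outcome is known (tail-based), which not every setup supports.

```python
import hashlib


def keep_trace(trace_id, sample_rate, is_error=False):
    """Deterministically keep roughly `sample_rate` of traces, plus every error."""
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


# Hypothetical trace IDs, sampled at 10%.
kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
print(f"kept {kept} of 10000 routine traces")
print(keep_trace("trace-42", 0.1, is_error=True))
```

Hashing rather than calling a random number generator means a trace is never half-recorded: if one service keeps it, they all do.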
Current tools (these change fast)
Practice this scenario
Your app got 20x slower after launch but CPU looks fine. You have request traces and DB query logs. How do you find the bottleneck — and what's the one graph you'd pull up first?