Observability

Knowing what your system is actually doing — before users tell you.

AI can write the code; it can't tell you checkout is silently failing for 3% of users on one browser. The durable skill is knowing what to measure and reading the signal under pressure, and the middle of an incident is the worst time to learn it.

Ready to test yourself on this kind of call?

Practice debugging under pressure

What you own

  • Deciding what's worth measuring (the targets/SLOs)
  • Judging signal vs noise during an incident
  • The root-cause call
  • What is actually user-facing

Hand to AI

  • Adding structured logging + trace context to handlers
  • Writing alert rules from a described target
  • Drafting dashboards
  • Summarising a noisy log dump

What to learn (the durable stuff)

The three pillars

logs (what happened), metrics (how much / how often), traces (the path of one request).
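
A minimal sketch of one handler emitting all three signals, using in-memory stand-ins instead of a real backend. The names `handle_checkout`, `logs`, `metrics`, and `spans` are hypothetical and exist only for illustration; a real service would send these to a logging pipeline, a metrics store, and a tracing backend.

```python
import json
import time
import uuid
from collections import Counter

logs = []            # log: what happened (one structured line per event)
metrics = Counter()  # metric: how much / how often (aggregated counts)
spans = []           # trace: the path of one request (timed steps sharing a trace ID)


def handle_checkout(cart_total, trace_id=None):
    """Hypothetical handler that records a log line, a metric, and a span."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()

    # Stand-in for real work: an empty cart counts as a failure here.
    status = "ok" if cart_total > 0 else "error"

    logs.append(json.dumps({"event": "checkout", "status": status, "trace_id": trace_id}))
    metrics[("checkout_requests", status)] += 1
    spans.append({
        "trace_id": trace_id,
        "name": "handle_checkout",
        "duration_s": time.perf_counter() - start,
    })
    return status
```

The log answers "what happened to this request", the counter answers "how often does this happen", and the span answers "where did this request spend its time". The shared `trace_id` is what lets you move between them.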

Symptoms vs causes

an error spike is the symptom; the trace of one failing request shows where in its path things go wrong, which points you to the cause.
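
One way to read a trace for the cause is to ask which step spent the most time in itself, not in the steps it called. The sketch below assumes a simplified span format (`id`, `parent_id`, `name`, `duration_ms`) and assumes child spans run one after another; both are illustrative assumptions, not a real tracing schema.

```python
from collections import defaultdict


def slowest_span_by_self_time(spans):
    """Return the span with the largest self time.

    Self time = the span's own duration minus the total duration of its
    direct children. Assumes children run sequentially, not in parallel.
    """
    child_time = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_time[span["parent_id"]] += span["duration_ms"]
    return max(spans, key=lambda s: s["duration_ms"] - child_time[s["id"]])


# Hypothetical trace of one slow request.
trace = [
    {"id": "a", "parent_id": None, "name": "POST /checkout", "duration_ms": 2100},
    {"id": "b", "parent_id": "a", "name": "auth.verify", "duration_ms": 40},
    {"id": "c", "parent_id": "a", "name": "db.query orders", "duration_ms": 1950},
    {"id": "d", "parent_id": "a", "name": "render", "duration_ms": 60},
]

culprit = slowest_span_by_self_time(trace)
```

The root span is the longest overall, but almost all of its time belongs to the database query, so that is where to look first.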

Percentiles over averages

p95/p99 latency (the threshold that the slowest 5%/1% of requests exceed) is what users feel; an average hides that slow tail.
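
A small worked example of how an average can hide a slow tail. The latency values are made up for illustration, and `percentile` uses the simple nearest-rank method; monitoring tools often interpolate or estimate from histograms instead.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests and 10 very slow ones.
latencies_ms = [100] * 90 + [3000] * 10

mean_ms = statistics.mean(latencies_ms)  # 390
p50_ms = percentile(latencies_ms, 50)    # 100
p95_ms = percentile(latencies_ms, 95)    # 3000
```

The mean of 390 ms matches nobody's experience: most requests took 100 ms, and one in ten took 3 seconds. The p95 exposes that tail directly.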

What to alert on

alert on user-facing symptoms (error rate, latency), not every internal blip. Alert fatigue kills response.
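
A sketch of a symptom-based paging decision, assuming you already have request and failure counts for a recent time window. The function name and both thresholds (`max_error_rate`, `min_requests`) are hypothetical and would come from your own targets.

```python
def should_page(total_requests, failed_requests, max_error_rate=0.01, min_requests=100):
    """Page a human only when the user-facing error rate breaches the target.

    The thresholds are illustrative defaults, not recommendations.
    """
    if total_requests < min_requests:
        # Too little traffic to judge; avoids paging over a single failed request.
        return False
    return failed_requests / total_requests > max_error_rate
```

For example, `should_page(1000, 30)` pages because 3% of requests failed, while `should_page(10, 5)` stays quiet because ten requests are too few to act on. The minimum-traffic guard is one simple way to cut the noise that leads to alert fatigue.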

Correlation IDs and structured logs

put a correlation/trace ID on every structured log line so you can follow one request across services.
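
A minimal sketch of attaching one ID to every log line for a request, using only the Python standard library. `JsonFormatter`, `build_logger`, and `handle_request` are hypothetical names, and how the incoming ID arrives (for example in an HTTP header) is left out as an assumption.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the request currently being handled in this context.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the current trace ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        })


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr when None)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's ID when one is passed along; otherwise start a new one."""
    token = trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("checkout started")
        logger.info("payment authorised")
    finally:
        trace_id_var.reset(token)
```

Because every line is JSON with a `trace_id` field, searching your logs for one ID returns the full story of one request, as long as each downstream service is handed the same ID.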

Sampling and cost

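tracing every request is expensive to collect and store; keep a sample of ordinary traffic, and keep the errors and slow requests you will actually want to inspect.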
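
One possible sampling rule, sketched under assumptions: keep every error and slow request, and keep a fixed fraction of everything else, chosen by hashing the trace ID. `keep_trace` and its default thresholds are hypothetical. Deciding on errors or duration requires waiting until the request has finished, which is usually called tail-based sampling.

```python
import hashlib


def keep_trace(trace_id, sample_rate=0.1, is_error=False, duration_ms=0.0, slow_ms=1000.0):
    """Decide whether to store a trace. Defaults are illustrative, not recommendations."""
    if is_error or duration_ms >= slow_ms:
        return True  # always keep the traces you are most likely to need

    # Map the trace ID to a 64-bit bucket. Hashing makes the decision
    # deterministic, so every service that sees the same ID agrees.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)
```

With `sample_rate=0.1`, roughly one in ten ordinary traces is stored, while failures and slow requests are never dropped. A random coin flip per service would also hit the target rate, but it would leave you with partial traces where only some services kept their part.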

Current tools (these change fast)

Sentry: Errors + performance; fastest path to real signal.
Grafana + Prometheus: Open-source metrics and dashboards.
Datadog: All-in-one, powerful but can get pricey fast.
OpenTelemetry: Vendor-neutral instrumentation standard; learn this, not a single vendor SDK.

Practice this scenario

Your app got 20x slower after launch but CPU looks fine. You have request traces and DB query logs. How do you find the bottleneck — and what's the one graph you'd pull up first?
