Application Observability: A Developer’s Perspective

Odin (Ondřej Popelka)


TechMeetup Conference 2023

What I Do

  • Senior Backend Engineer at Keboola
  • Architecture, Service Design, API Design, Resources setup via Terraform, CI pipelines, Coding, Monitoring, Operations, 24/7 Support, Vacuuming, Washing the dishes, …

What We Do

  • Data Operating System, Data Stack, Data processing platform.
  • If you have a (big)data problem, we’re likely to have it solved.
  • If you have 2+ information systems in your company that do not talk to each other, we make them talk.

Keboola Logo

What does DevOps do?

  • SRE gives us the Kubernetes cluster.
  • SRE gives us the networking (private clusters).
  • SRE gives us the monitoring tools.
  • SRE watches that we do not do anything obviously stupid.
  • The UI consumes the API blueprint.

For everything else, there is DevOps

Random Numbers

  • 20 domain services (PHP8, Node.js, Go + lots of Python & PHP7 + bits of Java, PHP5, R)
  • 1 monolith service (~7 more domains)
  • 1000+ integrations
  • 120+ Kubernetes nodes, 9 production stacks, 3 clouds (AWS, Azure, GCP)
  • 280+ requests per second, 24+ million / day
  • 260,000+ asynchronous jobs a day – ranging from 1 second to 24 hours
  • 1,500,000+ lines of code, 13 developers, 4 SRE

Environment

  • High heterogeneity
  • High load variability
  • High request length variability
  • Uneven distribution of requests

Must have

  • High automation, High reliability, High observability

Easy part

Latency is the king

  • Latency is what the user feels.
  • Measure XXth percentile (p90, p75, p50).
  • A big difference between the percentiles means that the service is unstable.
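
A minimal sketch of the percentile comparison above, in Python. The sample latencies and the `spread` ratio (p90 divided by p50) are illustrative assumptions, not something from the talk; any real threshold for "unstable" would have to be tuned per service.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50, p75 and p90 of a list of request latencies (milliseconds)."""
    # quantiles(n=100) returns the 99 cut points p1..p99, so p50 is index 49.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p75": cuts[74], "p90": cuts[89]}


def spread(percentiles):
    """Ratio p90 / p50: a hypothetical indicator of how far the tail is from the median."""
    return percentiles["p90"] / percentiles["p50"]


# Made-up samples: same median, very different tail.
stable = [100, 105, 110, 95, 102, 98, 101, 99, 104, 103]
unstable = [100, 105, 110, 95, 102, 98, 101, 99, 2500, 3100]

for name, samples in (("stable", stable), ("unstable", unstable)):
    p = latency_percentiles(samples)
    print(name, p, "spread:", round(spread(p), 2))
```

Both sample sets have the same p50, so a dashboard showing only the median would not tell them apart; the gap between p50 and p90 does.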

Good latency

Still Easy part

  • Graph of obviously bad latency

Bad latency

Still Easy part ?

  • Graph of obviously bad latency: No Bad latency
  • Apdex (Application Performance Index) monitoring: Apdex
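
For reference, the standard Apdex score for a target threshold T counts requests finishing within T as satisfied, requests finishing within 4T as tolerating, and everything else as frustrated; the score is (satisfied + tolerating / 2) / total. A small Python sketch follows. Treating failed requests as frustrated is an assumption here (it is how several monitoring tools do it), and the threshold and sample values are made up.

```python
def apdex(requests, threshold_s):
    """Compute the Apdex score for (latency_seconds, failed) pairs.

    Returns a value between 0.0 and 1.0, or None when there are no requests.
    Assumption: a failed request is always counted as frustrated.
    """
    satisfied = 0
    tolerating = 0
    total = 0
    for latency_s, failed in requests:
        total += 1
        if failed:
            continue  # frustrated, contributes nothing to the score
        if latency_s <= threshold_s:
            satisfied += 1
        elif latency_s <= 4 * threshold_s:
            tolerating += 1
    if total == 0:
        return None
    return (satisfied + tolerating / 2) / total


# Hypothetical window of four requests with T = 0.5 s:
# two satisfied, one tolerating (0.9 s), one frustrated (3.0 s).
sample = [(0.1, False), (0.3, False), (0.9, False), (3.0, False)]
print(apdex(sample, threshold_s=0.5))  # 0.625
```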

Error rate is the Queen

  • The very first metric is Error rate
  • First to look at when something goes wrong
  • First to monitor with APDEX No Error-ish service

“Weird” API endpoints

  • If the request fails, it’s actually a valid situation.
  • Error rate can be very high, but never 100%.
  • Always monitor individual endpoints, not services!
  • “Negative” metric – there must be at least some requests succeeding.
  • I do appreciate tips on how to monitor these.
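
One possible way to express the “negative” metric, sketched in Python: instead of alerting on a high error rate, flag an endpoint only when it received traffic in a time window and not a single request succeeded. The endpoint names, the `min_requests` cut-off and the "status below 400 means success" rule are all assumptions for illustration.

```python
from collections import defaultdict


def endpoints_without_success(requests, min_requests=5):
    """Return endpoints that had traffic in the window but zero successes.

    requests: iterable of (endpoint, http_status) pairs seen in one time window.
    min_requests: hypothetical cut-off that ignores endpoints with too little traffic.
    """
    total = defaultdict(int)
    succeeded = defaultdict(int)
    for endpoint, status in requests:
        total[endpoint] += 1
        if status < 400:  # assumption: anything below 400 counts as success
            succeeded[endpoint] += 1
    return sorted(
        endpoint
        for endpoint, count in total.items()
        if count >= min_requests and succeeded[endpoint] == 0
    )


# Made-up window: /validate fails most of the time but sometimes succeeds (fine),
# /import fails every time (suspicious).
window = (
    [("/validate", 400)] * 9 + [("/validate", 200)]
    + [("/import", 500)] * 6
    + [("/health", 200)] * 3
)
print(endpoints_without_success(window))  # ['/import']
```

Evaluating this per endpoint rather than per service keeps a healthy high-traffic endpoint from hiding one that never succeeds.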

Diagnosing

Old Lady

  • Latency breakdown;
  • Breakdown of time spent in “3rd party” services:

Latency Breakdown

FlameGraph is God

  • It is absolutely crucial that they are cross-service.
  • One request: Flamegraph

FlameGraph cont.

  • Breakdown of time spent by the business logic: Flamegraph

FlameGraph cont.

  • Includes time in DB by 3rd party services: Flamegraph
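
To illustrate what a cross-service flamegraph encodes, here is a rough Python sketch that takes the spans of one traced request and attributes "self time" to each service: the duration of a span minus the time spent in its direct children. The `Span` structure and the service names are hypothetical, and the sketch assumes child spans run sequentially and nest fully inside their parent; real tracing tools handle overlapping and asynchronous spans.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span of the request
    service: str
    start_ms: int
    end_ms: int


def self_time_by_service(spans):
    """Sum, per service, the time not delegated to child spans."""
    duration = {s.span_id: s.end_ms - s.start_ms for s in spans}
    child_time = defaultdict(int)
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] += duration[s.span_id]
    result = defaultdict(int)
    for s in spans:
        result[s.service] += duration[s.span_id] - child_time[s.span_id]
    return dict(result)


# One made-up request crossing four services.
trace = [
    Span("a", None, "api", 0, 100),
    Span("b", "a", "job-queue", 10, 40),
    Span("c", "b", "database", 15, 35),
    Span("d", "a", "storage", 50, 90),
]
print(self_time_by_service(trace))
```

Because the database span belongs to a different service than the one the user called, this breakdown only works when trace context is propagated across service boundaries.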

What are good metrics ?

Silence

  • Incident proven:
    • 250+ incidents per month,
    • Fail, fail, fail, succeed…
  • After an incident:
    • Find what metric/alarm should’ve triggered;
    • Find metrics that shouldn’t have triggered;

Metrics give suspicion × Flamegraphs and traces give insight

How to get a good metric?

  • Must be representative of the end-user experience.
  • At the same time it can be totally Meaningless™.
  • ex. “Iteration time”:
    • When divided by the number of jobs, it represents the upper bound of the time between a job being received on the internal queue and forwarded to the worker to be switched to the processing state and picked up by the processing engine.
    • It should be between 0.1 and 5
    • Why not 7 ?
  • Beware of changes in code that affect the metric!

When to watch the metrics?

  • When incidents are triggered;
  • Ideally every second morning;
  • After a deploy and during database migrations; Versions

What are the best dashboards?

Eventually all end up like this:

Dashboard

The best ones are those that do not eat your battery when you’re on 24/7

Who’s watching the costs?


Budget alerts

  • Everything else is wrong.
  • Applies to personal pet projects too.
  • Budget alerts also apply to the cost of the monitoring.

Hard Part

  • Asynchronous jobs
    • Containers that run from seconds to up to days
  • Endless loop
    • Non-interactive daemons that run for days to months
    • Queue workers, stream processors, …

Some other time…

Thanks

Questions & Comments ?

linkedin.com/in/odinuv


Keboola Logo Vacancy

keboola.com/about/jobs