
Observability and Monitoring in a nutshell

by dejanualex, March 20th, 2022



“If you can’t measure it, you can’t improve it” - some famous person



Understanding the differences


  • Monitoring — understand the state of the system, based on gathering predefined sets of metrics or logs.

  • Observability — infer the state of a system, based on exploring properties and patterns not defined in advance.


Why do we need monitoring?

Monitoring should address two questions: what’s broken, and why? The “what” is the symptom and the “why” is its cause. Keeping the two apart is one of the most important distinctions in doing good monitoring with maximum signal and minimum noise.


Example symptoms and causes (Google SRE book)



Why do we need observability?

Monitoring captures and displays predefined data, which gives a restricted view of the system. Observability goes further: it uses the data the system generates (logs, metrics, traces) to reason about the system's health and, to some extent, anticipate it.




Many software jobs (especially SRE roles) involve several monitoring tech stacks; one might argue that you can make a living just from mastering those technologies.



Tooling landscape


At first glance, the tooling landscape looks overwhelming, especially since each technology comes with its own nomenclature: forwarder, indexer, exporter, data source, controller, and so on. To navigate it, we need to know the basics.


System metrics vs application metrics


Usually, system metrics capture infrastructure-related measurements such as CPU and memory consumption, disk I/O, and network I/O, whereas application metrics cover error rates, requests per minute, and average response times.
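To make the application side concrete, here is a minimal sketch of how the three application metrics mentioned above could be derived from raw request records. The `Request` record shape, the "status 500 and above counts as an error" rule, and the one-minute minimum window are all assumptions made for illustration, not the behavior of any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Request:
    """One handled request (hypothetical record shape, for illustration only)."""
    timestamp: datetime
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the request


def application_metrics(requests: list[Request]) -> dict[str, float]:
    """Compute error rate, requests per minute and average response time."""
    if not requests:
        return {"error_rate": 0.0, "requests_per_minute": 0.0, "avg_response_ms": 0.0}

    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    first = min(r.timestamp for r in requests)
    last = max(r.timestamp for r in requests)
    # Treat the observed window as at least one minute to avoid dividing by zero.
    window_minutes = max((last - first).total_seconds() / 60.0, 1.0)

    return {
        "error_rate": errors / len(requests),
        "requests_per_minute": len(requests) / window_minutes,
        "avg_response_ms": sum(r.duration_ms for r in requests) / len(requests),
    }


start = datetime(2022, 3, 20, 12, 0, 0)
sample = [
    Request(start, 200, 120.0),
    Request(start + timedelta(seconds=30), 500, 300.0),
    Request(start + timedelta(seconds=60), 200, 90.0),
    Request(start + timedelta(seconds=120), 200, 90.0),
]
metrics = application_metrics(sample)
print(metrics)
```

System metrics, by contrast, normally come from the operating system or an infrastructure agent rather than from the application's own request log.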


Agent vs agentless


Sometimes an agent has to be deployed on your system (e.g. Splunk forwarder, AppDynamics app agents). In other cases no dedicated agent is needed: Prometheus, for example, uses an HTTP pull model to populate a time-series database, scraping metrics that the target (or an exporter running next to it) exposes.
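As a rough illustration of the pull side, the sketch below exposes two counters over HTTP in the Prometheus text exposition format, using only the Python standard library. The metric names, the `COUNTERS` dictionary, and the `serve` helper are hypothetical; a real service would more likely use an official client library than hand-rolled rendering.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real application would update these as it works.
COUNTERS = {
    "app_requests_total": ("Total requests handled.", 42),
    "app_errors_total": ("Total requests that failed.", 3),
}


def render_metrics(counters: dict) -> str:
    """Render counters as HELP/TYPE/value lines in the text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics; the monitoring system decides when to call it."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve /metrics until interrupted (deliberately not called here)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(COUNTERS), end="")
```

Calling `serve()` would make the counters available at `/metrics` on the chosen port. The application only answers requests; it never needs to know where the monitoring system lives.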


Push vs. Pull monitoring


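In the push model, the agents send their data to the monitoring system, whereas in the pull model the monitoring system fetches the data from the monitored targets. The key difference is who initiates the transfer and who needs to know about whom. In a pull-based system such as Prometheus, the server needs a list of targets but knows nothing or very little in advance about the metrics that come in. In centrally configured systems such as Nagios and Zabbix, the central monitoring system knows quite a lot about each check or metric. Note that the categories are not strict: Nagios and Zabbix each support both server-initiated and agent-initiated checks, and the Datadog agent pushes its data to the Datadog backend.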
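The difference in who initiates the transfer can be sketched with two small in-memory collectors. Everything here (`PushCollector`, `PullCollector`, `web_metrics`, the target name `web-1`) is a hypothetical stand-in: function calls replace the network requests a real system would make.

```python
from typing import Callable, Dict

Sample = Dict[str, float]


class PushCollector:
    """Push model: agents call receive() whenever they have data."""

    def __init__(self) -> None:
        self.latest: Dict[str, Sample] = {}

    def receive(self, source: str, sample: Sample) -> None:
        # The collector needs no prior knowledge of the source.
        self.latest[source] = sample


class PullCollector:
    """Pull model: the collector holds a target list and asks each target."""

    def __init__(self, targets: Dict[str, Callable[[], Sample]]) -> None:
        self.targets = targets
        self.latest: Dict[str, Sample] = {}

    def scrape(self) -> None:
        # The collector decides when to collect and from whom.
        for name, read_metrics in self.targets.items():
            self.latest[name] = read_metrics()


def web_metrics() -> Sample:
    # Stand-in for reading a target's current metrics.
    return {"cpu_percent": 12.5}


push = PushCollector()
push.receive("web-1", web_metrics())   # the agent initiates the transfer

pull = PullCollector({"web-1": web_metrics})
pull.scrape()                          # the collector initiates the transfer
```

Both collectors end up with the same data. What changes is where the configuration lives: the push agent must know the collector's address, while the pull collector must know its targets.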






Monitoring can mean a lot of things


A piece of advice: monitoring can differ from one company to another; nothing is written in stone.





One way to assess observability in an organization is to check the following aspects:


  • Alerting: How many alerts are generated per week? What percentage of alerts are handled “out of hours”?
  • Monitoring system configuration: Is the monitoring system under version control? How many Pull Requests/Change Requests are made to the repository containing the monitoring system?
  • On-call rotation: Are the alerts fairly distributed and addressed by all teams (see “Guide to understand your OPS”)?
  • “The phone should not ring” - M.B, because “Paging a human is a quite expensive use of an employee’s time. If an employee is at work, a page interrupts their workflow. If the employee is at home, a page interrupts their personal time, and perhaps even their sleep” - Google SRE book
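The two alerting questions in the checklist can be answered with a few lines of arithmetic once alert timestamps are available. The sketch below is one possible way to do it. The definition of “out of hours” (weekends, or outside 09:00-18:00 on weekdays) and the sample timestamps are assumptions made for illustration; each organization will have its own.

```python
from collections import Counter
from datetime import datetime


def is_out_of_hours(ts: datetime) -> bool:
    """Assumed definition: weekends, or outside 09:00-18:00 on weekdays."""
    return ts.weekday() >= 5 or not (9 <= ts.hour < 18)


def alert_stats(alerts: list[datetime]) -> dict:
    """Count alerts per ISO week and the share handled out of hours."""
    # Key each alert by (ISO year, ISO week number).
    per_week = Counter(tuple(ts.isocalendar()[:2]) for ts in alerts)
    out_of_hours = sum(1 for ts in alerts if is_out_of_hours(ts))
    return {
        "alerts_per_week": dict(per_week),
        "out_of_hours_pct": 100.0 * out_of_hours / len(alerts) if alerts else 0.0,
    }


# Made-up sample data: three alerts in one week, one in the next.
alerts = [
    datetime(2022, 3, 14, 10, 0),   # Monday, working hours
    datetime(2022, 3, 15, 23, 30),  # Tuesday night
    datetime(2022, 3, 19, 14, 0),   # Saturday
    datetime(2022, 3, 22, 11, 15),  # Tuesday, working hours
]
stats = alert_stats(alerts)
print(stats)
```

Tracking these two numbers over time shows whether the phone is ringing less often, and for whom.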