“If you can’t measure it, you can’t improve it” - some famous person

Understanding the differences

Monitoring: understand the state of a system, based on gathering predefined sets of metrics or logs.
Observability: infer the state of a system, based on exploring properties and patterns not defined in advance.

Why do we need monitoring?

Monitoring should address two questions: what’s broken, and why? What vs. why is one of the most important distinctions in doing good monitoring with maximum signal and minimum noise.

Why do we need observability?

Basically, monitoring relies on capturing and displaying data, which provides a restricted view of the system, whereas observability can anticipate the system's health based on the data it generates (logs, metrics, traces).

Lots of software jobs (especially SRE) involve different monitoring tech stacks; one might argue that you can make a living just from mastering those specific technologies. The tooling landscape can look overwhelming at first glance, especially since each technology comes with its own nomenclature: forwarder, indexer, exporter, data source, controller, etc. To navigate all of this, we need to know the basics.

System metrics vs. application metrics

Usually, system metrics capture infrastructure-related measurements such as CPU and memory consumption, disk I/O, and network I/O, whereas application metrics refer to error rates, requests per minute, and average response times.

Agent vs. agentless

Sometimes an agent has to be deployed on your system (e.g. Splunk forwarder, AppDynamics app agents); in other cases there’s no need for an agent, for example Prometheus, which uses an HTTP pull model to populate a time-series database.

Push vs. pull monitoring

In the push model, the agents push their data to the monitoring system, whereas in the pull model the monitoring system pulls data from the agents.
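
To make the direction of data flow concrete, here is a minimal in-memory sketch of both models. It is only an illustration: the `Agent`, `PullMonitor` and `PushMonitor` classes, the host names and the metric values are all hypothetical, and no real monitoring product or network call is involved.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical monitored host that can report its own metrics."""
    name: str
    metrics: dict[str, float]

    def read_metrics(self) -> dict[str, float]:
        # Pull model: the agent only answers when it is asked.
        return dict(self.metrics)

    def push_to(self, monitor: PushMonitor) -> None:
        # Push model: the agent decides when to send its data.
        monitor.receive(self.name, self.read_metrics())


@dataclass
class PullMonitor:
    """Central system that polls a configured list of targets."""
    targets: list[Agent]
    store: dict[str, dict[str, float]] = field(default_factory=dict)

    def scrape_all(self) -> None:
        for agent in self.targets:
            self.store[agent.name] = agent.read_metrics()


@dataclass
class PushMonitor:
    """Central system that stores whatever the agents send."""
    store: dict[str, dict[str, float]] = field(default_factory=dict)

    def receive(self, source: str, metrics: dict[str, float]) -> None:
        self.store[source] = metrics


# Made-up hosts and values, for illustration only.
web = Agent("web-1", {"cpu_percent": 42.0})
db = Agent("db-1", {"cpu_percent": 17.5})

pull = PullMonitor(targets=[web, db])
pull.scrape_all()          # the monitor initiates the transfer

push = PushMonitor()
web.push_to(push)          # each agent initiates the transfer
db.push_to(push)
```

Both monitors end up holding the same data; what differs is which side starts the conversation.
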
The key difference between the two is what the central monitoring system has to know in advance. In the pull-based approach (Prometheus, and Nagios with its default server-initiated checks) the central system must be configured with the targets it collects from, whereas in the push-based approach (Datadog agents; Zabbix supports both modes) the central system needs to know little or nothing beforehand about the metrics that are coming in.

Tooling landscape

Metric collection: Prometheus, Stackdriver, InfluxDB
Log aggregation: Fluentd, Logstash
Tracing: OpenTelemetry, Jaeger, Zipkin
Performance monitoring: AppDynamics, New Relic, Dynatrace
Dashboarding and visualization: Grafana, Kibana

Monitoring can mean a lot of things

As a piece of advice, keep in mind that monitoring may differ from one company to another; nothing is written in stone. One way to measure observability in an organization is to check the following aspects:

Alerting: How many alerts are generated per week? What percentage of alerts are handled “out of hours”?
Monitoring system configuration: Is the monitoring system under version control? How many Pull Requests/Change Requests are made to the repository containing the monitoring system?
On-call rotation: Are the alerts fairly distributed and addressed by all teams (Guide to understand your OPS)?

“The phone should not ring” - M.B, because “Paging a human is a quite expensive use of an employee’s time. If an employee is at work, a page interrupts their workflow. If the employee is at home, a page interrupts their personal time, and perhaps even their sleep” - Google SRE book
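
The alerting and on-call questions listed above come down to simple arithmetic once the alerts are exported somewhere. The sketch below is one possible way to compute them; the alert log, the team names, and the definition of office hours (Monday to Friday, 09:00 to 18:00) are assumptions made up for the example, not data from any real system.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert log: (time the alert fired, team that handled it).
alerts = [
    (datetime(2024, 3, 4, 10, 15), "payments"),  # Monday, office hours
    (datetime(2024, 3, 5, 2, 40), "payments"),   # Tuesday night
    (datetime(2024, 3, 6, 14, 5), "search"),     # Wednesday, office hours
    (datetime(2024, 3, 9, 11, 30), "payments"),  # Saturday
]


def is_out_of_hours(ts, start_hour=9, end_hour=18):
    """Assumed definition: anything outside Mon-Fri, 09:00-18:00."""
    is_weekend = ts.weekday() >= 5
    return is_weekend or not (start_hour <= ts.hour < end_hour)


# Alerts per ISO (year, week).
per_week = Counter(tuple(ts.isocalendar())[:2] for ts, _ in alerts)

# Percentage of alerts handled out of hours.
out_of_hours = sum(is_out_of_hours(ts) for ts, _ in alerts)
out_of_hours_pct = 100.0 * out_of_hours / len(alerts)

# Alerts per team, as a rough view of how the on-call load is shared.
per_team = Counter(team for _, team in alerts)

print(dict(per_week))      # {(2024, 10): 4}
print(out_of_hours_pct)    # 50.0
print(dict(per_team))      # {'payments': 3, 'search': 1}
```

Tracking these numbers over time, rather than looking at a single week, is what shows whether the phone is ringing less often.
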