If you use any metrics, logs, or traces, that telemetry is persisted in some data source. If your primary use case is analytics over long-term data, data source latency may not be a significant concern to you (assuming you’re okay with waiting for the data!). But if you’re using that data source to monitor your systems by looking at dashboards, searching, alerting and so on, you should pay close attention to its latency.

  • Searching for error logs.
  • Checking traces.
  • Looking at that pre-built Grafana dashboard you imported.

You’d like these operations to be fast, wouldn’t you? What you’re really aiming for is low read latency. These are not the only cases where you’re reading data, though.

I’ve built and run alerting systems for a few years. They can be extremely finicky, and I hope this post makes it easier to understand some aspects of them.

Can your alerting read fast enough?

Any alerting on data needs to read it, so fast alert evaluation requires fast data retrieval. Complex queries, or queries over long time ranges, can often take seconds. And this latency covers only one part of the overall alerting pipeline: the fetched data still needs to be evaluated against the configured rule, an alert created if the conditions are met, and that alert sent out. The overall MTTD (Mean Time To Detect) thus depends on the time taken by each of these steps. Worse, if the read latency exceeds the read timeout, the alert evaluation is missed entirely!
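
To make that pipeline concrete, here is a minimal sketch in Python of one alert evaluation cycle with a read timeout. The data source read, the rule threshold, and the notification call are hypothetical placeholders; the point is that detection time is the sum of the read, evaluation, and notification steps, and that a read exceeding its timeout loses the evaluation cycle entirely.

```python
import time

READ_TIMEOUT_S = 10  # hypothetical read timeout for the data source query


def fetch_requests_per_minute(timeout_s: float) -> float:
    """Placeholder for the actual data source read (e.g. a range query over HTTP)."""
    raise NotImplementedError


def notify(message: str) -> None:
    """Placeholder for alert delivery (pager, chat, email, ...)."""
    raise NotImplementedError


def evaluate_rule() -> None:
    started = time.monotonic()
    try:
        # Read step: dominated by data source read latency.
        rpm = fetch_requests_per_minute(timeout_s=READ_TIMEOUT_S)
    except TimeoutError:
        # Read latency exceeded the timeout: this evaluation is missed entirely.
        print("read timed out; skipping this evaluation cycle")
        return

    # Evaluation step: apply the configured condition.
    if rpm < 50:
        # Notification step: adds its own latency on top of read + evaluation.
        notify(f"requests per minute dropped to {rpm:.0f}")

    # MTTD is bounded below by read + evaluation + notification time,
    # plus however long we wait between evaluation cycles.
    print(f"evaluation took {time.monotonic() - started:.2f}s")
```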

Sidenote: Unless you need to, please don’t set your Grafana metrics dashboards to auto-refresh!

  1. Every auto-refresh makes a read request to the data source for ALL panels, even if the data hasn’t changed.
  2. Every one of those read requests covers the full time range selected for the dashboard, so even if you only care about one panel, you’re making far more calls than needed, over an unnecessarily large time range (rough numbers in the sketch below this list).
  3. If you do have to turn auto-refresh on, set it to your scrape interval (the periodic interval at which data is collected); you won’t get new data any sooner than that. Also keep the time range as small as you can, so less data is queried overall.
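
To put rough numbers on point 2, here is a quick back-of-the-envelope calculation; the panel count and intervals are made up purely for illustration.

```python
panels = 20             # every panel queries the data source on each refresh
refresh_s = 10          # an aggressive auto-refresh setting
scrape_interval_s = 60  # data only changes once per scrape

queries_per_hour = panels * (3600 // refresh_s)
useful_queries_per_hour = panels * (3600 // scrape_interval_s)

print(f"{queries_per_hour} queries/hour issued")                 # 7200
print(f"{useful_queries_per_hour} could ever return new data")   # 1200
```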

Are you even getting the data?

Consider an alert rule that says “raise an alert if requests per minute < 50 for 1 minute”.

If the rule looks at the latest data, then when it runs at 8:15 PM, it expects data for 8:15 PM.

However, it is very unlikely that you’re actually ingesting data in real time, especially with a pull mechanism like a standard Prometheus setup. Push mechanisms (such as OpenTelemetry) can also batch requests before sending them. And then there is the write to the data source itself, before the telemetry can be consumed at all.

Be wary of write latency/ingestion lag, especially if your alerts are real-time or nearly real-time (as in the case above, where the rule evaluated at 8:15 PM uses data from 8:15, or even 8:14 PM). Even if your alerts usually work “fine” under these strict time bounds, ingestion lag can lead to missed rule evaluations, or incorrect ones.
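
As a simplified illustration of the 8:15 PM example, here is a sketch of a freshness check before evaluating the rule. The timestamps and the data source lookup are simulated, hypothetical values.

```python
from datetime import datetime, timedelta, timezone


def latest_sample_time() -> datetime:
    # Placeholder: in reality, ask the data source for its newest ingested sample.
    # Simulated here as 8:13 PM to mimic a ~2 minute ingestion lag.
    return datetime(2024, 1, 1, 20, 13, tzinfo=timezone.utc)


def evaluate_at(eval_time: datetime) -> None:
    lag = eval_time - latest_sample_time()
    if lag > timedelta(minutes=1):
        # The rule expects data for ~eval_time, but ingestion is behind;
        # evaluating now would use a partial (or empty) window and could
        # mis-fire, or silently never fire.
        print(f"ingestion lag of {lag}; evaluation unreliable")
        return
    # ... otherwise fetch the window ending at eval_time and apply the rule ...


# A rule evaluated at 8:15 PM expects data for 8:15 PM, but the newest
# ingested sample is from 8:13 PM.
evaluate_at(datetime(2024, 1, 1, 20, 15, tzinfo=timezone.utc))
```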

The ingestion lag may not be due to a badly performing data source at all, but simply the nature of the telemetry. A very good example of this is AWS CloudWatch metrics, which may be ingested with a two-minute lag (link to the AWS Blog):

Due to some implementation/architecture limitations, metric data may always be ingested in CloudWatch with a two-minute delay, so the alarm never initiates.

Some ways to work around this are:

  • Offset the alert rule by an acceptable duration, say, 2 minutes. The rule now considers the data received 2 minutes ago as the “latest”.
  • Evaluate over a longer period before raising an alert. Tweaking the rule above to “raise an alert if requests per minute < 50 for 5 minutes” makes it more resilient. Note that for this to work without an offset, the query needs to cover a range larger than the last 5 minutes: if it only queries the latest 5 minutes and ingestion lags by 2 minutes, it actually gets just 3 minutes of data, and the alert never fires! A longer evaluation period is good practice in general for avoiding alert noise, particularly when only a sustained failure needs attention (see the sketch after this list).
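
To make the arithmetic in the second workaround explicit, here is a small sketch. The 2-minute lag and the 5-minute window come from the examples above; everything else is hypothetical.

```python
from datetime import datetime, timedelta, timezone

INGESTION_LAG = timedelta(minutes=2)  # e.g. the CloudWatch-style delay above
FOR_DURATION = timedelta(minutes=5)   # "requests per minute < 50 for 5 minutes"


def usable_window(now: datetime, offset: timedelta) -> timedelta:
    """How much of the queried range actually contains ingested data."""
    window_end = now - offset                 # the offset shifts "latest" back in time
    window_start = window_end - FOR_DURATION  # the rule needs FOR_DURATION of data
    newest_available = now - INGESTION_LAG
    return max(min(window_end, newest_available) - window_start, timedelta(0))


now = datetime.now(timezone.utc)
print(usable_window(now, offset=timedelta(0)))          # 0:03:00 -> rule can never hold
print(usable_window(now, offset=timedelta(minutes=2)))  # 0:05:00 -> full window available
```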

There is another scenario that can occur on the write path: backpressure buildup, where the data source is unable to accept all the incoming data and queues start to grow. That’s why it’s a good idea to monitor your data source itself (if you can; you may not have access to its telemetry in some cases, particularly if you rely on a third-party provider). That telemetry should go to an independent data source rather than the primary one: sending it to the very data source it describes creates a cyclic dependency, and you may miss exactly the issues you were hoping to catch.
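
As a sketch of that idea, here is a tiny watchdog that probes the primary data source and reports the result to an independent system. The endpoints are hypothetical placeholders; the only point being made is that the report goes somewhere other than the data source being watched.

```python
import urllib.request

# Hypothetical endpoints, for illustration only.
PRIMARY_HEALTH_URL = "http://primary-tsdb.internal/-/healthy"
INDEPENDENT_SINK_URL = "http://meta-monitoring.internal/report"


def primary_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def report(healthy: bool) -> None:
    # Report to an independent system, never back to the primary data source:
    # if the primary is struggling, telemetry stored inside it may never be seen.
    payload = b"primary_healthy 1" if healthy else b"primary_healthy 0"
    req = urllib.request.Request(INDEPENDENT_SINK_URL, data=payload, method="POST")
    try:
        urllib.request.urlopen(req, timeout=5)
    except OSError:
        pass  # and the watchdog itself should be monitored, too


report(primary_is_healthy())
```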

Summary

  • High read/write latency may lead to incorrect/delayed/no alerts.
  • Monitoring your data source is a recommended practice to avoid blind spots in overall system health.

Feel free to reach out if you’d like to discuss this further!