Tackling alert noise - Lessons from building and using an alerting system
If you’ve dealt with alerts for any non-trivial amount of time, you’ve likely dealt with “alert noise”. Too many alerts. Breaking flow state. Or worse, breaking sleep. Breaking either isn’t nice (Breaking Bad. ba dum tss!). It’s bad not just for your sanity, but for the very purpose of setting alerts in the first place – monitoring (or debugging. We’ll get to the difference) (No, leave my beloved em dash alone. I added it myself. There is no AI in this blog).
Having built an alerting system for serving enterprises, dogfooding it to monitor our own systems, and looking at the alerts set up by folks, I’ve gathered some… lessons. I hope they can serve you well. Most concepts are universal and apply regardless of the type of alert – metrics/events/logs/traces. The overarching goal is simple:
Your time is crucial.
Different notifications for different importance
Not all alerts are equal. Not all of them should be configured to trigger PagerDuty/OpsGenie and disturb you with a phone call. The rule of thumb is:
“If this is triggered, should I be woken up at 3 am to look at this?”
No? Then it can wait. It can go to a channel of your liking as a message to be checked LATER (email, Slack, JIRA, PagerDuty without the phone call, pick your poison). Your time is crucial. These high priority “phone call” alerts should be the exception, not the norm. If you’re “used to” high priority alerts, your problem isn’t the alerts. It’s an unstable, unpredictable system that needs urgent fixing.
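As a rough sketch of how this routing can look with Prometheus and Alertmanager (the rule names, receivers, and thresholds below are hypothetical, and the receivers would be defined elsewhere in the Alertmanager config), each rule carries a severity label and Alertmanager routes on it:

```yaml
# Prometheus rules file: tag each alert with a severity label.
groups:
  - name: example-alerts
    rules:
      - alert: ServiceDown              # "wake me up at 3 am" territory
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: page
      - alert: ElevatedErrorRate        # worth knowing, but it can wait
        expr: sum(rate(status_5xx[5m])) by (service) > 1
        for: 10m
        labels:
          severity: ticket
```

```yaml
# Alertmanager routing: only severity=page reaches the on-call phone.
route:
  receiver: slack-later                 # default: a channel checked later
  routes:
    - matchers:
        - severity = "page"
      receiver: pagerduty-call
```

Everything without the page severity lands in the default channel, which is exactly the “check it later” bucket.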
I emphasise “later” because it brings us to the next point.
Clean up alerts regularly
If an alert isn’t urgent and is to be checked “later”, there are very high chances that “later” never comes. To quote Gorillaz, tomorrow comes today.
There’s a scheduled meeting. There’s an urgent feature to ship out today. There’s a “quick call”.
Things happen all the time. It’s easy to forget those “later” alerts. But letting them pile up desensitises you to them. Soon, you stop even looking at the channel because “nothing ever really happens”. Until one day you’re doing an RCA and realise there were leading indicators of the issue that were missed in the clutter.
Set up a regular cadence (say, weekly) to review alerts. If an alert hasn’t been addressed or isn’t being looked at, remove it. Keep your channels clear of unnecessary alerts. Your time is crucial.
Cleaning up is one part. You’ll also want to
Review and reconfigure alerts
Good monitoring is a proactive process, not a one-and-done activity. As you get, or equally importantly, do NOT get, the alerts you expect, ask whether they could be better.
- “We got the high memory alert 10 minutes after it crossed 80%, by which time it was already at 90%.”
- “Hmm, this alert on 60% CPU triggers too often and we don’t really do anything about it.”
- “Oh, we missed the OOM-killed container because the alert waits too long before firing.”
Fine-tune your alerts so they fire sooner or later as needed and are more useful. As incidents inevitably happen, you may learn new failure modes. Set up alerts for them. New services and infrastructure onboarded? You guessed it, new alerts. Onboarding isn’t complete until you can tell how your system is behaving.
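In practice, much of this tuning is just nudging thresholds and wait durations on existing rules. A minimal sketch of the memory example, assuming node_exporter-style metrics (the exact numbers are illustrative):

```yaml
# (Inside a Prometheus rules file, under groups[].rules)

# Before: fires at 80% memory usage, but only after 15 minutes, often too late.
- alert: HighMemoryUsage
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.80
  for: 15m

# After: a slightly lower threshold and a shorter wait give an earlier heads-up.
- alert: HighMemoryUsage
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.75
  for: 5m
```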
Keep alerts as light as you can
It’s possible to write alerts that notify you for each service, endpoint, pod, availability zone, etc. Such alerts can be extremely useful for minimising alert configuration and keeping things in one place. They can also inadvertently drown your entire monitoring setup in noise.
Consider the following PromQL on the custom metric `status_5xx`, which has the labels `zone`, `service`, `endpoint`, and `pod`:
status_5xx > 0
Because there is no filtering by labels, and no aggregation either, this PromQL will return ALL timeseries with value > 0. That may not matter with a few labels and values, but cardinality climbs rapidly. With 2 zones, 5 services in each zone, 3 pods per service, and 50 endpoints per service, the possible number of combinations reaches 2 × 5 × 3 × 50 = 1500. If this is what your alert rule looks like, in the worst case you will end up with 1500 alerts firing all at once. Sounds a bit… excessive.
You can’t go through 1500 alerts when an incident is ongoing. Your time is crucial. Plus, in this flurry of alerts, you will, in all likelihood, miss other alerts that went off.
A simpler alert would be
sum(increase(status_5xx[1m])) by (zone, service) > 0
Here, we have aggregated over the `zone` and `service` labels to see new 5xx errors per minute. The maximum possible number of timeseries here is 2 zones × 5 services = 10. Much more manageable than 1500. When an alert comes, you will see which zone and service is affected. For drilling down further, you can aggregate by endpoint and pod if required. But they are no longer part of the alert, hence reducing the “noise”.
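When that alert fires for a particular zone and service, a quick follow-up query can do the drill-down on demand (the label values here are hypothetical):

```
# Break down recent 5xx errors for the affected zone/service by endpoint and pod.
sum(increase(status_5xx{zone="zone-a", service="checkout"}[1m])) by (endpoint, pod)
```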
Have alerts on leading indicators where possible
Leading indicators are early warning signals of a potential incident. Handling the situation before it reaches the user is the best case scenario, and the essence of monitoring - knowing your systems and being proactive. If you’re resolving something after it has happened, you’re debugging and figuring out what went wrong. Logs and traces are mostly lagging indicators, since they’re generated after something has happened. Metrics, however, can often be leading indicators.
- Free disk space decreasing? What will happen when it reaches zero?
- API latency suddenly increasing? Is the database ok?
- Memory usage spike? Was it a burst of requests, or does some API have a memory leak?
Of course, not everything has a leading indicator. A node suddenly going down is likely out of your hands. But where possible, use leading indicators to help yourself. Your time is crucial.
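For the disk example above, one common pattern is alerting on the trend rather than the current value, e.g. with PromQL’s predict_linear over node_exporter’s filesystem metric (the mountpoint and time windows below are illustrative):

```
# Fire if, extrapolating the last 6 hours, the filesystem will be full within 4 hours.
predict_linear(node_filesystem_avail_bytes{mountpoint="/data"}[6h], 4 * 3600) < 0
```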
Long-term health? Try Service Level Objectives (SLOs) instead
If you don’t want to be pinged every single time a few requests fail, or latency is slightly higher than usual, then SLOs might be better for you.
SLOs describe the performance of a service over a sustained period of time. They tolerate a certain degree of “badness” before they are broken. For example, a request-based SLO of “99% of all requests in an hour must have latency <200ms”, or a window-based SLO of “In a day, availability should be 99.9% for at least 99.5% of 1-minute windows” (it’s a mouthful, I know).
SLOs like this have the benefit of being applicable across different conditions without any changes. For example, the request-based SLO works whether there are 100 requests in an hour or 10000; the number of high-latency requests “allowed” scales correspondingly. Similarly, the window-based SLO divides the day into 1440 1-minute windows, of which 7 can have lower availability than the target. This avoids raising repeated alerts for random small drops in availability, unless the drop is sustained or happens frequently.
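As a hedged sketch of the request-based example, assuming latencies are exported as a Prometheus histogram named http_request_duration_seconds with a bucket boundary at 0.2s (both assumptions), the SLO condition over the last hour could look like:

```
# Fraction of requests in the last hour that completed in under 200ms,
# compared against the 99% target.
  sum(rate(http_request_duration_seconds_bucket{le="0.2"}[1h]))
/
  sum(rate(http_request_duration_seconds_count[1h]))
< 0.99
```

Because the comparison only matches when the hourly target is actually missed, one-off slow requests don’t page anyone.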
Selecting the type of SLO depends heavily on your requirements, but done correctly, SLOs can reduce the number of alerts you set up. As a side-effect, SLOs are amazing for tracking service performance. Setting aggressive SLOs internally also helps avoid breaching SLAs (Service Level Agreements) with other parties.
Whoa really Ujjwal? You’re telling me Service Level Objectives are great for tracking objectives at the service level? Who would’ve thought.
You know what else would be great? If you LIKE SHARE SUBSCR… Oh wait this isn’t YouTube. Anywhoo. Hope this helps your alerting journey. Stay tuned for more. Feel free to reach out on socials!