Hello experts, I have an alert that fires when CPU utilization is > 95%. It worked quite well on 8.x. Around Christmas I upgraded to 9.3.2; fortunately we had no incidents after that, but today we didn’t receive an alert when one of our servers was at 99% utilization and Grafana wasn’t receiving any data from it. I googled and found this: NO DATA alert is not firing · Issue #60283 · grafana/grafana · GitHub, so I upgraded to 9.4.3, but it still doesn’t work.
My query covers multiple servers, so I tested with just the server that was at fault and it kind of worked: it alerted once and then went back to Normal (even though there is still No Data), which in itself is also strange. However, it’s not practical for me to create a separate alert for every server we have.
Hi @ravikiranswe! A couple questions might help get to the bottom of this:
- Did you migrate from legacy alerting to the new Grafana Alerting during any of these upgrades or were you already using Grafana Alerting on 8.x?
- Can you share your alert rule configuration? Either with a screenshot or an export.
Nothing stands out from the query itself, I’ll probably need more information on the rest of the alert rule definition. Things such as:
- If this is a Grafana-managed alert, the rest of the expressions / the alert condition. Is it a classic condition, a reduce, or something else?
- The configured No Data behaviour:
The above should give a better idea of what the expected behaviour will be when the datasource returns no data.
Other information that would help is any logs (you might need to enable debug logging) relating to that alert changing state / firing.
One more thing I noticed (when I tested with an individual alert per server): for No Data, even though the rule says wait for 3 min, it waits for 10 minutes and then sends the No Data alert. It looks like the query has to be empty in Query & Results before it sends the alert.
As opposed to an alert changing state from Normal to Firing, the timings for when notifications are actually sent depend on the notification policy. What are the timings and group by for the matching policy?
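For reference, a notification policy carries Alertmanager-style timings roughly like this (the values below are only illustrative defaults, not your configuration):

```yaml
route:
  group_by: ["alertname", "instance"]  # how alert instances are bundled into one notification
  group_wait: 30s      # how long to wait before sending the first notification for a new group
  group_interval: 5m   # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 4h  # how long to wait before re-sending a notification that is still firing
```

A long group wait or group interval on the matching policy can easily account for a notification arriving several minutes after the alert started firing.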
For the “Query not available” bug, this has been fixed in v9.4.7: https://github.com/grafana/grafana/pull/64198
Attaching screenshot of another example rule
- So this query returns the number of processors for 8 of our test servers
- Next I logged into one of the servers and stopped the windows_exporter (simulating a high CPU)
- Well then nothing happened. No alert was ever sent
- The timing is that it runs every 1 min and waits for 3 minutes. There is nothing else nested anywhere
- After another 15 minutes I checked and the server just disappeared from the list, that’s it.
- My expectation was that I would receive a No Data alert, which works when there is only 1 server in the alert condition
@mjacobson any update? It is frankly quite easy to simulate the problem. I upgraded to the latest Grafana and was able to reproduce it again.
I don’t want to create 40+ alerts for production and 100+ for our test environments. If I am doing this the wrong way, please advise.
This is really blocking us from making Grafana our official monitoring tool. Please get back to me if you plan on looking into it; otherwise I need to look for alternative monitoring tools.
Hi! I think there is some confusion about how multi-dimensional alerting works.
When there are multiple series and one of those series “disappears”, Grafana does not send a No Data alert, because one series “disappearing” while the other series still exist means that the “disappeared” series is now considered resolved. If, however, the entire query returns no series at all, then the alert goes to No Data because there are no series of any kind.
This is more or less modelled on how alerting works in Prometheus, and Grafana borrows a lot of design choices from Prometheus. A missing series is not a firing alert in Prometheus either.
Perhaps you can try something like this (Absent Alerting for Scraped Metrics – Robust Perception | Prometheus Monitoring Experts); it should work in both Prometheus and Grafana!
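As a rough sketch of that absent-style approach (metric and job names here are placeholders, not your actual ones):

```yaml
groups:
  - name: cpu-metrics-presence
    rules:
      # Fires when the query returns no series at all for this metric/job,
      # i.e. nothing is being scraped any more.
      - alert: CpuMetricsAbsent
        expr: absent(windows_cpu_time_total{job="windows"})
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No CPU metrics are being received for job 'windows'"
```

Note that absent() only tells you the metric is missing entirely; it cannot tell you which individual server’s series vanished.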
Resolved is resolved, how can a “disappearing” / missing series be resolved? It sounds more like a limitation. Anyway, what you are suggesting is: when a server gets so busy that Prometheus is not able to scrape it, or the server doesn’t respond to a scrape, I should use the up metric instead. I need to check if that works.
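If I understand it right, that would be something along these lines (untested on my side; the job label is just a placeholder):

```yaml
groups:
  - name: scrape-target-health
    rules:
      # One alert instance per scrape target that stops responding.
      - alert: TargetDown
        expr: up{job="windows"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is not responding to scrapes"
```

Because up is produced by Prometheus itself for every configured target, the series doesn’t disappear when the exporter stops; it just flips to 0, so this stays one rule with one alert instance per server.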
If a series “disappears” and never comes back, it is resolved. Take for example a K8s pod that has been migrated to another server because the first server failed. Unless it is part of a StatefulSet, the containers created on the other server will have different IDs, while the containers on the original server will “disappear” along with their IDs.
In the case of Prometheus metrics this looks like the old series (N9fdv1 and W1NQXN in the following diagram) disappearing forever and new series (fHQcDf and oKZt8U) being created.
| Container ID | Metric (time --->)  |
|--------------|---------------------|
| N9fdv1       | 1 2 3 4 x x x x ... |
| W1NQXN       | 1 2 3 4 x x x x ... |
| fHQcDf       | x x x x 1 2 3 4 ... |
| oKZt8U       | x x x x 1 2 3 4 ... |
If you know that missing series will be reconciled (i.e. the series comes back after 10 minutes) you can increase the time window of your query and even tell Grafana to fill the gaps with 0.
For example, here a series (in yellow) disappeared for about 3 minutes, but because the window on the alert is 10 minutes, there is still data in the time range. You can increase this window to tolerate missing data up to a known upper bound.
You can also increase the lookback used in the rate (here from 1m to 5m) to avoid gaps, provided you know there is an upper bound on how long a series can be missing before it comes back.
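As a sketch of what that looks like in a rule (the expression is illustrative, not your exact CPU query):

```yaml
groups:
  - name: cpu-utilization
    rules:
      # Widening the rate lookback from 1m to 5m so a short gap in the
      # series does not leave the query window empty.
      - alert: HighCpuUtilization
        expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 95  # was [1m]
        for: 3m
```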