Let’s say that I have a metric which shows me CPU Utilization and if the utilization crosses 80%, an alert should be fired.
I have created an alert which evaluates my query every 1m for 5m.
This basically means that my query will be evaluated 5 times in 5 mins (1m + 1m + 1m + 1m + 1m) before it sends an alert to me.
Now, let’s say that for the first 3 evaluations, CPU utilization was above 80%, but for the next 2 evaluations it went down to 60%. This tells me that in the last 5 mins, for the majority of the time (3 out of 5 times), my CPU utilization was higher than I wanted.
What I need is a way for Grafana to fire an alert on the basis of that majority. If my condition is violated 3 out of 5 times, then fire an alert; otherwise I don’t need any alert.
This is more or less what the average function does in a Reduce expression, but instead of looking at the past 5 evaluations, it takes the average of the data points returned by the query. This means that you can write a query that averages the CPU usage over the last 5 minutes, in 1 minute intervals, and then alert on the average of those averages.
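To see where an average and a strict majority rule can disagree, here is a minimal Python sketch using the numbers from the question (three evaluations above 80%, then two at 60%). The sample value 85 and the function names are assumptions made up for illustration; none of this is a Grafana API.

```python
from statistics import mean

THRESHOLD = 80.0  # alert threshold in percent, taken from the question


def average_breaches(samples, threshold=THRESHOLD):
    """Return True if the mean of the samples is above the threshold."""
    return mean(samples) > threshold


def majority_breaches(samples, threshold=THRESHOLD):
    """Return True if more than half of the samples are above the threshold."""
    breaches = sum(1 for s in samples if s > threshold)
    return breaches > len(samples) / 2


# Hypothetical 1-minute samples: 3 above 80%, then 2 at 60%.
samples = [85.0, 85.0, 85.0, 60.0, 60.0]

print(average_breaches(samples))   # the mean is 75.0, which is below 80
print(majority_breaches(samples))  # 3 of the 5 samples are above 80
```

With these values the average stays under the threshold even though the majority of samples breached it, so an alert on the average only approximates the "3 out of 5" rule.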
Hi! You might need to change some of the dashboard queries when writing your alert query, as queries cannot always be copied 1:1 if the alert needs to do something different from the visualization.
Hi @georgerobinson, I have written (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance)) * 100, which calculates my CPU utilization %.
I have built an alert query on this using the Reduce and Math functions, and set the alert to evaluate every 1m for 5m.
Now, can you please help me with where to change the query in order to get this outcome?
Thanks @georgerobinson will try this. Also, is there any way in Grafana through which we can count the number of times my threshold was violated in a day?
Right now, the query of my panel is avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100, which tells me the idle % of my CPU.
I have set up an alert with the help of the Reduce function:
With the help of this, I am able to get the count of all the data points that can be seen in the specified time range (i.e., 5 mins which I have set).
But I don’t want to see all the data points here. As you can see, I have put a threshold at 99.78%; I just want to see the number of times my time series data crossed that threshold and then put an alert on that,
so that the alert can notify me how many times in the last 5 mins my threshold has been violated.
I have also tried to use a Classic condition, but the output was 0.
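The count described above (only the points above the threshold, not every point in the range) can be sketched in Python as follows. The sample values and the function name are hypothetical; the idea is simply to turn each point into 1 (above threshold) or 0 and then sum, and it is an assumption that the alert query and expressions can be arranged to do the same.

```python
THRESHOLD = 99.78  # idle % threshold mentioned in the post


def count_violations(points, threshold=THRESHOLD):
    """Count how many points are strictly above the threshold."""
    # Map each point to 1 (violation) or 0 (no violation), then sum the flags.
    flags = [1 if p > threshold else 0 for p in points]
    return sum(flags)


# Hypothetical idle % samples over a 5 minute window, one per minute.
idle_percent = [99.70, 99.80, 99.75, 99.81, 99.79]

total_points = len(idle_percent)             # what a plain count returns
violations = count_violations(idle_percent)  # what the post is asking for

print(total_points, violations)
```

A plain count reports every point in the window, while the flagged sum reports only the points that crossed the threshold, which is the number the alert would need to compare against.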