Hi,
I am trying to create Grafana alerts for Spring Boot metrics scraped by Prometheus. The use case is to alert on exceptions thrown by each service. I’m using the http_server_requests_seconds_count metric, and below is a breakdown of the PromQL query I’m using to build the graphs.
- First, I exclude all series that did not record an exception.
http_server_requests_seconds_count{application="my-service-1",exception!="None"}
- Next, I’ve applied the rate() function, since the raw metric is a monotonically increasing counter.
rate(http_server_requests_seconds_count{application="my-service-1",exception!="None"}[5m])
- Then I’ve used the following condition to trigger an alert. (I’m using the max() function because sum() and count() depend on the number of data points in the window, which is not what I need.)
WHEN max() OF query(A,5m,now) IS ABOVE 0.02
EVALUATE every 1m FOR 5m
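To make the threshold concrete, here is a rough Python sketch of how I understand the rate value relates to an exception count. It is a simplified model only: the real rate() also extrapolates to the window boundaries and handles counter resets, which are ignored here. The sample values and the 60-second scrape interval are made up for illustration.

```python
# Simplified model of PromQL rate(): per-second increase of a counter
# over a window. Extrapolation and counter resets are deliberately ignored.

def simple_rate(samples, window_seconds):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = samples[-1][1] - samples[0][1]
    return increase / window_seconds

# Hypothetical counter scraped every 60s; 6 exceptions occur in 5 minutes.
samples = [(0, 10), (60, 11), (120, 13), (180, 13), (240, 15), (300, 16)]

window = 300        # the [5m] range, in seconds
threshold = 0.02    # the "IS ABOVE 0.02" alert condition

rate_5m = simple_rate(samples, window)
exceptions_at_threshold = threshold * window

print(f"rate over 5m: {rate_5m:.4f} per second")
print(f"threshold corresponds to about {exceptions_at_threshold:.0f} exceptions per 5m")
```

Under this model, a threshold of 0.02 per second over a 5-minute window corresponds to roughly 6 exceptions, so a single exception would not trigger the alert.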
The above setup works fine and sends a notification whenever the alert condition is met. However, I’m facing several problems with this approach.
- I need the actual count of exceptions instead of a rate
I’ve tried the following approach to solve this, but it still gives a flat value until a new exception is thrown.
count_over_time(http_server_requests_seconds_count{application="my-service-1",exception!="None"}[5m])
- I’m getting a series for each exception, and unless the alerting state has gone back to OK, Grafana will not send a second notification when the condition is met by a different series.
I thought that if I could get a spike per exception for each series, with the graph staying at 0 the rest of the time, I could solve this issue. So I tried reducing the time interval for the rate() function, but it seems I can only reduce it down to 1 minute. Even though that mitigates the problem somewhat, whenever a second exception comes from another series within that 1 minute, no new notification is sent.
rate(http_server_requests_seconds_count{application="my-service-1",exception!="None"}[1m])
WHEN max() OF query(A,1m,now) IS ABOVE 0.02
EVALUATE every 1m FOR 0m
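On the first problem, my understanding is that count_over_time() counts how many samples fall inside the window, regardless of their values, which would explain the flat line. The Python sketch below contrasts that with the counter growth across the window, which is what I actually want. The 15-second scrape interval and the counter values are assumptions for illustration, and simple_increase is a hypothetical helper, not Prometheus code. PromQL's increase() looks closer to this second behaviour, although it extrapolates and can return non-integer values.

```python
# Contrast "number of samples in the window" with "growth of the counter".
# Both helpers are simplified models, not Prometheus internals.

def count_over_time(samples):
    """Number of samples in the window, whatever their values are."""
    return len(samples)

def simple_increase(samples):
    """Counter growth across the window (no extrapolation, no reset handling)."""
    if len(samples) < 2:
        return 0
    return samples[-1][1] - samples[0][1]

# Hypothetical 5-minute window scraped every 15s, giving 20 samples.
timestamps = range(0, 300, 15)

# No new exceptions: the counter stays at 42 for the whole window.
quiet = [(t, 42) for t in timestamps]

# Three exceptions are recorded halfway through the window.
busy = [(t, 42 + (3 if t >= 150 else 0)) for t in timestamps]

for name, window in (("quiet", quiet), ("busy", busy)):
    print(name, count_over_time(window), simple_increase(window))
```

In both windows the sample count is the same, while only the counter growth reflects the new exceptions.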
How can I address the above issues and get Grafana to alert per new exception and also send the count instead of a rate?
(I’m using Grafana v6.5.3)
I appreciate your kind help!