I’ve been experimenting with Grafana alerting on 5.1.0, with a t2.medium (2 cores, 4 GB RAM) grafana-server and a t2.medium Aurora backend. I’m running into a lot of scaling issues and I’m not sure what belongs on the issue board and what doesn’t.
My test setup:
I create a dashboard with 100 simple single-series graphs against the Metrictank backend, each with 1 alert. This should rule out graphite/Metrictank as a bottleneck, since Metrictank caches the query and returns quickly. I then ramp up, adding a 2nd, 3rd … 50th dashboard.
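For anyone who wants to reproduce this, the dashboards are easy to generate programmatically. Below is a rough sketch of the kind of script I use against the dashboard HTTP API (POST /api/dashboards/db). The URL, API key and datasource name are placeholders, and the panel/alert JSON is heavily abbreviated (no alert conditions), so treat it as a starting point rather than a working payload.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// one graph panel with a single series and a single alert attached to it
func panel(dash, id int) map[string]interface{} {
	return map[string]interface{}{
		"id":         id,
		"type":       "graph",
		"title":      fmt.Sprintf("test%d", id),
		"datasource": "metrictank", // placeholder datasource name
		"targets": []map[string]interface{}{
			{"target": fmt.Sprintf("benchmark.dash%d.series%d", dash, id)},
		},
		"alert": map[string]interface{}{
			"name":      fmt.Sprintf("Alerting Benchmark%d test%d alert", dash, id),
			"frequency": "60s",
			// conditions/notifications omitted for brevity; a real payload
			// needs a valid conditions block to be accepted
		},
	}
}

func main() {
	const grafanaURL = "https://grafana.example/api/dashboards/db" // placeholder
	const apiKey = "REDACTED"                                      // placeholder

	for d := 1; d <= 50; d++ { // ramp up: 1st, 2nd, ... 50th dashboard
		panels := make([]map[string]interface{}, 0, 100)
		for p := 1; p <= 100; p++ {
			panels = append(panels, panel(d, p))
		}
		body, err := json.Marshal(map[string]interface{}{
			"overwrite": true,
			"dashboard": map[string]interface{}{
				"title":  fmt.Sprintf("Benchmark%d", d),
				"panels": panels,
			},
		})
		if err != nil {
			log.Fatal(err)
		}
		req, err := http.NewRequest("POST", grafanaURL, bytes.NewReader(body))
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Authorization", "Bearer "+apiKey)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		log.Printf("dashboard %d: %s", d, resp.Status)
	}
}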
- The queue is far too bursty. What do I mean by this? If everybody in your org sets an alert to run every 60s, they all go onto the queue at exactly the same second (e.g. 2:00:00) and miss their window if the queue is overrun. This has the effect of causing alerts to never fire. It seems to start happening a little after 5000 alerts. For example (note that the 21:42:10 evaluation never happens in the excerpt below; a rough model of why follows it):
timeout 360 tail -f /var/log/grafana/grafana.log |grep 'Alerting Benchmark9 test16 alert'
t=2018-05-01T21:41:10+0000 lvl=dbug msg="Scheduler: Putting job on to exec queue" logger=alerting.scheduler name="Alerting Benchmark9 test16 alert" id=15862
t=2018-05-01T21:41:10+0000 lvl=dbug msg="Job Execution completed" logger=alerting.engine timeMs=111.415 alertId=15862 name="Alerting Benchmark9 test16 alert" firing=false attemptID=1
t=2018-05-01T21:43:10+0000 lvl=dbug msg="Scheduler: Putting job on to exec queue" logger=alerting.scheduler name="Alerting Benchmark9 test16 alert" id=15862
t=2018-05-01T21:43:10+0000 lvl=dbug msg="Job Execution completed" logger=alerting.engine timeMs=119.315 alertId=15862 name="Alerting Benchmark9 test16 alert" firing=false attemptID=1
t=2018-05-01T21:44:10+0000 lvl=dbug msg="Scheduler: Putting job on to exec queue" logger=alerting.scheduler name="Alerting Benchmark9 test16 alert" id=15862
t=2018-05-01T21:44:11+0000 lvl=dbug msg="Job Execution completed" logger=alerting.engine timeMs=-476.701 alertId=15862 name="Alerting Benchmark9 test16 alert" firing=false attemptID=1
Terminated
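To put rough numbers on why ~5000 alerts seems to be the tipping point, here is a back-of-the-envelope model. This is not Grafana's scheduler code, and the worker count and per-evaluation time are assumptions (the latter loosely based on the timeMs values above): with every rule on a 60s frequency they all become due in the same second, so the burst has to drain within one interval or evaluations start being skipped, like the missing 21:42:10 run in the log.

package main

import "fmt"

func main() {
	const (
		evalSeconds = 0.110 // roughly the timeMs seen in the log excerpt
		intervalSec = 60.0  // everyone runs every 60s
		workers     = 10.0  // assumed evaluation concurrency, not taken from Grafana's source
	)
	for _, alerts := range []float64{1000, 5000, 6000, 10000} {
		drain := alerts * evalSeconds / workers
		fmt.Printf("%6.0f alerts: burst drains in %5.1fs (window %gs, overrun: %v)\n",
			alerts, drain, intervalSec, drain > intervalSec)
	}
}

With these assumed numbers the burst stops fitting inside the 60s window somewhere just past 5000 alerts, which matches what I see in the logs.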
- The alerts list UI is unusable at scale; opening https://grafana.yadda/alerting/list is not possible once you scale up to this many alerts.
- Should the size of this exec queue be adjustable? https://github.com/grafana/grafana/blob/master/pkg/services/alerting/engine.go#L46
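To make that concrete, here is a hypothetical sketch of what "adjustable" could look like: size the channel from a grafana.ini value rather than a hard-coded constant. The setting name alerting.exec_queue_size is invented for illustration and does not exist in Grafana today; this is the shape of the idea, not a patch.

package main

import (
	"fmt"

	ini "gopkg.in/ini.v1"
)

type Job struct{ AlertID int64 }

// newExecQueue builds the alerting exec queue with a configurable buffer size.
func newExecQueue(cfgFile string) (chan *Job, error) {
	cfg, err := ini.Load(cfgFile)
	if err != nil {
		return nil, err
	}
	// arbitrary default for the sketch if the key is missing
	size := cfg.Section("alerting").Key("exec_queue_size").MustInt(1000)
	return make(chan *Job, size), nil
}

func main() {
	q, err := newExecQueue("/etc/grafana/grafana.ini")
	if err != nil {
		panic(err)
	}
	fmt.Println("exec queue capacity:", cap(q))
}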
- Clustering does not exist (known issue; WalmartLabs has a fork that works, hopefully one day they can merge it!). If you are wondering what the alternative is for now, simply enable 'execute_alerts' on only one node in your config.
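For reference, this is the grafana.ini setting I mean: execute_alerts lives in the [alerting] section, and the idea is to leave it true on exactly one node and set it to false everywhere else.

; on the single node that should evaluate alerts
[alerting]
enabled = true
execute_alerts = true

; on every other node
[alerting]
enabled = true
execute_alerts = false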
Has anybody else gone above 5000 alerts in Grafana? What is your experience?