Hi,
We are using InfluxDB’s Continuous Queries to automatically downsample metrics to Retention Policies with a bigger lifespan.
These are the RPs:
create retention policy A on telegraf duration 1w replication 1 shard duration 168h
create retention policy B on telegraf duration 8w replication 1 shard duration 1344h
create retention policy C on telegraf duration 53w replication 1 shard duration 8904h
create retention policy D on telegraf duration INF replication 1 shard duration 520w
The metric collectors (Telegraf for this cpu
metric) send their data to “A”.
Here is an example of a set of CQs on one metric measurement cpu
(we run hundreds of them, performs excellent):
CREATE CONTINUOUS QUERY cpu_mean15m ON telegraf RESAMPLE EVERY 30m FOR 3h BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.B.cpu FROM telegraf.A.cpu GROUP BY time(15m),* fill(none) END
CREATE CONTINUOUS QUERY cpu_mean2h ON telegraf RESAMPLE FOR 6h BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.C.cpu FROM telegraf.A.cpu GROUP BY time(2h),* fill(none) END
CREATE CONTINUOUS QUERY cpu_mean1d ON telegraf RESAMPLE FOR 2d BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.D.cpu FROM telegraf.A.cpu GROUP BY time(1d),* fill(none) END
We templated the RP, so the enduser has to choose A,B,C or D
All panels on our main dashboards use the templated RP.
The problem we have with this, is that it is too easy for users to make mistakes.
When they choose A
, and select last 7 days
, the queries become way too slow (tens of seconds). They’d have to manually select B
or C
from the Templating dropdown we made for Retention Policies, to get normal performance. And when zooming in on the data (for example by dragging a region), they’d have to remember to switch back at a certain point.
The bad performance is not related to disk io, it is purely the number crunching of going through each second (measurement per second) within 7 days of data of several servers (‘tags’). Even when using large ‘time buckets’, for example only 10 results per graph returned by the server, performance is not improved. We see a high CPU spike on the influxdb server, low io-wait, low io traffic.
Since all the downsampling by the CQ’s happens in near real-time, Grafana should select a different RP automatically when a bigger interval is selected.
Example: when $__interval
is bigger than 15 minutes, we want Grafana to use retention policy B
instead of A
.
Any ideas on how to accomplish this?
I’m not sure yet what to ask for as a feature request. A query postprocessor would probably be most flexible, but an addition to Grafana’s templating framework could also be an option.
Kind regards, TW