Grafana Agent Flow Cluster

I’m evaluating the LGTM stack, and I am deploying Grafana Agent in Flow mode using the official Helm chart. I would like to use a discovery.kubernetes component to discover nodes from which to scrape Prometheus metrics. What I am seeing is that Grafana Agent is deployed as a DaemonSet, so I get one instance on each node in my cluster. From my understanding (which could be incorrect), each of the 4 agents will scrape metrics from all nodes in the cluster, resulting in duplicated metrics. What I’m seeing in the logs are tons of out-of-order sample warnings, and unusually high CPU usage even though the LGTM stack is the only thing running. I suspect the cause is 4 agents scraping every node and overwhelming Mimir.
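
For reference, here’s roughly what my pipeline looks like (component labels and the Mimir URL are placeholders, not my exact config):

```river
// Discover every node in the cluster.
discovery.kubernetes "nodes" {
  role = "node"
}

// Scrape all discovered nodes. With the DaemonSet, every agent runs this,
// so (as far as I can tell) every node gets scraped 4 times per interval.
prometheus.scrape "nodes" {
  targets         = discovery.kubernetes.nodes.targets
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "15s"
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir.example.svc:9009/api/v1/push"
  }
}
```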

What I am wondering is: what is a typical scrape interval? I started at 15s, but I suspect that is far too frequent, and that it doesn’t give each agent a wide enough window to stagger scrapes so they don’t stomp on each other when writing to Mimir. What I think I could do is filter out the “other” nodes, either in discovery.kubernetes or with discovery.relabel, so that each agent is only responsible for scraping the node it is running on (rough sketch below).
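
Something like this is what I had in mind — a rough sketch, assuming the pod can see its own node name in an env var (I believe the chart sets HOSTNAME to spec.nodeName via the downward API, but that may need to be configured):

```river
// Option A: only discover the local node, filtering server-side.
discovery.kubernetes "local_node" {
  role = "node"

  selectors {
    role  = "node"
    field = "metadata.name=" + env("HOSTNAME")
  }
}

// Option B: discover everything, then keep only targets for this node.
discovery.relabel "local_only" {
  targets = discovery.kubernetes.nodes.targets

  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    regex         = env("HOSTNAME")
    action        = "keep"
  }
}
```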

Is this a valid approach, or does it kind of defeat the purpose of clustering? Should I instead increase the interval to something like the 1m default, so that with all four agents scraping, each node effectively ends up being scraped every (interval / number of agents) seconds? Any advice here is greatly appreciated.
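
For completeness, my understanding of the clustering alternative is that I would enable clustering in the Helm chart (agent.clustering.enabled, if I’m reading the values right) and opt the scrape component in, so the cluster distributes targets across the agents instead of every agent scraping everything. A minimal sketch of what I think that looks like:

```river
// With Flow clustering enabled on the agents, the scrape component
// distributes targets across cluster members, so each node should
// only be scraped by one agent per interval.
prometheus.scrape "nodes" {
  targets    = discovery.kubernetes.nodes.targets
  forward_to = [prometheus.remote_write.mimir.receiver]

  clustering {
    enabled = true
  }
}
```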