Hello,
thanks for taking the time to read this.
We recently had a problem where the ingestion endpoints failed, and the Grafana Agents running on our EC2 instances were unable to send metrics and logs.
The agent then ate all the system memory until the OOM killer got invoked and started killing processes; sometimes it was the agent, sometimes other large processes. This caused a right headache, as you can imagine.
Is there a way to prevent this from happening again, perhaps via a configuration setting in the agent? My other idea is to use cgroups, but ideally I want a simple, quick fix of course!
Thanks,
Paul
This is what we saw in the logs:
# egrep -i "oom|memory" /var/log/messages
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.326855] grafana-agent invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.474421] oom_kill_process+0x223/0x420
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.478000] out_of_memory+0x102/0x4c0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.776955] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.261818] Out of memory: Kill process 9793 (grafana-agent) score 431 or sacrifice child
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.471967] oom_reaper: reaped process 9793 (grafana-agent), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
This is the process:
# ps -ewfwl | grep grafa
4 S root 1990 1 1 80 0 - 346297 - Jun29 ? 00:23:34 /usr/bin/grafana-agent --config.file /etc/grafana-agent/agent-config.yml -config.expand-env
Despite the bot marking this as resolved, it isn’t.
I’m happy to receive any ideas.
I read the docs about the command-line options, and there was nothing obvious I could do to control this behaviour.
I am pondering learning about cgroups to try to mitigate it.
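If it helps anyone, what I have in mind for the cgroups route is a systemd drop-in with a memory cap, since the agent runs as a systemd service here. An untested sketch, assuming the unit is named grafana-agent.service:

```
# Create a drop-in with: systemctl edit grafana-agent
# (written under /etc/systemd/system/grafana-agent.service.d/)
[Service]
MemoryHigh=768M   # soft limit: the kernel reclaims/throttles the agent above this (cgroups v2)
MemoryMax=1G      # hard limit: only the agent gets OOM-killed, not random other processes
```

After a `systemctl daemon-reload` and restart, at least the OOM kill would be contained to the agent instead of whatever the kernel picks. The 768M/1G values are just placeholders I’d have to tune.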
You can try to configure the backoff behaviour, e.g. for logs see Configuration | Grafana Loki documentation.
I guess all data is kept in memory between retries, so just drop it (metrics, logs, traces) earlier, e.g. max_period: 30s and max_retries: 3. Of course, you will lose the dropped data.
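Roughly, for the logs client in your agent-config.yml it would look something like this (a sketch; the push URL and config name are placeholders for your own setup):

```yaml
logs:
  configs:
    - name: default
      positions:
        filename: /tmp/positions.yaml
      clients:
        - url: https://loki.example.com/loki/api/v1/push  # placeholder endpoint
          backoff_config:
            min_period: 500ms  # initial wait between retries
            max_period: 30s    # cap on the exponential backoff
            max_retries: 3     # give up and drop the batch after this many attempts
```

With the defaults (max_period: 5m, max_retries: 10) a long outage means a lot of buffered batches sitting in memory, so tightening these should bound the growth at the cost of dropped logs.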