Grafana-agent memory leak/usage when ingestion endpoints die?

Hello,
thanks for taking the time to read this.

We recently had a problem where the ingestion endpoints failed and the Grafana Agents running on our EC2 instances were unable to send metrics and logs.

The agent then ate all the system memory until the OOM killer was invoked and started killing processes; sometimes it was the agent, sometimes other large processes. This caused a right headache, as you can imagine.

Is there a way to prevent this from happening again, using a configuration setting in the agent perhaps? My other idea is to use cgroups, but ideally I want a simple quick fix of course!

thanks
Paul

This is what we saw in the logs:

# egrep -i "oom|memory" /var/log/messages
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.326855] grafana-agent invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.474421]  oom_kill_process+0x223/0x420
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.478000]  out_of_memory+0x102/0x4c0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.776955] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.261818] Out of memory: Kill process 9793 (grafana-agent) score 431 or sacrifice child
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.471967] oom_reaper: reaped process 9793 (grafana-agent), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

This is the process:

# ps -ewfwl | grep grafa
4 S root      1990     1  1  80   0 - 346297 -     Jun29 ?        00:23:34 /usr/bin/grafana-agent --config.file /etc/grafana-agent/agent-config.yml -config.expand-env

Despite the bot marking this as resolved, it isn't.

I’m happy to receive any ideas.

I read the docs about the command-line options, and there was nothing obvious I can do to control this behaviour.

I am pondering learning about cgroups to try and mitigate this behaviour.
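
For what it's worth, the simplest cgroup route I can see is a systemd drop-in for the agent's unit. This is a minimal sketch, assuming the agent runs under systemd as grafana-agent.service on a cgroup v2 host; the 512M cap is just an illustrative number:

# /etc/systemd/system/grafana-agent.service.d/memory.conf
[Service]
MemoryMax=512M

# apply the override
systemctl daemon-reload
systemctl restart grafana-agent

As I understand it, this doesn't stop the agent itself from being OOM-killed when it hits the cap, but the kill is then scoped to the agent's own cgroup instead of taking out other large processes on the host.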

You can try to configure the backoff behaviour, e.g. for logs see the client section in Configuration | Grafana Loki documentation.
I guess all the data waiting between retries is kept in memory, so just drop it (metrics, logs, traces) earlier, e.g. max_period: 30s, max_retries: 3. Of course, you will lose the dropped data.
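
If it helps, this is roughly what that looks like in the agent's logs config. A minimal sketch, assuming static mode; the endpoint URL and the values below are placeholders rather than recommendations:

logs:
  configs:
    - name: default
      clients:
        - url: https://loki.example.com/loki/api/v1/push
          backoff_config:
            min_period: 500ms   # delay before the first retry
            max_period: 30s     # upper bound on the retry delay
            max_retries: 3      # drop the batch after this many failed attempts

As far as I know, the Prometheus-style remote_write queue has analogous min_backoff / max_backoff settings if it is the metrics path that is growing.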