Hello,
thanks for taking the time to read this.
We recently had a problem where the ingestion endpoints failed, and the Grafana Agents running on our EC2 instances were unable to send metrics and logs.
The agent then ate all the system memory until the OOM killer got invoked and started killing processes; sometimes it was the agent, sometimes other large processes. This caused a right headache, as you can imagine.
Is there a way to prevent this from happening again, perhaps via a configuration setting in the agent? My other idea is to use cgroups, but ideally I want a simple, quick fix of course!
Thanks,
Paul
This is what we saw in the logs:
# egrep -i "oom|memory" /var/log/messages
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.326855] grafana-agent invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.474421] oom_kill_process+0x223/0x420
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.478000] out_of_memory+0x102/0x4c0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.776955] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.261818] Out of memory: Kill process 9793 (grafana-agent) score 431 or sacrifice child
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.471967] oom_reaper: reaped process 9793 (grafana-agent), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
This is the process:
# ps -ewfwl | grep grafa
4 S root 1990 1 1 80 0 - 346297 - Jun29 ? 00:23:34 /usr/bin/grafana-agent --config.file /etc/grafana-agent/agent-config.yml -config.expand-env
Despite the bot marking this as resolved, it isn’t.
I’m happy to receive any ideas.
I read the docs about the command-line options, and there was nothing obvious I could do to control this behaviour.
I am pondering learning about cgroups to try to mitigate it.
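If it helps anyone, what I have in mind for the cgroups route is a systemd drop-in with a memory cap, since the agent runs as a systemd service here. An untested sketch, assuming the unit is named grafana-agent.service:

```
# Create a drop-in with: systemctl edit grafana-agent
# (written under /etc/systemd/system/grafana-agent.service.d/)
[Service]
MemoryHigh=768M   # soft limit: the kernel reclaims/throttles the agent above this (cgroups v2)
MemoryMax=1G      # hard limit: only the agent gets OOM-killed, not random other processes
```

After a `systemctl daemon-reload` and restart, at least the OOM kill would be contained to the agent instead of whatever the kernel picks. The 768M/1G values are just placeholders I’d have to tune.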
You can try to configure the backoff behaviour, e.g. for logs see Configuration | Grafana Loki documentation.
I guess all data is kept in memory between retries, so just drop it (metrics, logs, traces) earlier, e.g. max_period: 30s and max_retries: 3. Of course, you will lose the dropped data.
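Roughly, for the logs client in your agent-config.yml it would look something like this (a sketch; the push URL and config name are placeholders for your own setup):

```yaml
logs:
  configs:
    - name: default
      positions:
        filename: /tmp/positions.yaml
      clients:
        - url: https://loki.example.com/loki/api/v1/push  # placeholder endpoint
          backoff_config:
            min_period: 500ms  # initial wait between retries
            max_period: 30s    # cap on the exponential backoff
            max_retries: 3     # give up and drop the batch after this many attempts
```

With the defaults (max_period: 5m, max_retries: 10) a long outage means a lot of buffered batches sitting in memory, so tightening these should bound the growth at the cost of dropped logs.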