Hello.
My question is related to issue 3360 (opened 19 Feb 2021 UTC, closed 12 Jul 2021 UTC):
**Describe the bug**
The cluster is down when it should not be.
**To Reproduce**
The initial setup is 2 monolithic Loki 2.1.0 running with `replication_factor: 2`.
I add 2 nodes to the cluster, they all show `ACTIVE` looking at `/ring`.
I then remove the first 2 nodes. They first show as `LEAVING`, then go `Unhealthy` and never leave this state (I could not find a relevant config option to change this).
At this point the cluster is down. Reads and writes fail with something like:
`level=warn ts=2021-02-19T14:42:46.766880514Z caller=logging.go:71 traceID=44198a5667db211f msg="POST /loki/api/v1/push (500) 147.959µs Response: \"at least 3 live replicas required, could only find 2\\n\"`
Forgetting a single `Unhealthy` node using `/ring` buttons is enough to recover.
**Expected behavior**
2 `ACTIVE` nodes are sufficient for the cluster to be healthy, so the cluster should not be down when this condition is met.
Unhealthy nodes should leave the ring at some configurable point.
**Environment:**
- Infrastructure: ECS
- Deployment tool: Terraform
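The "at least N live replicas" error in the quoted issue comes from the ring's write quorum. A minimal sketch of the standard dynamo-style quorum rule used by Cortex-based rings (my reading, not Loki's actual code; function names are mine):

```python
def min_success(replication_factor: int) -> int:
    # Dynamo-style quorum: a write must succeed on a majority
    # of the replica set, i.e. floor(RF / 2) + 1 instances.
    return replication_factor // 2 + 1

def max_failures(replication_factor: int) -> int:
    # How many replicas may be unavailable before writes fail.
    return replication_factor - min_success(replication_factor)

# With replication_factor: 2, the quorum is 2, so even a single
# unavailable replica is enough to fail writes:
print(min_success(2), max_failures(2))  # -> 2 0
print(min_success(3), max_failures(3))  # -> 2 1
```

Note that unhealthy instances can still be counted into the replica set, which is consistent with the issue's error demanding more live replicas than the configured `replication_factor`.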
I have two hosts with Loki in monolithic mode. KV store is Consul.
```yaml
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        consul:
          host: localhost:8500
      heartbeat_timeout: 20s
      replication_factor: 1
  autoforget_unhealthy: true
```
I turn off one host and expect the remaining Loki instance to forget the unhealthy one, but it doesn't. The logs show:
```
autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round
```
The instance remains in the list with the "unhealthy" status, and receiving logs does not work.
Please explain why this is happening and how I can make the remaining Loki instance work.
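The skip message reads like a partition-safety guard: if half or more of the ring looks unhealthy, the node may itself be on the wrong side of a network partition, so autoforget declines to act. A minimal sketch of such a guard, assuming that threshold (this is my interpretation of the log line, not Loki's actual code):

```python
def should_autoforget(unhealthy: int, total: int) -> bool:
    # Hypothetical partition guard: only forget unhealthy members
    # while they are a strict minority of the ring. If at least
    # half the ring looks unhealthy, skip this round.
    return unhealthy * 2 < total

# 1 unhealthy out of 2 members is 50% of the ring, so forgetting
# is skipped -- matching the log message above:
print(should_autoforget(1, 2))  # -> False
print(should_autoforget(1, 3))  # -> True
```

Under this reading, a two-node ring can never autoforget a dead peer, since one unhealthy node is always half the ring.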
Thanks!
system (November 17, 2022, 7:28am):
This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.