Hello.
My question is related to issue 3360 (opened 19 Feb 2021 UTC, closed 12 Jul 2021 UTC):
**Describe the bug**
The cluster is down when it should not be.
**To Reproduce**
The initial setup is 2 monolithic Loki 2.1.0 running with `replication_factor: 2`.
I add 2 nodes to the cluster, they all show `ACTIVE` looking at `/ring`.
I then remove the first 2 nodes. They first show as `LEAVING`, then go `Unhealthy` and never leave this state (I could not find a relevant config option to change this).
At this point the cluster is down. Reads and writes fail with something like:
`level=warn ts=2021-02-19T14:42:46.766880514Z caller=logging.go:71 traceID=44198a5667db211f msg="POST /loki/api/v1/push (500) 147.959µs Response: \"at least 3 live replicas required, could only find 2\\n\"`
Forgetting a single `Unhealthy` node using `/ring` buttons is enough to recover.
**Expected behavior**
2 `ACTIVE` nodes are sufficient for the cluster to be healthy, so the cluster should not be down when this condition is met.
Unhealthy nodes should leave the ring at some configurable point.
**Environment:**
- Infrastructure: ECS
- Deployment tool: Terraform
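The "at least N live replicas" error in the quoted issue comes from the ring's write quorum. A minimal sketch of the standard dynamo-style quorum rule used by Cortex-based rings (my reading, not Loki's actual code; function names are mine):

```python
def min_success(replication_factor: int) -> int:
    # Dynamo-style quorum: a write must succeed on a majority
    # of the replica set, i.e. floor(RF / 2) + 1 instances.
    return replication_factor // 2 + 1

def max_failures(replication_factor: int) -> int:
    # How many replicas may be unavailable before writes fail.
    return replication_factor - min_success(replication_factor)

# With replication_factor: 2, the quorum is 2, so even a single
# unavailable replica is enough to fail writes:
print(min_success(2), max_failures(2))  # -> 2 0
print(min_success(3), max_failures(3))  # -> 2 1
```

Note that unhealthy instances can still be counted into the replica set, which is consistent with the issue's error demanding more live replicas than the configured `replication_factor`.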
I have two hosts with Loki in monolithic mode. KV store is Consul.
```yaml
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        consul:
          host: localhost:8500
      heartbeat_timeout: 20s
      replication_factor: 1
  autoforget_unhealthy: true
```
I turn off one host and expect the remaining Loki instance to forget the unhealthy one, but it doesn't. The logs show:
```
autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round
```
The instance remains in the list with the "unhealthy" status, and receiving logs does not work.
Please explain why this is happening and how I can make the remaining Loki instance work.
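The skip message reads like a partition-safety guard: if half or more of the ring looks unhealthy, the node may itself be on the wrong side of a network partition, so autoforget declines to act. A minimal sketch of such a guard, assuming that threshold (this is my interpretation of the log line, not Loki's actual code):

```python
def should_autoforget(unhealthy: int, total: int) -> bool:
    # Hypothetical partition guard: only forget unhealthy members
    # while they are a strict minority of the ring. If at least
    # half the ring looks unhealthy, skip this round.
    return unhealthy * 2 < total

# 1 unhealthy out of 2 members is 50% of the ring, so forgetting
# is skipped -- matching the log message above:
print(should_autoforget(1, 2))  # -> False
print(should_autoforget(1, 3))  # -> True
```

Under this reading, a two-node ring can never autoforget a dead peer, since one unhealthy node is always half the ring.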
Thanks!
system (November 17, 2022, 7:28am):
This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.