Node becomes unhealthy after joining the ring. UDP?

Hello,
I’m trying to run a Loki cluster on AWS ECS with AWS Cloud Map.
When a node joins the other one, I can see both on /ring, but a few seconds later a health check seems to fail and the node becomes unhealthy.
I suppose the connection is made over TCP but the checks go over UDP. Unfortunately, I’m not able to open the same port for both TCP and UDP. Is there a way to configure memberlist to use only TCP?

Thank you.

I don’t believe it needs UDP. We are running Loki on ECS as well and haven’t had any problems.
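
For reference, a minimal sketch of a security group rule that opens the gossip port over TCP only, between tasks that share the same security group. The names aws_security_group.loki_ecs_service and var.loki_ring_gossip_port come from the Terraform shared later in this thread; the resource name loki_gossip_tcp is made up for illustration, and the sketch assumes the variable matches the memberlist bind port in loki.yml.

```hcl
# Sketch only: allow Loki ring gossip between tasks over TCP.
# Assumes every Loki task is attached to aws_security_group.loki_ecs_service
# and that var.loki_ring_gossip_port equals the memberlist bind port in loki.yml.
resource "aws_security_group_rule" "loki_gossip_tcp" {
  type              = "ingress"
  description       = "Loki memberlist gossip (TCP only)"
  from_port         = var.loki_ring_gossip_port
  to_port           = var.loki_ring_gossip_port
  protocol          = "tcp"
  security_group_id = aws_security_group.loki_ecs_service.id
  self              = true # only members of the same security group may connect
}
```

If nodes still drop out of the ring with this in place, it may be worth checking that the task definition also maps the gossip port with protocol "tcp", as the definitions later in the thread do.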

A couple of suggestions, if you aren’t already doing these:

  1. Separate your writers and readers. The easiest way to do this is to set up one ECS cluster with two autoscaling groups. You can give them different ECS_INSTANCE_ATTRIBUTES and use placement_constraints so that each ECS service lands on the right instances.

The main reason for doing this is the WAL requirement for the writers. They need dedicated persistent volumes for the WAL, and ECS simply doesn’t have that functionality. The best workaround I could think of is to run the writers as a DAEMON service so each one gets a unique bind mount from its host.

  2. You’ll need service discovery for both writers and readers for ring membership. Make sure to use A records, not SRV. In case you weren’t aware, Cloud Map / Route 53 also returns at most 8 records per service discovery DNS query, so take that into account when sizing your containers and cap scaling at 8.
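
One way to enforce the cap of 8 on the reader side is through the autoscaling target. This is a sketch under assumptions: var.loki_ecs_cluster_name and the minimum of 2 are hypothetical and not taken from the actual setup, while aws_ecs_service.loki_reader is the service defined later in this thread.

```hcl
# Sketch only: keep the reader service at or below 8 tasks so that every
# task can still be returned by a service discovery DNS query.
# var.loki_ecs_cluster_name is a hypothetical variable (cluster name, not ARN).
resource "aws_appautoscaling_target" "loki_reader" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.loki_ecs_cluster_name}/${aws_ecs_service.loki_reader.name}"
  min_capacity       = 2 # assumed value
  max_capacity       = 8 # service discovery limit mentioned above
}
```

For writers running as DAEMON, the task count follows the number of writer instances, so the equivalent cap would presumably go on the max_size of the writer autoscaling group instead.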

You’ll also want to make sure your ECS services use the awsvpc network mode, because each writer and reader container needs to be individually discoverable, which means each one needs a unique IP to avoid port conflicts. If you want to run more than a couple of containers on one host, you’ll need to enable network interface trunking for your ECS hosts.
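
The host-level pieces mentioned above (instance attributes for placement, and ENI trunking) could look roughly like the following. This is a sketch, not the actual setup: var.ecs_ami_id, var.loki_writer_instance_type and var.loki_ecs_cluster_name are hypothetical variables, and the launch template leaves out the IAM instance profile, networking and storage a real one would need. The attribute name loki-instance-type matches the placement_constraints expressions used later in this thread.

```hcl
# Sketch only: tag writer hosts with a custom ECS instance attribute so that
# placement_constraints such as "attribute:loki-instance-type == writer" match.
resource "aws_launch_template" "loki_writer" {
  name_prefix   = "loki-writer-"
  image_id      = var.ecs_ami_id                # hypothetical variable
  instance_type = var.loki_writer_instance_type # hypothetical variable

  # The ECS agent reads these settings from /etc/ecs/ecs.config at startup.
  user_data = base64encode(<<-EOT
    #!/bin/bash
    echo 'ECS_CLUSTER=${var.loki_ecs_cluster_name}' >> /etc/ecs/ecs.config
    echo 'ECS_INSTANCE_ATTRIBUTES={"loki-instance-type":"writer"}' >> /etc/ecs/ecs.config
  EOT
  )
}

# Opt the account in to ENI trunking so that one host can run more awsvpc
# tasks. It only applies to supported instance types, and only to container
# instances launched after the setting is enabled.
resource "aws_ecs_account_setting_default" "awsvpc_trunking" {
  name  = "awsvpcTrunking"
  value = "enabled"
}
```

A second launch template with "reader" as the attribute value would cover the reader autoscaling group.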

A while ago I had some additional discussions with someone else who’s also running Loki on ECS; there might be something helpful in there: Loki 2.4.1 empty ring Code(500) error for "GET /loki/api/v1/labels" API on AWS ECS - #10 by tonyswumac

I was planning to run multiple writers as replicas, with dynamic EBS volumes provisioned through the rexray driver.
Interesting Cloud Map limitation!

Is it possible to share your task definition?

I’ve used rexray with ECS before and didn’t like it very much (your mileage may vary, of course), primarily because EBS volumes are tied to an AZ, which made the whole setup feel rather janky.

I’ve attached our Terraform definitions for the reader and writer services below (excluding the autoscaling parts). We aren’t using the new backend component yet, but will probably move to that soon.

reader:

resource "aws_service_discovery_service" "loki_reader" {
  name    = "${var.loki_cluster_identifier}-reader"
  dns_config {
    namespace_id   = aws_service_discovery_private_dns_namespace.service_discovery.id
    routing_policy = "MULTIVALUE"
    dns_records {
      ttl  = 5
      type = "A"
    }
  }
  health_check_custom_config {
    failure_threshold = 3
  }
}

resource "aws_ecs_service" "loki_reader" {
  name                               = "${var.loki_cluster_identifier}-reader"
  cluster                            = module.ecs-cluster-loki.ecs_cluster_arn
  task_definition                    = aws_ecs_task_definition.loki_reader.arn
  desired_count                      = var.loki_reader_ecs_desired_count
  deployment_minimum_healthy_percent = 50
  scheduling_strategy                = "REPLICA"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.loki_ecs_service.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.loki_reader_external_80.arn
    container_name   = "nginx"
    container_port   = 80
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.loki_reader_internal_80.arn
    container_name   = "nginx"
    container_port   = 80
  }

  service_registries {
    registry_arn   = aws_service_discovery_service.loki_reader.arn
    container_name = "${var.loki_cluster_identifier}-reader"
  }

  capacity_provider_strategy {
    capacity_provider = module.ecs-cluster-loki.capacity_provider_name
    weight            = 1
  }

  placement_constraints {
    type       = "memberOf"
    expression = "attribute:loki-instance-type == reader"
  }

  depends_on = [aws_lb.loki_internal_alb]

  lifecycle {
    ignore_changes = [desired_count]
  }
}

resource "aws_ecs_task_definition" "loki_reader" {
  family                  = "${var.loki_cluster_identifier}-reader"
  task_role_arn           = aws_iam_role.loki_ecs.arn
  execution_role_arn      = aws_iam_role.loki_ecs.arn
  network_mode            = "awsvpc"

  volume {
    name      = "nginx-config"
    host_path = "/opt/nginx/etc/nginx.conf"
  }

  volume {
    name      = "nginx-htpasswd"
    host_path = "/opt/nginx/etc/.htpasswd"
  }

  volume {
    name      = "loki-config"
    host_path = "/opt/loki/etc/loki.yml"
  }

  volume {
    name      = "loki-config-limits-overrides"
    host_path = "/opt/loki/etc/limits-overrides.yml"
  }

  container_definitions = jsonencode([
    {
      "name" = "nginx",
      "image" = var.nginx_image_version,
      "cpu" = 256,
      "memoryReservation" = 128,
      "essential" = true,
      "mountPoints" = [
        {
          "sourceVolume" = "nginx-config",
          "containerPath" = "/etc/nginx/nginx.conf",
          "readOnly" = true,
        },
        {
          "sourceVolume" = "nginx-htpasswd",
          "containerPath" = "/etc/nginx/.htpassword",
          "readOnly" = true,
        }
      ],
      "portMappings" = [
        {
          "hostPort" = 80,
          "containerPort" = 80,
          "protocol" = "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "json-file",
        "options": {
          "max-size": "1g",
          "max-file": "5",
          "tag": "nginx/{{.FullID}}"
        }
      }
    },
    {
      "name" = "${var.loki_cluster_identifier}-reader",
      "image" = var.loki_docker_image_version,
      "command" = [
        "-config.file=/etc/loki/loki.yml",
        "-target=read"
      ],
      "environment" = [
        {
          "name" = "ansible_playbook_md5",
          "value" = data.archive_file.ansible_playbook.output_md5
        },
        {
          "name" = "ansible_vars_md5",
          "value" = md5(local.ansible_vars)
        }
      ],
      "essential" = true,
      "cpu" = var.loki_reader_ecs_cpu_reservation,
      "memoryReservation" = var.loki_reader_ecs_memory_reservation,
      "mountPoints" = [
        {
          "sourceVolume" = "loki-config",
          "containerPath" = "/etc/loki/loki.yml",
          "readOnly" = true
        },
        {
          "sourceVolume" = "loki-config-limits-overrides",
          "containerPath" = "/etc/loki/limits-overrides.yml",
          "readOnly" = true
        }
      ],
      "portMappings" = [
        {
          "hostPort" = 3100,
          "containerPort" = 3100,
          "protocol" = "tcp"
        },
        {
          "hostPort" = var.loki_ring_gossip_port,
          "containerPort" = var.loki_ring_gossip_port,
          "protocol" = "tcp"
        }
      ],
      "privileged" = true,
      "ulimits" = [
        {
          "name" = "nofile",
          "softLimit" = 65536,
          "hardLimit" = 65536
        }
      ],
      "logConfiguration": {
        "logDriver": "json-file",
        "options": {
          "max-size": "1g",
          "max-file": "5",
          "tag": "loki-reader/{{.FullID}}"
        }
      }
    }
  ])

  depends_on = [null_resource.playbook_delay]
}

writer:

resource "aws_service_discovery_service" "loki_writer" {
  name    = "${var.loki_cluster_identifier}-writer"
  dns_config {
    namespace_id   = aws_service_discovery_private_dns_namespace.service_discovery.id
    routing_policy = "MULTIVALUE"
    dns_records {
      ttl  = 5
      type = "A"
    }
  }
  health_check_custom_config {
    failure_threshold = 3
  }
}

resource "aws_ecs_service" "loki_writer" {
  name                               = "${var.loki_cluster_identifier}-writer"
  cluster                            = module.ecs-cluster-loki.ecs_cluster_arn
  task_definition                    = aws_ecs_task_definition.loki_writer.arn
  deployment_minimum_healthy_percent = 50
  scheduling_strategy                = "DAEMON"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.loki_ecs_service.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.loki_writer_external_80.arn
    container_name   = "nginx"
    container_port   = 80
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.loki_writer_internal_80.arn
    container_name   = "nginx"
    container_port   = 80
  }

  service_registries {
    registry_arn   = aws_service_discovery_service.loki_writer.arn
    container_name = "${var.loki_cluster_identifier}-writer"
  }

  placement_constraints {
    type       = "memberOf"
    expression = "attribute:loki-instance-type == writer"
  }

  depends_on = [aws_lb.loki_internal_alb]
}

resource "aws_ecs_task_definition" "loki_writer" {
  family                  = "${var.loki_cluster_identifier}-writer"
  task_role_arn           = aws_iam_role.loki_ecs.arn
  execution_role_arn      = aws_iam_role.loki_ecs.arn
  network_mode            = "awsvpc"

  volume {
    name      = "nginx-config"
    host_path = "/opt/nginx/etc/nginx.conf"
  }

  volume {
    name      = "nginx-htpasswd"
    host_path = "/opt/nginx/etc/.htpasswd"
  }

  volume {
    name      = "loki-config"
    host_path = "/opt/loki/etc/loki.yml"
  }

  volume {
    name      = "loki-config-limits-overrides"
    host_path = "/opt/loki/etc/limits-overrides.yml"
  }

  volume {
    name      = "loki-writer-local-storage"
    host_path = var.loki_writer_local_storage
  }

  container_definitions = jsonencode([
    {
      "name" = "nginx",
      "image" = var.nginx_image_version,
      "cpu" = 256,
      "memoryReservation" = 256,
      "essential" = true,
      "mountPoints" = [
        {
          "sourceVolume" = "nginx-config",
          "containerPath" = "/etc/nginx/nginx.conf",
          "readOnly" = true,
        },
        {
          "sourceVolume" = "nginx-htpasswd",
          "containerPath" = "/etc/nginx/.htpassword",
          "readOnly" = true,
        }
      ],
      "portMappings" = [
        {
          "hostPort" = 80,
          "containerPort" = 80,
          "protocol" = "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "json-file",
        "options": {
          "max-size": "1g",
          "max-file": "5",
          "tag": "nginx/{{.FullID}}"
        }
      }
    },
    {
      "name" = "${var.loki_cluster_identifier}-writer",
      "image" = var.loki_docker_image_version,
      "cpu" = 256,
      "memoryReservation" = 2048,
      "essential" = true,
      "command" = [
        "-config.file=/etc/loki/loki.yml",
        "-target=write"
      ],
      "mountPoints" = [
        {
          "sourceVolume" = "loki-config",
          "containerPath" = "/etc/loki/loki.yml",
          "readOnly" = true
        },
        {
          "sourceVolume" = "loki-config-limits-overrides",
          "containerPath" = "/etc/loki/limits-overrides.yml",
          "readOnly" = true
        },
        {
          "sourceVolume" = "loki-writer-local-storage",
          "containerPath" = var.loki_writer_local_storage,
          "readOnly" = false
        }
      ],
      "portMappings" = [
        {
          "hostPort" = 3100,
          "containerPort" = 3100,
          "protocol" = "tcp"
        },
        {
          "hostPort" = var.loki_ring_gossip_port,
          "containerPort" = var.loki_ring_gossip_port,
          "protocol" = "tcp"
        },
        {
          "hostPort" = 9095,
          "containerPort" = 9095,
          "protocol" = "tcp"
        }
      ],
      "privileged" = true,
      "ulimits" = [
        {
          "name" = "nofile",
          "softLimit" = 65536,
          "hardLimit" = 65536
        }
      ],
      "logConfiguration": {
        "logDriver": "json-file",
        "options": {
          "max-size": "1g",
          "max-file": "5",
          "tag": "loki-writer/{{.FullID}}"
        }
      }
    }
  ])
}
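
The definitions above reference several input variables that were not part of the post. A sketch of how a few of them might be declared follows; the types are inferred from how the variables are used, and the default and descriptions are assumptions rather than the actual configuration. 7946 is Loki's default memberlist port.

```hcl
# Sketch only: possible declarations for some variables referenced above.
# Types are inferred from usage; the default is an assumption.
variable "loki_cluster_identifier" {
  type        = string
  description = "Prefix for the reader and writer service names"
}

variable "loki_ring_gossip_port" {
  type        = number
  description = "TCP port used for ring gossip (memberlist)"
  default     = 7946 # Loki's default memberlist port; adjust to match loki.yml
}

variable "loki_writer_local_storage" {
  type        = string
  description = "Host path bind-mounted into the writer at the same path for WAL and local data"
}

variable "private_subnet_ids" {
  type        = list(string)
  description = "Subnets for the awsvpc task network interfaces"
}
```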
