Logs disappearing

I’m deploying Loki (distributed mode) along with Tempo, Grafana, Prometheus, Promtail, etc. using Helm.

I think I picked the right options for deploying the loki-distributed chart but something is obviously not working properly… I can see logs in Grafana but after about an hour, they disappear. Up until this morning, Grafana would just show no results. However, now I have been seeing this error in the Grafana UI and also in my querier pod:

level=error ts=2022-01-31T21:37:05.641170853Z caller=batch.go:699 msg="error fetching chunks" err="open /var/loki/chunks/ZmFrZS9lYmIzMWQ1NzU5ZmNhNGYzOjE3ZWIxZWUxODEzOjE3ZWIxZWUxODE0OjkyZDhlMTNm: no such file or directory"

I exec’d into the querier pods and checked /var/loki/chunks, and it is empty in all 3 of them (I’m running a 6-node cluster).
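Incidentally, the long filename in that error is just a base64-encoded chunk key. If I’m reading Loki’s chunk key format right, it’s tenant/fingerprint:from:through:checksum, with the timestamps in hex milliseconds ("fake" is the tenant ID Loki uses when auth_enabled is false):

```shell
# Decode the chunk key from the "error fetching chunks" message
key='ZmFrZS9lYmIzMWQ1NzU5ZmNhNGYzOjE3ZWIxZWUxODEzOjE3ZWIxZWUxODE0OjkyZDhlMTNm'
echo "$key" | base64 -d
# fake/ebb31d5759fca4f3:17eb1ee1813:17eb1ee1814:92d8e13f
```

0x17eb1ee1813 ms lands on 2022-01-31 UTC, right around the log line’s timestamp, which would fit a chunk that an ingester flushed to a disk the querier can’t see.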

The documentation for this chart leaves a lot to be desired, so I was guessing at a lot of these values: I just enabled persistence wherever it was an option. I don’t know if that made sense, because the default was strangely false everywhere.

What I am attempting to do is keep logs for 14 days, hence my limits_config (retention_period: 336h) and compactor settings.

I have a chart of charts for this and my Loki settings in my values.yaml:

loki:
  loki:
    config: |
      auth_enabled: false

      server:
        http_listen_port: 3100

      distributor:
        ring:
          kvstore:
            store: memberlist

      memberlist:
        join_members:
          - {{ include "loki.fullname" . }}-memberlist

      ingester:
        lifecycler:
          ring:
            kvstore:
              store: memberlist
            replication_factor: 1
        chunk_idle_period: 30m
        chunk_block_size: 262144
        chunk_encoding: snappy
        chunk_retain_period: 1m
        max_transfer_retries: 0
        wal:
          dir: /var/loki/wal

      limits_config:
        enforce_metric_name: false
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        max_cache_freshness_per_query: 10m
        retention_period: 336h

      {{- if .Values.loki.schemaConfig}}
      schema_config:
      {{- toYaml .Values.loki.schemaConfig | nindent 2}}
      {{- end}}
      storage_config:
        boltdb_shipper:
          active_index_directory: /var/loki/index
          cache_location: /var/loki/cache
          cache_ttl: 168h
          shared_store: filesystem
          index_gateway_client:
            server_address: dns:///obs-loki-index-gateway:9095
        filesystem:
          directory: /var/loki/chunks

      chunk_store_config:
        max_look_back_period: 0s

      table_manager:
        retention_deletes_enabled: false
        retention_period: 0s

      query_range:
        align_queries_with_step: true
        max_retries: 5
        split_queries_by_interval: 15m
        cache_results: true
        results_cache:
          cache:
            enable_fifocache: true
            fifocache:
              max_size_items: 1024
              validity: 24h

      frontend_worker:
        frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095

      frontend:
        log_queries_longer_than: 5s
        compress_responses: true
        tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100

      compactor:
        working_directory: /data/retention
        shared_store: filesystem
        compaction_interval: 10m
        retention_enabled: true
        retention_delete_delay: 2h
        retention_delete_worker_count: 150

      ruler:
        storage:
          type: local
          local:
            directory: /etc/loki/rules
        ring:
          kvstore:
            store: memberlist
        rule_path: /tmp/loki/scratch
        alertmanager_url: https://alertmanager.xx
        external_url: https://alertmanager.xx

  ingester:
    replicas: 3
    persistence:
      # -- Enable creating PVCs which is required when using boltdb-shipper
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
  
  distributor:
    replicas: 3
  
  querier:
    replicas: 3
    persistence:
      # -- Enable creating PVCs for the querier cache
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
    extraVolumes:
    - name: bolt-db
      emptyDir: {}
    extraVolumeMounts:
    - name: bolt-db
      mountPath: /var/loki

  ruler:
    enabled: false
    replicas: 1

  indexGateway:
    enabled: true
    replicas: 3
    persistence:
      # -- Enable creating PVCs which is required when using boltdb-shipper
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null

  queryFrontend:
    replicas: 3

  gateway:
    replicas: 3

  compactor:
    enabled: true
    persistence:
      # -- Enable creating PVCs for the compactor
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
    serviceAccount:
      create: true

This is my loki config map as deployed on the cluster:

apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100

    distributor:
      ring:
        kvstore:
          store: memberlist

    memberlist:
      join_members:
        - obs-loki-memberlist

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      chunk_idle_period: 30m
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 1m
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal

    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      max_cache_freshness_per_query: 10m
      retention_period: 336h
    schema_config:
      configs:
      - from: "2020-09-07"
        index:
          period: 24h
          prefix: loki_index_
        object_store: filesystem
        schema: v11
        store: boltdb-shipper
    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 168h
        shared_store: filesystem
        index_gateway_client:
          server_address: dns:///obs-loki-index-gateway:9095
      filesystem:
        directory: /var/loki/chunks

    chunk_store_config:
      max_look_back_period: 0s

    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s

    query_range:
      align_queries_with_step: true
      max_retries: 5
      split_queries_by_interval: 15m
      cache_results: true
      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_items: 1024
            validity: 24h

    frontend_worker:
      frontend_address: obs-loki-query-frontend:9095

    frontend:
      log_queries_longer_than: 5s
      compress_responses: true
      tail_proxy_url: http://obs-loki-querier:3100

    compactor:
      working_directory: /data/retention
      shared_store: filesystem
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150

    ruler:
      storage:
        type: local
        local:
          directory: /etc/loki/rules
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/loki/scratch
      alertmanager_url: https://alertmanager.xx
      external_url: https://alertmanager.xx

One thing that doesn’t quite seem right is the compactor’s working_directory, though I am not sure what the correct value should be:

    compactor:
      working_directory: /data/retention

However, retention is set to 14 days (retention_period: 336h), so I don’t think the compactor is to blame for these missing logs.
Why are my logs disappearing? Do I need to run the ruler? My traces are also disappearing in Tempo but I don’t know if that’s related…

Update 1:
I tried changing my querier to not use persistent volumes, which I think just means it keeps an in-memory cache instead of an on-disk cache. That doesn’t seem to have helped so far.

Update 2:
I just tested a log search for the last 12 hours for a particular namespace: {namespace="obs"} which should include the logs from the various Loki components. I get an error in Grafana that says:

Query error
open /var/loki/chunks/ZmFrZS9hMWVjYThlZDA0OGZkM2NjOjE3ZWIzMGE0ODVmOjE3ZWIzNzgyNTYyOjg2MDI4Nzgw: no such file or directory

So I check my ingester pods and I have 3 running:

obs-loki-ingester-0
obs-loki-ingester-1
obs-loki-ingester-2

I then ran kubectl exec -it obs-loki-ingester-1 -n obs -- ls -la /var/loki/chunks/ZmFrZS9hMWVjYThlZDA0OGZkM2NjOjE3ZWIzMGE0ODVmOjE3ZWIzNzgyNTYyOjg2MDI4Nzgw for each pod and I see obs-loki-ingester-1 does indeed have this file. So something is screwed up somewhere causing Loki to not be aware of this…

I have Grafana pointed to my querier service http://obs-loki-querier.obs:3100 as a data source. Is this not the correct spot to point at for my Grafana data source? Should my Grafana data source be pointing at obs-loki-querier-frontend.obs (the querier-frontend service)?

Update 3:
I tried pointing my Grafana data source to obs-loki-querier-frontend.obs (the querier-frontend service) and then the “Test” button under Data Sources fails… So that doesn’t seem to be the fix.

At this point I think I have determined that using filesystem storage is NOT an option for Loki in distributed mode.
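If that’s right, the minimal change would be to point every component at the same shared object store (an S3-compatible endpoint such as MinIO) instead of per-pod local disks. Roughly, with placeholder endpoint, bucket, and credentials:

    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        shared_store: s3
      aws:
        # placeholder S3-compatible endpoint and credentials
        endpoint: http://minio.obs:9000
        bucketnames: loki
        access_key_id: <access-key>
        secret_access_key: <secret-key>
        s3forcepathstyle: true

with schema_config’s object_store changed from filesystem to aws, and the compactor’s shared_store set to s3 as well.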

Reference: Storage | Grafana Labs

Hello @justinstauffer, I am having the same issue, and my Grafana data source points to the loki-loki-distributed-gateway, which I think is the nginx aggregator for the different endpoints. But I guess that has nothing to do with the error; we are missing something else… Did you find a solution for this?

I am seeing the same problem

In my case, as @justinstauffer mentioned, I had to ensure that all components were using the aws/s3 storage. Here is the values.yaml I used, in case it helps:

loki:
  schemaConfig:
    configs:
      - from: "2020-09-07"
        store: boltdb-shipper
        object_store: aws
        schema: v11
        index:
          prefix: loki_index_
          period: 24h

  storageConfig:
    boltdb_shipper:
      active_index_directory: /var/loki/index
      cache_location: /var/loki/cache
      cache_ttl: 168h
      shared_store: s3
    aws:
      endpoint: http://s3-endpoint
      bucketnames: loki-boltdb
      access_key_id: access
      secret_access_key: secret
      insecure: true
      s3forcepathstyle: true

querier:
  extraVolumes:
    - name: data
      emptyDir: {}
  extraVolumeMounts:
    - name: data
      mountPath: /var/loki

indexGateway:
  enabled: true
  persistence:
    enabled: true

ingester:
  persistence:
    enabled: true

compactor:
  enabled: true
  extraArgs:
    - -boltdb.shipper.compactor.shared-store=s3

ruler:
  enabled: true
  extraArgs:
    - -ruler.storage.type=s3
    - -ruler.storage.s3.endpoint=http://s3-endpoint
    - -ruler.storage.s3.access-key-id=access
    - -ruler.storage.s3.secret-access-key=secret
    - -ruler.storage.s3.insecure=true
    - -ruler.storage.s3.buckets=loki-ruler
    - -ruler.storage.s3.force-path-style=true

I ended up using a locally deployed MinIO service for S3-compatible storage. If you try to use filesystem storage in distributed mode, nothing stops you, but it will behave strangely and not store logs properly.
