Loki k8s single binary log retention configuration - not deleting logs

I have installed the Grafana Loki single binary in my Kubernetes cluster using the Helm chart. Everything works great except that my persistent storage (filesystem) is filling up. I have read the storage retention configuration docs from Grafana and many posts here and elsewhere about this. I believe that I have configured my Loki installation to remove logs using the compactor, but my persistent volume keeps filling up.

I am using version 3.1.0 of the Loki Helm chart (loki-3.1.0.tgz) to install version 2.6.1 of the Loki image (grafana/loki:2.6.1).
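
To double-check which chart and image are actually deployed, something like this works (in my cluster the release is called loki and lives in the logging namespace; adjust the names for your setup):

helm list -n logging
kubectl get pod loki-0 -n logging -o jsonpath='{.spec.containers[*].image}'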

Here is my values.yaml file that I am using to install Loki:

# fullnameOverride: loki

# global:
#   image:
#     registry: null

monitoring:
  dashboards:
    enabled: false
  rules:
    enabled: false
  alerts:
    enabled: false
  serviceMonitor:
    enabled: false    
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false    
    lokiCanary:
      enabled: false 

loki:
  image:
    # -- The Docker registry
    registry: harbor.fractilia.com/library
    # -- Docker image repository
    repository: grafana/loki
    # -- Overrides the image tag whose default is the chart's appVersion
    tag: 2.6.1
    # -- Docker image pull policy
    pullPolicy: IfNotPresent
  # Should authentication be enabled
  auth_enabled: false
  storage:
    type: filesystem


  compactor:
    shared_store: filesystem
    working_directory: /var/loki/boltdb-shipper-compactor
    compaction_interval: 10m
    retention_enabled: true
    retention_delete_delay: 1h
    retention_delete_worker_count: 100

  limits_config:
    retention_period: 2d

  storage_config:
    boltdb_shipper:
      active_index_directory: /var/loki/boltdb-shipper-active
      cache_location: /var/loki/boltdb-shipper-cache
      cache_ttl: 24h
      shared_store: filesystem
    filesystem:
      directory: /var/loki/chunks

  # commonConfig:
  #   path_prefix: /var/loki
  #   replication_factor: 1

  # server:
  #   log_level: debug

  # NOTE: We need the chunk_store_config and ingester settings, and I don't see another way of getting them into the config.
  config: |
    {{- if .Values.enterprise.enabled}}
    {{- tpl .Values.enterprise.config . }}
    {{- else }}
    auth_enabled: {{ .Values.loki.auth_enabled }}
    {{- end }}

    {{- with .Values.loki.server }}
    server:
      {{- toYaml . | nindent 2}}
    {{- end}}

    memberlist:
      join_members:
        - {{ include "loki.memberlist" . }}

    {{- if .Values.loki.commonConfig}}
    common:
    {{- toYaml .Values.loki.commonConfig | nindent 2}}
      storage:
      {{- include "loki.commonStorageConfig" . | nindent 4}}
    {{- end}}

    {{- with .Values.loki.limits_config }}
    limits_config:
      {{- tpl (. | toYaml) $ | nindent 4 }}
    {{- end }}

    {{- with .Values.loki.memcached.chunk_cache }}
    {{- if and .enabled .host }}
    chunk_store_config:
      chunk_cache_config:
        memcached:
          batch_size: {{ .batch_size }}
          parallelism: {{ .parallelism }}
        memcached_client:
          host: {{ .host }}
          service: {{ .service }}
    {{- end }}
    {{- end }}

    {{- if .Values.loki.schemaConfig}}
    schema_config:
    {{- toYaml .Values.loki.schemaConfig | nindent 2}}
    {{- else }}
    schema_config:
      configs:
        - from: 2022-01-11
          store: boltdb-shipper
          {{- if eq .Values.loki.storage.type "s3" }}
          object_store: s3
          {{- else if eq .Values.loki.storage.type "gcs" }}
          object_store: gcs
          {{- else }}
          object_store: filesystem
          {{- end }}
          schema: v12
          index:
            prefix: loki_index_
            period: 24h
    {{- end }}

    {{- if or .Values.minio.enabled (eq .Values.loki.storage.type "s3") (eq .Values.loki.storage.type "gcs") }}
    ruler:
      storage:
      {{- include "loki.rulerStorageConfig" . | nindent 4}}
    {{- end -}}

    {{- with .Values.loki.memcached.results_cache }}
    query_range:
      align_queries_with_step: true
      {{- if and .enabled .host }}
      cache_results: {{ .enabled }}
      results_cache:
        cache:
          default_validity: {{ .default_validity }}
          memcached_client:
            host: {{ .host }}
            service: {{ .service }}
            timeout: {{ .timeout }}
      {{- end }}
    {{- end }}

    {{- with .Values.loki.storage_config }}
    storage_config:
      {{- tpl (. | toYaml) $ | nindent 4 }}
    {{- end }}

    {{- with .Values.loki.query_scheduler }}
    query_scheduler:
      {{- tpl (. | toYaml) $ | nindent 4 }}
    {{- end }}

    {{- with .Values.loki.compactor }}
    compactor:
      {{- tpl (. | toYaml) $ | nindent 4 }}
    {{- end }}

    chunk_store_config:
      max_look_back_period: "0s"
    ingester:
      chunk_block_size: 262144
      chunk_idle_period: 30m
      chunk_retain_period: 1m
      lifecycler:
        ring:
          replication_factor: 1
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal  

    
  # TODO: There might be nothing to do here.
  # memberlist:
  #   abort_if_cluster_join_fails: false
  #   join_members:
  #   - loki-memberlist
  #   - loki-memberlist.logging.svc.cluster.local

singleBinary:
  # -- Number of replicas for the single binary
  replicas: 1
  # -- Resource requests and limits for the single binary
  resources: {}
  # -- Node selector for single binary pods
  nodeSelector: {}
  persistence:
    # -- Size of persistent disk
    size: 500Gi
    # -- Storage class to be used.
    # If defined, storageClassName: <storageClass>.
    # If set to "-", storageClassName: "", which disables dynamic provisioning.
    # If empty or set to null, no storageClassName spec is
    # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
    storageClass: "fame-storage-vsan-policy"

This renders the following Loki config file (stored in a ConfigMap):

apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false
    chunk_store_config:
      max_look_back_period: 0s
    common:
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        filesystem:
          chunks_directory: /var/loki/chunks
          rules_directory: /var/loki/rules
    compactor:
      compaction_interval: 10m
      retention_delete_delay: 1h
      retention_delete_worker_count: 100
      retention_enabled: true
      shared_store: filesystem
      working_directory: /var/loki/boltdb-shipper-compactor
    ingester:
      chunk_block_size: 262144
      chunk_idle_period: 30m
      chunk_retain_period: 1m
      lifecycler:
        ring:
          replication_factor: 1
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal
    limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 2d
      split_queries_by_interval: 15m
    memberlist:
      join_members:
      - loki-memberlist
    query_range:
      align_queries_with_step: true
    schema_config:
      configs:
      - from: "2022-01-11"
        index:
          period: 24h
          prefix: loki_index_
        object_store: filesystem
        schema: v12
        store: boltdb-shipper
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100
    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/boltdb-shipper-active
        cache_location: /var/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /var/loki/chunks
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
kind: ConfigMap
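
I dumped the ConfigMap above straight from the cluster with a command along these lines (the ConfigMap name loki depends on the release name, so it may differ in your setup):

kubectl get configmap loki -n logging -o jsonpath='{.data.config\.yaml}'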

It looks like this is configured to delete logs after 2 days (I initially had 3), but the usage of my persistent volume keeps going up, even after a week of running.

Is there something that I am missing in this configuration to get log retention working correctly?

I believe that I have found the issue. The compactor's retention marker processing fails because /tmp inside the container is on a read-only file system.

I ran:
kubectl logs loki-0 -n logging

It gave me a number of log messages like:
level=warn ts=2023-01-05T20:58:48.828870651Z caller=marker.go:214 msg="failed to process marks" path=/var/loki/boltdb-shipper-compactor/retention/markers/1672871457891585205 err="open /tmp/marker-view-2316940776: read-only file system"
level=warn ts=2023-01-05T20:58:48.828882132Z caller=marker.go:214 msg="failed to process marks" path=/var/loki/boltdb-shipper-compactor/retention/markers/1672872057887901626 err="open /tmp/marker-view-1660263759: read-only file system"
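
As far as I can tell, the root cause is that the chart runs the Loki container with a read-only root filesystem, so the compactor cannot create its temporary marker-view files under /tmp. If I read the chart's default values correctly (this is an assumption about chart 3.1.0, not something I set myself), the relevant default is roughly:

loki:
  containerSecurityContext:
    # With a read-only root filesystem, anything Loki writes outside a
    # mounted volume (like /tmp) fails with "read-only file system".
    readOnlyRootFilesystem: true

Setting readOnlyRootFilesystem to false would probably also make the error go away, but I preferred to keep the root filesystem read-only and mount a writable volume at /tmp instead.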

To fix this problem, I added some extra volume configuration to the singleBinary section of my values.yaml file for the Helm deployment:

singleBinary:
  # -- Number of replicas for the single binary
  replicas: 1
  # -- Resource requests and limits for the single binary
  resources: {}
  # -- Node selector for single binary pods
  nodeSelector: {}
  persistence:
    # -- Size of persistent disk
    size: 500Gi
    # -- Storage class to be used.
    # If defined, storageClassName: <storageClass>.
    # If set to "-", storageClassName: "", which disables dynamic provisioning.
    # If empty or set to null, no storageClassName spec is
    # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
    storageClass: "fame-storage-vsan-policy"

  # -- Volume mounts to add to the single binary pods
  extraVolumeMounts:
  - name: temporary
    mountPath: /tmp 
  # -- Volumes to add to the single binary pods
  extraVolumes: 
  - name: temporary 
    emptyDir: {}
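
After redeploying with the extra /tmp volume, the same log check from before comes back clean, i.e. no more marker warnings:

kubectl logs loki-0 -n logging | grep "read-only file system"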

The growth in my persistent volume has stopped, so it looks like this is working.
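
To keep an eye on the actual usage, I occasionally check the chunks directory inside the pod (this assumes the image ships a shell with du, which appears to be the case for grafana/loki:2.6.1):

kubectl exec loki-0 -n logging -- du -sh /var/loki/chunks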
