I’m deploying Loki (distributed mode) along with Tempo, Grafana, Prometheus, Promtail, etc. using Helm.
I think I picked the right options for deploying the loki-distributed chart, but something is clearly not working properly. I can see logs in Grafana, but after about an hour they disappear. Up until this morning Grafana would simply show no results; now, however, I am seeing this error both in the Grafana UI and in my querier pod:
level=error ts=2022-01-31T21:37:05.641170853Z caller=batch.go:699 msg="error fetching chunks" err="open /var/loki/chunks/ZmFrZS9lYmIzMWQ1NzU5ZmNhNGYzOjE3ZWIxZWUxODEzOjE3ZWIxZWUxODE0OjkyZDhlMTNm: no such file or directory"
I exec'd into the querier pods and checked /var/loki/chunks: it is empty in all 3 of them (I'm running a 6-node cluster).
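For reference, the check looked roughly like this (the pod names are my guess at the chart's naming; adjust to whatever kubectl get pods -n obs actually shows):

# Pod names are illustrative, not necessarily what the chart created.
for pod in obs-loki-querier-0 obs-loki-querier-1 obs-loki-querier-2; do
  echo "== $pod =="
  kubectl exec -n obs "$pod" -- ls -la /var/loki/chunks
done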
The documentation for this chart leaves a lot to be desired, so I was guessing at a lot of these values. I simply enabled persistence wherever it was an option; I don't know if that made sense, because the default was, strangely, false everywhere.
What I am attempting to do is keep logs for 14 days, hence my limits_config and compactor settings.
I have a chart of charts for this; these are my Loki settings in my values.yaml:
loki:
  loki:
    config: |
      auth_enabled: false
      server:
        http_listen_port: 3100
      distributor:
        ring:
          kvstore:
            store: memberlist
      memberlist:
        join_members:
          - {{ include "loki.fullname" . }}-memberlist
      ingester:
        lifecycler:
          ring:
            kvstore:
              store: memberlist
            replication_factor: 1
        chunk_idle_period: 30m
        chunk_block_size: 262144
        chunk_encoding: snappy
        chunk_retain_period: 1m
        max_transfer_retries: 0
        wal:
          dir: /var/loki/wal
      limits_config:
        enforce_metric_name: false
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        max_cache_freshness_per_query: 10m
        retention_period: 336h
      {{- if .Values.loki.schemaConfig}}
      schema_config:
        {{- toYaml .Values.loki.schemaConfig | nindent 2}}
      {{- end}}
      storage_config:
        boltdb_shipper:
          active_index_directory: /var/loki/index
          cache_location: /var/loki/cache
          cache_ttl: 168h
          shared_store: filesystem
          index_gateway_client:
            server_address: dns:///obs-loki-index-gateway:9095
        filesystem:
          directory: /var/loki/chunks
      chunk_store_config:
        max_look_back_period: 0s
      table_manager:
        retention_deletes_enabled: false
        retention_period: 0s
      query_range:
        align_queries_with_step: true
        max_retries: 5
        split_queries_by_interval: 15m
        cache_results: true
        results_cache:
          cache:
            enable_fifocache: true
            fifocache:
              max_size_items: 1024
              validity: 24h
      frontend_worker:
        frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095
      frontend:
        log_queries_longer_than: 5s
        compress_responses: true
        tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100
      compactor:
        working_directory: /data/retention
        shared_store: filesystem
        compaction_interval: 10m
        retention_enabled: true
        retention_delete_delay: 2h
        retention_delete_worker_count: 150
      ruler:
        storage:
          type: local
          local:
            directory: /etc/loki/rules
        ring:
          kvstore:
            store: memberlist
        rule_path: /tmp/loki/scratch
        alertmanager_url: https://alertmanager.xx
        external_url: https://alertmanager.xx
  ingester:
    replicas: 3
    persistence:
      # -- Enable creating PVCs which is required when using boltdb-shipper
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
  distributor:
    replicas: 3
  querier:
    replicas: 3
    persistence:
      # -- Enable creating PVCs for the querier cache
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
    extraVolumes:
      - name: bolt-db
        emptyDir: {}
    extraVolumeMounts:
      - name: bolt-db
        mountPath: /var/loki
  ruler:
    enabled: false
    replicas: 1
  indexGateway:
    enabled: true
    replicas: 3
    persistence:
      # -- Enable creating PVCs which is required when using boltdb-shipper
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
  queryFrontend:
    replicas: 3
  gateway:
    replicas: 3
  compactor:
    enabled: true
    persistence:
      # -- Enable creating PVCs for the compactor
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
  serviceAccount:
    create: true
This is my Loki ConfigMap as deployed on the cluster:
apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    distributor:
      ring:
        kvstore:
          store: memberlist
    memberlist:
      join_members:
        - obs-loki-memberlist
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      chunk_idle_period: 30m
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 1m
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      max_cache_freshness_per_query: 10m
      retention_period: 336h
    schema_config:
      configs:
        - from: "2020-09-07"
          index:
            period: 24h
            prefix: loki_index_
          object_store: filesystem
          schema: v11
          store: boltdb-shipper
    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 168h
        shared_store: filesystem
        index_gateway_client:
          server_address: dns:///obs-loki-index-gateway:9095
      filesystem:
        directory: /var/loki/chunks
    chunk_store_config:
      max_look_back_period: 0s
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s
    query_range:
      align_queries_with_step: true
      max_retries: 5
      split_queries_by_interval: 15m
      cache_results: true
      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_items: 1024
            validity: 24h
    frontend_worker:
      frontend_address: obs-loki-query-frontend:9095
    frontend:
      log_queries_longer_than: 5s
      compress_responses: true
      tail_proxy_url: http://obs-loki-querier:3100
    compactor:
      working_directory: /data/retention
      shared_store: filesystem
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
    ruler:
      storage:
        type: local
        local:
          directory: /etc/loki/rules
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/loki/scratch
      alertmanager_url: https://alertmanager.xx
      external_url: https://alertmanager.xx
This part of the compactor config doesn't quite seem right, but I am not sure what the correct value is. However, the retention period is 14 days, so I don't think the compactor's retention is to blame for these missing logs:

compactor:
  working_directory: /data/retention
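As a sanity check, I can at least look at what the compactor has in its working directory (the pod name below is a placeholder for whatever the chart actually created):

kubectl get pods -n obs | grep compactor
kubectl exec -n obs <compactor-pod-name> -- ls -la /data/retention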
Why are my logs disappearing? Do I need to run the ruler? My traces are also disappearing in Tempo but I don’t know if that’s related…
Update 1:
I tried changing my querier to not use persistent volumes, which I think just means it keeps an in-memory cache instead of an on-disk one. That doesn't seem to have helped so far.
Update 2:
I just tested a log search for the last 12 hours for a particular namespace, {namespace="obs"}, which should include the logs from the various Loki components. I get an error in Grafana that says:
Query error
open /var/loki/chunks/ZmFrZS9hMWVjYThlZDA0OGZkM2NjOjE3ZWIzMGE0ODVmOjE3ZWIzNzgyNTYyOjg2MDI4Nzgw: no such file or directory
So I check my ingester pods and I have 3 running:
obs-loki-ingester-0
obs-loki-ingester-1
obs-loki-ingester-2
I then ran kubectl exec -it obs-loki-ingester-1 -n obs -- ls -la /var/loki/chunks/ZmFrZS9hMWVjYThlZDA0OGZkM2NjOjE3ZWIzMGE0ODVmOjE3ZWIzNzgyNTYyOjg2MDI4Nzgw (and the equivalent for each pod), and obs-loki-ingester-1 does indeed have this file. So something is broken somewhere that is causing Loki not to be aware of it…
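Spelled out, the per-pod check was roughly this:

# The chunk name is the one from the Grafana error above.
chunk=ZmFrZS9hMWVjYThlZDA0OGZkM2NjOjE3ZWIzMGE0ODVmOjE3ZWIzNzgyNTYyOjg2MDI4Nzgw
for pod in obs-loki-ingester-0 obs-loki-ingester-1 obs-loki-ingester-2; do
  echo "== $pod =="
  kubectl exec -n obs "$pod" -- ls -la "/var/loki/chunks/$chunk"
done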
I have Grafana pointed at my querier service, http://obs-loki-querier.obs:3100, as a data source. Is that not the correct place to point a Grafana data source at? Should it instead be pointing at obs-loki-querier-frontend.obs (the querier-frontend service)?
Update 3:
I tried pointing my Grafana data source at obs-loki-querier-frontend.obs (the querier-frontend service), but then the "Test" button under Data Sources fails, so that doesn't seem to be the fix.
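In case it helps anyone trying to reproduce: one way I can take Grafana out of the picture entirely is to port-forward to the Loki service and hit the HTTP query API directly. This is only a sketch; the service target is my querier service (swap in the query-frontend service to compare), and the time-range arithmetic assumes GNU date:

# Forward the querier service locally.
kubectl port-forward -n obs svc/obs-loki-querier 3100:3100 &

# Run the same 12-hour query directly against the Loki HTTP API.
curl -G -s http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="obs"}' \
  --data-urlencode "start=$(date -d '12 hours ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode 'limit=100'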