Ruler deployment crash looping

Hi,

After enabling the Ruler deployment in Loki using Tanka/Jsonnet, the ruler pods are crash looping with the following error:

mkdir : no such file or directory
error creating index client
github.com/cortexproject/cortex/pkg/chunk/storage.NewStore
	/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:176
github.com/grafana/loki/pkg/loki.(*Loki).initStore
	/src/loki/pkg/loki/modules.go:287
github.com/cortexproject/cortex/pkg/util/mod…

(from pod ruler-84d9c69b5-6nb5l)

We are using boltdb-shipper with a GCS store for the ruler.

Relevant ruler config:

ruler:
    alertmanager_url: http://alertmanager.monitoring.svc:9093
    ring:
        kvstore:
            consul:
                host: consul-server.consul.svc:8500
            prefix: loki/rulers/
    storage:
        gcs:
            bucket_name: ...

Thanks
Rajat Vig

Can you paste your entire config file (masking any sensitive data)? That error message implies a problem with a different section of the config, I believe.

apiVersion: v1
data:
  config.yaml: |
    chunk_store_config:
        chunk_cache_config:
            memcached:
                batch_size: 1024
                parallelism: 100
            memcached_client:
                consistent_hash: true
                host: memcached.loki.svc.cluster.local
                service: memcached-client
        max_look_back_period: 0
    distributor:
        ring:
            kvstore:
                consul:
                    consistent_reads: false
                    host: consul-server.consul.svc:8500
                    http_client_timeout: 20s
                    watch_burst_size: 1
                    watch_rate_limit: 1
                prefix: loki/collectors/
                store: consul
    frontend:
        compress_responses: true
        log_queries_longer_than: 10s
        max_outstanding_per_tenant: 4800
    frontend_worker:
        frontend_address: query-frontend.loki.svc.cluster.local:9095
        grpc_client_config:
            max_send_msg_size: 1.048576e+08
        parallelism: 2
    ingester:
        chunk_block_size: 262144
        chunk_idle_period: 15m
        lifecycler:
            heartbeat_period: 5s
            interface_names:
              - eth0
            join_after: 30s
            num_tokens: 512
            ring:
                heartbeat_timeout: 1m
                kvstore:
                    consul:
                        consistent_reads: true
                        host: consul-server.consul.svc:8500
                        http_client_timeout: 20s
                    prefix: loki/collectors/
                    store: consul
                replication_factor: 3
        max_transfer_retries: 60
    ingester_client:
        grpc_client_config:
            max_recv_msg_size: 6.7108864e+07
        pool_config:
            health_check_ingesters: true
        remote_timeout: 1s
    limits_config:
        enforce_metric_name: false
        ingestion_burst_size_mb: 30
        ingestion_rate_mb: 25
        ingestion_rate_strategy: global
        max_cache_freshness_per_query: 10m
        max_global_streams_per_user: 20000
        max_query_length: 12000h
        max_query_parallelism: 16
        max_streams_per_user: 0
        reject_old_samples: true
        reject_old_samples_max_age: 168h
    querier:
        query_ingesters_within: 2h
    query_range:
        align_queries_with_step: true
        cache_results: true
        max_retries: 5
        parallelise_shardable_queries: true
        results_cache:
            cache:
                memcached_client:
                    consistent_hash: true
                    host: memcached-frontend.loki.svc.cluster.local
                    max_idle_conns: 64
                    service: memcached-client
                    timeout: 500ms
                    update_interval: 1m
        split_queries_by_interval: 30m
    ruler:
        alertmanager_url: http://alertmanager.monitoring.svc:9093
        ring:
            kvstore:
                consul:
                    host: consul-server.consul.svc:8500
                prefix: loki/rulers/
        storage:
            gcs:
                bucket_name: ...
    schema_config:
        configs:
          - from: "2020-10-24"
            index:
                period: 24h
                prefix: loki_index_
            object_store: gcs
            schema: v11
            store: boltdb-shipper
    server:
        graceful_shutdown_timeout: 5s
        grpc_server_max_concurrent_streams: 1000
        grpc_server_max_recv_msg_size: 1.048576e+08
        grpc_server_max_send_msg_size: 1.048576e+08
        http_listen_port: 3100
        http_server_idle_timeout: 120s
        http_server_write_timeout: 1m
    storage_config:
        boltdb_shipper:
            shared_store: gcs
        gcs:
            bucket_name: ...
        index_queries_cache_config:
            memcached:
                batch_size: 1024
                parallelism: 100
            memcached_client:
                consistent_hash: true
                host: ...
                service: memcached-client
    table_manager:
        creation_grace_period: 3h
        poll_interval: 10m
        retention_deletes_enabled: false
        retention_period: 0
kind: ConfigMap
metadata:
  name: loki
  namespace: loki

Everything else is working fine; it only crashes with the message I posted earlier when the ruler is enabled.

This is good information to know, thank you, because the error you posted is not one I would expect to see for a misconfigured ruler config… but here we are :slight_smile:

The only thing you are missing that would be worth trying is adding rule_path: to your ruler config:

      ruler:
        alertmanager_url: http://alertmanager.monitoring.svc:9093
        ring:
            kvstore:
                consul:
                    host: consul-server.consul.svc:8500
                prefix: loki/rulers/
        storage:
            gcs:
                bucket_name: ...
        rule_path: /tmp/loki/rules-temp
        enable_api: true

Loki needs a temporary directory for evaluating rules; it does not need to be persisted.

enable_api is only necessary if you would like to interact with your rules via the API; I added it here to note that it is not enabled by default currently.
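For reference, a rule group stored in the ruler's bucket (or pushed via the API once enable_api is set) is a standard Prometheus-style group with a LogQL expression. This is only an illustrative sketch; the group name, selector, and labels below are made up:

```yaml
# Hypothetical rule group file; all names and the LogQL selector are illustrative.
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorLogRate
        expr: sum(rate({app="myapp"} |= "error" [5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: High error log rate for myapp
```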

My bad earlier: the paste was with the ruler disabled, so it did not have all the entries.

apiVersion: v1
data:
  config.yaml: |
    chunk_store_config:
        chunk_cache_config:
            memcached:
                batch_size: 1024
                parallelism: 100
            memcached_client:
                consistent_hash: true
                host: memcached.loki.svc.cluster.local
                service: memcached-client
        max_look_back_period: 0
    distributor:
        ring:
            kvstore:
                consul:
                    consistent_reads: false
                    host: consul-server.consul.svc:8500
                    http_client_timeout: 20s
                    watch_burst_size: 1
                    watch_rate_limit: 1
                prefix: loki/collectors/
                store: consul
    frontend:
        compress_responses: true
        log_queries_longer_than: 10s
        max_outstanding_per_tenant: 4800
    frontend_worker:
        frontend_address: query-frontend.loki.svc.cluster.local:9095
        grpc_client_config:
            max_send_msg_size: 1.048576e+08
        parallelism: 2
    ingester:
        chunk_block_size: 262144
        chunk_idle_period: 15m
        lifecycler:
            heartbeat_period: 5s
            interface_names:
              - eth0
            join_after: 30s
            num_tokens: 512
            ring:
                heartbeat_timeout: 1m
                kvstore:
                    consul:
                        consistent_reads: true
                        host: consul-server.consul.svc:8500
                        http_client_timeout: 20s
                    prefix: loki/collectors/
                    store: consul
                replication_factor: 3
        max_transfer_retries: 60
    ingester_client:
        grpc_client_config:
            max_recv_msg_size: 6.7108864e+07
        pool_config:
            health_check_ingesters: true
        remote_timeout: 1s
    limits_config:
        enforce_metric_name: false
        ingestion_burst_size_mb: 30
        ingestion_rate_mb: 25
        ingestion_rate_strategy: global
        max_cache_freshness_per_query: 10m
        max_global_streams_per_user: 20000
        max_query_length: 12000h
        max_query_parallelism: 16
        max_streams_per_user: 0
        reject_old_samples: true
        reject_old_samples_max_age: 168h
    querier:
        query_ingesters_within: 2h
    query_range:
        align_queries_with_step: true
        cache_results: true
        max_retries: 5
        parallelise_shardable_queries: true
        results_cache:
            cache:
                memcached_client:
                    consistent_hash: true
                    host: memcached-frontend.loki.svc.cluster.local
                    max_idle_conns: 64
                    service: memcached-client
                    timeout: 500ms
                    update_interval: 1m
        split_queries_by_interval: 30m
    ruler:
        alertmanager_url: http://alertmanager.monitoring.svc:9093
        enable_alertmanager_v2: true
        enable_api: true
        enable_sharding: true
        ring:
            kvstore:
                consul:
                    host: consul-server.consul.svc:8500
                prefix: loki/rulers/
                store: consul
        rule_path: /tmp/rules
        storage:
            gcs:
                bucket_name: <ruler bucket>
            type: gcs
    schema_config:
        configs:
          - from: "2020-10-24"
            index:
                period: 24h
                prefix: loki_index_
            object_store: gcs
            schema: v11
            store: boltdb-shipper
    server:
        graceful_shutdown_timeout: 5s
        grpc_server_max_concurrent_streams: 1000
        grpc_server_max_recv_msg_size: 1.048576e+08
        grpc_server_max_send_msg_size: 1.048576e+08
        http_listen_port: 3100
        http_server_idle_timeout: 120s
        http_server_write_timeout: 1m
    storage_config:
        boltdb_shipper:
            shared_store: gcs
        gcs:
            bucket_name: <storage bucket>
        index_queries_cache_config:
            memcached:
                batch_size: 1024
                parallelism: 100
            memcached_client:
                consistent_hash: true
                host: memcached-index-queries.loki.svc.cluster.local
                service: memcached-client
    table_manager:
        creation_grace_period: 3h
        poll_interval: 10m
        retention_deletes_enabled: false
        retention_period: 0
kind: ConfigMap
metadata:
  name: loki
  namespace: loki

The ruler still crashes with the same error as before. From a rough reading of the code, I suspect it is failing while loading the chunk store config.

level=error ts=2020-12-10T23:15:12.966824939Z caller=log.go:149 msg="error running loki" err="mkdir : no such file or directory
error creating index client
github.com/cortexproject/cortex/pkg/chunk/storage.NewStore
	/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:176
github.com/grafana/loki/pkg/loki.(*Loki).initStore
	/src/loki/pkg/loki/modules.go:287
github.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule
	/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:103
github.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices
	/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75
github.com/grafana/loki/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:204
main.main
	/src/loki/cmd/loki/main.go:130
runtime.main
	/usr/local/go/src/runtime/proc.go:203
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373
error initialising module: store
github.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule
	/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:105
github.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices
	/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75
github.com/grafana/loki/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:204
main.main
	/src/loki/cmd/loki/main.go:130
runtime.main
	/usr/local/go/src/runtime/proc.go:203
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373"

I am also assuming that the chunk store bucket and the rules bucket are meant to be different. Also, since we are using workload identity, the ruler does not have any permissions on the chunks bucket. Let me try tweaking the permissions a bit.

I think I figured it out.

It isn’t the permissions. When using boltdb-shipper, the ruler does not set the boltdb.shipper.active-index-directory or boltdb.shipper.cache-location arguments, which the querier and ingester set up and mount to their PVCs.
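For anyone setting these in the config file rather than as CLI arguments, the boltdb_shipper block accepts the same settings. A minimal sketch, with illustrative directory paths:

```yaml
# Config-file equivalents of the boltdb.shipper.* CLI flags; paths are illustrative.
storage_config:
    boltdb_shipper:
        active_index_directory: /data/boltdb-index
        cache_location: /data/boltdb-cache
        shared_store: gcs
```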

The question I now have is: should I configure the ruler with a PVC and set a cache location like the querier?

I am assuming https://github.com/grafana/loki/commit/dcbfecf9e549f264e5c16b1eefbe1b4071e508c1 might also be required.


Rajat

I have it running after:

  1. patching the ruler args to set boltdb.shipper.cache-location to /data/boltdb-cache
  2. using the latest build image grafana/loki:master-3f99a07
  3. mounting an emptyDir volume into the container at /data

Though I am a little uncertain about stability and whether the emptyDir mount is valid to use.
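Sketched as a Kubernetes manifest fragment, the three steps above look roughly like this. The container and volume names are assumptions for illustration, not the actual output of the Tanka library:

```yaml
# Illustrative Deployment spec fragment; names are assumed, not taken from the Tanka/Jsonnet library.
spec:
  template:
    spec:
      containers:
        - name: ruler
          image: grafana/loki:master-3f99a07
          args:
            - -config.file=/etc/loki/config.yaml
            - -target=ruler
            - -boltdb.shipper.cache-location=/data/boltdb-cache
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          emptyDir: {}   # scratch space only; the boltdb cache does not need to survive restarts
```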

If you want, I can create an issue on GitHub to help track it.

Thanks
Rajat

Thanks so much for all the follow up @rajatvig, extremely helpful.

If you don’t mind, creating an issue here would be very helpful; there is work we need to do to improve this.

Created https://github.com/grafana/loki/issues/3076

Can someone guide me: if we are using the default rule_path ([rule_path: <filename> | default = "/rules"]), does Loki still look for a tenant ID? I am using the Helm chart, and my target revision is 2.4.1.
More details are in this GitLab issue comment.
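Not an authoritative answer, but the ruler does resolve rules per tenant: with auth_enabled: false, Loki uses the synthetic tenant ID fake. For local rule storage that typically means a per-tenant subdirectory is expected, roughly like this (the base path and file name below are illustrative):

```
/etc/loki/rules              # local rules directory (path is illustrative)
└── fake                     # tenant ID; "fake" when auth_enabled: false
    └── rules.yaml           # a standard rule group file
```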