Queries Process Too Many Bytes

I used LogCLI to run a 1-hour metric query, and the logcli --stats output shows that the query processed 5.0GB of uncompressed bytes.

Store.DecompressedBytes: 5.0GB

This doesn’t make sense: my cluster as a whole ingests around 300KB per second, so even if the query pulled every single chunk ingested in the 1-hour time range I’m querying, it would only need to process about 1.08GB (300KB/s × 3600s), not to mention that I’m filtering the logs by the filename label.
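
For context, the metric query was along these lines; the selector and expression here are illustrative placeholders rather than my exact query:

  logcli query --stats --since=1h 'count_over_time({filename="/var/log/app.log"}[5m])'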

Creating Grafana panels from Loki metric queries is unusable without caching; each query takes forever. I think a good place to start tuning is understanding why each query processes so much data.

I’m using boltdb-shipper as the index store and S3 as the chunk store. I thought there might be a problem with how the index files are saved, that they could be pointing to larger-than-desired sets of data, so I added a compactor instance in an attempt to prevent duplicate chunks from being loaded by queries. This did help: the same query now reports processing 3.6GB of uncompressed bytes.

Store.DecompressedBytes: 3.6GB

Any idea how exactly chunks are loaded by the queriers? Or general thoughts on how to decrease query times?

We are running Loki on VMs: 2 query frontends, 3 queriers (with 12 CPUs each), 4 ingesters, 2 distributors, and 1 compactor.

auth_enabled: false

server:
  graceful_shutdown_timeout: 5s
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_listen_port: 3100
  http_server_idle_timeout: 120s
  http_server_write_timeout: 1m

distributor:
  ring:
    kvstore:
      store: memberlist

ingester:
  chunk_encoding: snappy
  chunk_block_size: 262144
  chunk_target_size: 4000000
  chunk_idle_period: 15m
  lifecycler:
    heartbeat_period: 5s
    join_after: 30s
    num_tokens: 512
    ring:
      heartbeat_timeout: 1m
      kvstore:
        store: memberlist
      replication_factor: 3
    final_sleep: 0s
  max_transfer_retries: 60

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 67108864
  remote_timeout: 1s

frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  max_outstanding_per_tenant: 1024

frontend_worker:
  frontend_address: frontend:9096
  grpc_client_config:
    max_send_msg_size: 104857600
  parallelism: 12

limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 10
  ingestion_rate_mb: 5
  ingestion_rate_strategy: local
  max_cache_freshness_per_query: 10m
  max_global_streams_per_user: 10000
  max_query_length: 12000h
  max_query_parallelism: 256
  max_streams_per_user: 0
  reject_old_samples: true
  reject_old_samples_max_age: 168h

querier:
  query_ingesters_within: 2h

query_range:
  align_queries_with_step: true
  max_retries: 5 
  parallelise_shardable_queries: true
  split_queries_by_interval: 30m

schema_config:
  configs:
    - from: 2020-06-10
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: test_
        period: 24h

compactor:
  working_directory: /opt/loki/compactor
  shared_store: s3

storage_config:
  aws:
    bucketnames: bucket_names
    endpoint: endpoint
    region: region
    access_key_id: mysecret_key_id
    secret_access_key: mysecret_access_key
    http_config:
      idle_conn_timeout: 90s
      response_header_timeout: 0s
      insecure_skip_verify: true
    s3forcepathstyle: true
  boltdb-shipper:
    active_index_directory: /opt/loki/boltdb-shipper-active
    cache_location: /opt/loki/boltdb-shipper-cache
    shared_store: s3
 
memberlist:
  abort_if_cluster_join_fails: false
  bind_addr:
    - the_bind_ip_address
  bind_port: 7946
  join_members:
    - ip_address:7946
    - of_all_the_loki_components:7946

Hi There,

I think I am in the same boat for now. As you mentioned, you are running the Loki cluster on VMs; is that a local build, or a K8s/dockerized environment?

1 Like

We are running distributed Loki directly on the VMs (executing a binary on each).

We solved the slow query times in a few steps.

  1. We were saving logs in an on-premise object store that speaks the S3 API. However, we were only getting around 30MB/s of download speed (that is simply what our hardware can deliver), which isn’t fast enough to run metric queries quickly, so we switched to saving data in a Cassandra cluster we deployed. Cassandra stores its data on local disk, which gave us much better performance (a rough sketch of that kind of storage change is shown after this list).

  2. This step relates directly to the question I posted. Loki builds up log data in memory inside chunks, and each chunk belongs to a single log stream (a specific combination of labels). At query time Loki pulls the entire log stream, which leads to it processing more data than is relevant to the query. The solution was to use a regex in Promtail to extract fields at send time, producing more specific log streams (less data processed) and reducing the processing work that has to be done at query time (see the Promtail sketch after this list). The reason we didn’t do this from the start is that Loki recommends using static labels as much as possible, since extracting a field like an IP address can create a huge number of log streams in memory, leading to a big index and cluster-wide long query times. In our use case, however, extracting a few very specific fields on very specific targets didn’t have a negative effect.
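
To give an idea of step 1, here is one possible shape of a Cassandra-backed storage_config and schema_config; the addresses, keyspace, dates, and index period are placeholders/assumptions on my part, not our exact setup:

storage_config:
  cassandra:
    # comma-separated list of Cassandra nodes (placeholders)
    addresses: cassandra1,cassandra2,cassandra3
    keyspace: loki

schema_config:
  configs:
    - from: 2021-01-01
      store: cassandra
      object_store: cassandra
      schema: v11
      index:
        prefix: loki_index_
        period: 168h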
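
And for step 2, a minimal Promtail scrape_configs sketch showing the idea: a regex pipeline stage extracts http_method from the log line and a labels stage promotes it to a label. The job name, path, and regex are placeholders, not our production values:

scrape_configs:
  - job_name: haproxy
    static_configs:
      - targets:
          - localhost
        labels:
          job: haproxy
          host: lb1
          __path__: /var/log/haproxy.log
    pipeline_stages:
      # pull the HTTP verb out of each log line into a named capture group
      - regex:
          expression: '(?P<http_method>GET|POST|PUT|DELETE|HEAD|OPTIONS|PATCH)'
      # promote the extracted field to a Loki label
      - labels:
          http_method: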

If you are a heavy dashboard user and are looking to use Loki mostly for that, I would recommend using the ELK stack instead. At least in my experience, Loki can’t compete with ELK’s speed when it comes to metric queries.
The reason we went with Loki is that we are already heavily invested in Prometheus, so most of our observability dashboards use metrics from Prometheus, while Loki is mostly used in the Explore tab in Grafana.

Hope this helps.

Thank you so much for the details.

I have a few points of confusion, if you do not mind checking them out:

  1. Per the 2nd point in your answer, does that mean that the more labels created in the config file, the more chunks will be created in the stores, since there are more label combinations?
  2. I am also running the local build on a VM by following https://github.com/grafana/loki/tree/main/production/docker; the querier, ingester, and distributor are running in the same Loki instance on one VM. How do you configure the number of each of them, e.g. 3 queriers, 4 ingesters (replication factor?), and 2 distributors?
  3. Well, we are still in the research stage; ELK will be the alternative if Loki doesn’t work out as expected. Thanks for the suggestions.
1 Like
  1. Yes, exactly! Logs from a single host forwarding 2 files to Loki under 1 job name will be sent to 2 chunks in the ingesters: {job="job_name", instance="instance_name", filename="filename1"} and {job="job_name", instance="instance_name", filename="filename2"}.
    We were collecting logs from an HTTP load balancer, and the relevant VMs sending logs only had the following labels: job, host, and filename (all static). Due to poor performance we used a regex in Promtail to extract the http_method field, which had 2 effects. We were able to query logs with a specific http_method, which reduced the amount of data processed by a query (we queried only GET requests or only POST requests), and the more obvious benefit is that we no longer had to extract the field at query time, which decreased query execution time (see the query sketch after this list).
    It is important to keep in mind that more chunks equals a larger index, which leads to slower cluster-wide query times, so you have to be careful about which fields you extract and how often you do so. Please refer to this article, which explains it better than I can: Best practices | Grafana Labs

  2. I think I misunderstood what you meant by local build. We have a dedicated VM for every Loki component: 3 ingester VMs, 3 querier VMs, 2 query-frontend VMs, 2 distributor VMs, and 1 table-manager VM. We don’t have access to a k8s cluster to deploy Loki, so on every VM we run a Linux service which executes the Loki binary with the proper target flag (see the command example after this list).

  3. I wouldn’t recommend using Loki without Prometheus; the main benefit of Loki for us was the seamless integration with the Prometheus ecosystem. Prometheus is an amazing piece of software, and I would definitely recommend checking it out if you haven’t already.
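
To make the first point concrete, here is roughly what the extraction means for a metric query; the label names and values are illustrative placeholders:

# before: every line in the stream has to be pulled and filtered at query time
count_over_time({job="haproxy"} |= "GET" [5m])

# after: the http_method label narrows the streams before chunks are read
count_over_time({job="haproxy", http_method="GET"}[5m])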
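
Regarding point 2, each of those services boils down to running something along these lines (binary and config paths are placeholders):

  /usr/local/bin/loki -config.file=/etc/loki/config.yml -target=querier
  # and likewise -target=ingester, -target=distributor, -target=query-frontend on the other VMs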

1 Like

Thanks.

Regarding the 2nd point, I deployed a new environment with 2 query-frontend VMs as well. But I have one question: the 2 frontends are running on different VMs, so how should I set frontend_address in the config file?

1 Like

This was actually a choke point for us.

Since we aren’t using k8s, we had to deploy another VM that acts as a load balancer (and a second one for failover); we use HAProxy.

We use 1 load balancer to distribute traffic from Promtail to the distributors and to the query frontends. You can configure 1 HAProxy instance to route traffic to 2 backend groups according to the URL that was used (a rough sketch is below).

While we use round robin as the load-balancing method for our distributors, we distribute traffic to our frontends in an active/passive way (only 1 is active). No part of the documentation I could find supports this, so we might be wrong, but with round robin between the 2 frontends we found that a single query used only half of the queriers’ available cores, and the connected-workers metric on each frontend showed that half of the cluster’s querier cores were connected to each frontend. Since we have a fairly small deployment we decided we want every core working on every query, so we went the active/passive way. I guess the case is different on larger deployments.
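
For reference, a minimal HAProxy sketch of that kind of setup; the names, addresses, and the push-path ACL are assumptions on my part, not our exact configuration:

defaults
    mode http
    timeout connect 5s
    timeout client  1m
    timeout server  1m

frontend loki_in
    bind *:3100
    # assume writes arrive on the push API path and everything else is a query
    acl is_push path_beg /loki/api/v1/push
    use_backend loki_distributors if is_push
    default_backend loki_query_frontends

backend loki_distributors
    balance roundrobin
    server distributor1 10.0.0.11:3100 check
    server distributor2 10.0.0.12:3100 check

backend loki_query_frontends
    # active/passive: the second frontend only takes traffic if the first is down
    server query_frontend1 10.0.0.21:3100 check
    server query_frontend2 10.0.0.22:3100 check backup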

1 Like

Thanks for the reply.

I tried load balancing with nginx for the 2 frontends running on 2 VMs. I did not use downstream_url but frontend_worker, since there is a discussion in another topic suggesting that this approach brings more parallelism.
However, the configuration is not as simple as expected. As you know, frontend_worker should point to the gRPC port of the frontend, so what about port 3100 for HTTP queries?

I posted this in another topic, https://localhost:3000/t/how-to-configure-nginx-for-frontend-worker/52306; your suggestions are highly welcome.

Btw, I set up memberlist for the ingesters and distributors and configured nginx for them as well; that looks fine for now.

1 Like

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.