Hello Loki Team,
I am currently running a Loki setup and I am seeking guidance on how to improve its performance. Here are the details of my system configuration, setup, and the issue I am experiencing:
System Configuration:
- Server: 16 GB RAM, 16 core CPU
- Dell ECS S3-compatible object store: 1 TB bucket
- Loki Version: 2.8
Services: I am running Loki, Promtail, and Grafana in a single Docker Compose setup.
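For context, the Compose file is the usual three-service layout; the sketch below is a simplified version of it, and the image tags, ports, and mount paths are placeholders rather than my exact values.

version: "3"
services:
  loki:
    image: grafana/loki:2.8.0                     # placeholder tag
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./loki-config.yaml:/etc/loki/loki-config.yaml
    ports:
      - "3100:3100"
  promtail:
    image: grafana/promtail:2.8.0                 # placeholder tag
    command: -config.file=/etc/promtail/promtail-config.yaml
    volumes:
      - ./promtail-config.yaml:/etc/promtail/promtail-config.yaml
  grafana:
    image: grafana/grafana:latest                 # placeholder tag
    ports:
      - "3000:3000"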
Settings: Promtail is configured to transfer logs from Kafka to Loki, with a log rate of approximately 850,000 logs per minute.
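The Promtail side uses the built-in Kafka scrape config; the broker address, topic name, and labels below are placeholders, but the shape matches what I run.

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kafka
    kafka:
      brokers:
        - kafka-broker:9092        # placeholder broker
      topics:
        - app-logs                 # placeholder topic
      group_id: promtail
      labels:
        job: kafka                 # static label attached to every entry
    relabel_configs:
      - source_labels: [__meta_kafka_topic]
        target_label: topic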
Issue: Querying the logs through Grafana is slow. A query over the last hour already takes a noticeable amount of time, and a query over the last 30 days takes around 5 minutes to return just 1,000 log lines, often ending in a timeout error. While the query runs, CPU usage spikes to 99% across all 16 cores and then drops back once the query finishes or fails.
The logs are about 500 MB compressed and 2.7 GB uncompressed. I have also come across the TSDB (time series database) index, but I am uncertain how to incorporate it into my setup or whether it would require an additional server.
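From what I have read, TSDB seems to be a newer index type that can be enabled by adding a new period to schema_config, so I was considering something along these lines (the cut-over date, directories, and index prefix are placeholders I made up); is this the right approach, and can it run on the same single-binary instance?

schema_config:
  configs:
    # existing boltdb-shipper period stays as-is
    - from: 2023-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
    # new period starting on a future date, switching the index to TSDB
    - from: 2023-10-01
      store: tsdb
      object_store: s3
      schema: v12
      index:
        prefix: tsdb_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /etc/loki/tsdb-index
    cache_location: /etc/loki/tsdb-cache
    shared_store: s3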
Objective: I am looking for advice on tuning my Loki setup so it can handle heavy queries quickly and efficiently.
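For the heavy queries specifically, I have seen query splitting, parallelism, and results caching mentioned elsewhere; would tuning values along these lines (the numbers are guesses on my part) be a sensible direction for a single-binary deployment?

limits_config:
  split_queries_by_interval: 30m    # break long time ranges into smaller sub-queries
  max_query_parallelism: 16         # cap concurrent sub-queries per query
query_range:
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 500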
My current Loki config is below:
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /etc/loki/
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
  storage:
    s3:
      endpoint: https://ecs.server.net
      insecure: false
      bucketnames: clickhouse_test_bucket
      access_key_id: clickhouse_test_user
      secret_access_key: secret-key

schema_config:
  configs:
    - from: 2023-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /etc/loki/index
    cache_location: /etc/loki/index_cache
    shared_store: s3
  aws:
    endpoint: https://ecs.server.net
    insecure: false
    bucketnames: clickhouse_test_bucket
    access_key_id: clickhouse_test_user
    secret_access_key: secret-key
    s3forcepathstyle: true

compactor:
  working_directory: /etc/loki/
  shared_store: s3
  compaction_interval: 5m

ruler:
  storage:
    s3:
      bucketnames: clickhouse_test_bucket

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 48h
  max_global_streams_per_user: 10000
  max_entries_limit_per_query: 50000
  ingestion_rate_mb: 4190
  ingestion_rate_strategy: global
  max_line_size: 100000
  query_timeout: 5m

frontend:
  log_queries_longer_than: 5m
  max_body_size: 1048576
  query_stats_enabled: false
  max_outstanding_per_tenant: 100
  querier_forget_delay: 0s
  scheduler_address: ""
  scheduler_dns_lookup_period: 10s
  scheduler_worker_concurrency: 5

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 3m
  chunk_retain_period: 30s
  max_transfer_retries: 0

chunk_store_config:
  max_look_back_period: 0s

querier:
  engine:
    timeout: 5m
I have only been working with Loki for about three months, so I am still relatively new to it, and I would greatly appreciate any suggestions on how to tune the service to handle heavy queries and improve query speed. Thank you in advance for your help!