Hi,
From the access logs received from our load balancers, I would like to aggregate per IP and see, for example, which IP generates the most traffic.
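To make the goal concrete, here is a minimal sketch of the kind of query I have in mind, built in Python. The stream selector `{job="loadbalancer"}`, the `client_ip` field, the `json` parser stage and the Loki address are all assumptions for illustration, not our real names. It counts log lines per IP, so "most traffic" here means most requests, not most bytes.

```python
from urllib.parse import urlencode

# Hypothetical values: replace with your own address, labels and field names.
LOKI_URL = "http://loki-query-frontend:3100"   # assumed address
STREAM_SELECTOR = '{job="loadbalancer"}'       # assumed stream label
IP_FIELD = "client_ip"                         # assumed field extracted by `| json`


def top_ips_query(selector: str, ip_field: str, window: str = "5m", k: int = 10) -> str:
    """Build a LogQL query returning the k IPs with the most log lines."""
    return (
        f"topk({k}, sum by ({ip_field}) "
        f"(count_over_time({selector} | json [{window}])))"
    )


def instant_query_url(base_url: str, query: str) -> str:
    """Build the URL for Loki's instant query endpoint."""
    return f"{base_url}/loki/api/v1/query?" + urlencode({"query": query})


query = top_ips_query(STREAM_SELECTOR, IP_FIELD)
print(query)
print(instant_query_url(LOKI_URL, query))
```

Because the IP is extracted at query time rather than stored as a label, every line in the window has to be read and parsed, which is probably where the cost comes from.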
I can’t make it work currently because of the high number of lines stored. It works on a staging cluster, but staging traffic is … nearly nothing.
As an example, an instant query run by the ruler can take up to 30s. This is a recording rule that creates a metric on status_code or other non-dynamic labels.
I’m not sure whether this is possible at all, or maybe only with a lot more replicas and by splitting the query into smaller periods …
For the moment we are not using Loki for all aggregation requests, but we would love to be able to.
There may be something missing in the configuration, or some improvements that can be made.
We are using the community Helm charts.
Config part:
server:
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 2000
  http_server_read_timeout: 5m
  http_server_write_timeout: 5m
query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        consistent_hash: true
        host: {{ include "loki.memcachedFrontendFullname" . }}
        max_idle_conns: 16
        service: http
        timeout: 1s
        update_interval: 1m
frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100
querier:
  query_timeout: 4m
  max_concurrent: 6
  engine:
    timeout: 4m
frontend_worker:
  frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095
  grpc_client_config:
    max_send_msg_size: 1.048576e+08
  parallelism: 6
querier replicas + memcached:
querier:
  replicas: 6
  resources:
    limits:
      memory: 20Gi
    requests:
      cpu: 2
      memory: 20Gi
queryFrontend:
  replicas: 3
  resources:
    limits:
      memory: 5Gi
    requests:
      cpu: 1
      memory: 5Gi
memcachedChunks:
  replicas: 9
  extraArgs:
    - -m 29000
    - -I 2m
    - -v
  resources:
    requests:
      memory: 30Gi
    limits:
      memory: 30Gi
The current result is that the querier replicas get OOM-killed and restart after some time (with no results in Grafana) for a query with a time range of 5m.
I'm not sure whether reducing split_queries_by_interval or adding more querier replicas could help …
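To reason about the split setting, here is a rough Python model of cutting a query range into sub-ranges of at most one interval each. It is a simplified sketch only: it assumes plain consecutive windows starting at the query start, and does not try to reproduce how Loki actually aligns or schedules sub-queries. The function name and the dates are arbitrary.

```python
from datetime import datetime, timedelta


def split_by_interval(start: datetime, end: datetime, interval: timedelta):
    """Cut [start, end) into consecutive windows of at most `interval`.

    Simplified model: windows start at `start`, with no boundary alignment.
    """
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + interval, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


start = datetime(2024, 1, 1, 12, 0)  # arbitrary example timestamp
split = timedelta(minutes=15)        # split_queries_by_interval from the config

five_min = split_by_interval(start, start + timedelta(minutes=5), split)
one_hour = split_by_interval(start, start + timedelta(hours=1), split)

print(len(five_min))  # 1
print(len(one_hour))  # 4
```

If this model is close to reality, the failing 5m query is shorter than the 15m interval and stays in one piece, so a single querier would have to process it alone. That would suggest trying a split interval smaller than the query range before adding replicas, but I have not confirmed this.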