We have ingested our nginx logs into Loki so we can analyze the request time. With small volumes everything works as expected, but with larger volumes things time out. Initially we hit an error about “too large series” or something similar, but after we bumped that limit we now get timeouts instead.
What we find strange is that even if we change the time range to something really small, we still have problems. The range we’ve tested only yields around 500 rows, but the following query still fails:
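The query is along these lines, count applied on top of rate (the label names and the unwrapped field are simplified here, not our exact selector):

    count(rate({app="nginx"} | json | unwrap request_time [1m]))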
It feels like we are doing something wrong to be hitting this. I would expect it to be possible to get a result for 24 hours, which would be around 1.5-2M log entries… but maybe my expectations are too high?
There are some other posts on this forum regarding optimization that might be worth a read. Also, some information on what your existing setup looks like would help.
The setup is Loki running in a Kubernetes cluster; I don’t have the config available but can get it if needed. We are ingesting the HTTP access log from an nginx app using Promtail, and that works fine. Querying the logs also works fine, even over the full set of 24 hours. The problem is when I try to add rate and count.
I really don’t understand why, since even when we limit the time range so that it includes only 500 entries or so without count, it will still fail when applying count on those 500 rows. It feels like we’ve missed something conceptually about how Loki works when querying the data.
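For reference, a plain log query like this (again with simplified labels) returns fine, even over the full 24 hours:

    {app="nginx"}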
First I’d try to narrow down why it fails to return even a small set of data. Logs from the querier should hopefully provide some insight. Since querying logs directly works fine, I’d guess it’s likely not related to resources.
Also, your query looks incorrect. You’d do either count or rate, not both.
I think you can do count and rate at the same time? Doesn’t rate split the logs into “batches”, and then count counts within those batches? At least that is the idea. The rate option was also something I got from the query builder when I selected count.
I’ve tried to narrow it down, but I’ve failed to find the real cause… that’s why I’m posting here, since I’m not sure how to go about debugging this behavior.
While I don’t think that’s the reason it’s failing, I also don’t think that’s what you are looking for.
From the docs:
rate(unwrapped-range): calculates per second rate of the sum of all values in the specified interval.
This will return a number of series, depending on how many streams your selector resolves to. If you add count() on top, you are really counting the number of “series” returned by rate, not the number of logs.
If you are looking to count the number of logs, I’d recommend checking out the count_over_time function (see the example below). Also, try running your query one level at a time, and check the querier / query frontend (if you have one) to see what errors you get.
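For example, to count log lines per interval you’d do something like this (placeholder labels):

    sum(count_over_time({app="nginx"} [5m]))

whereas putting count on top of rate, e.g.

    count(rate({app="nginx"} | json | unwrap request_time [5m]))

counts how many series rate produced, not how many log lines there were.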
The way you describe count vs count_over_time makes sense. However, it doesn’t completely match what I see in the results. If count with rate is counting the number of series generated by rate, shouldn’t that yield a single number for the whole selected range? We actually get results that we can plot in the graph.
If you do count on top of rate you should get a single number over the range, unless the number of series changes during the range (which can happen if your label selector isn’t exhaustive).
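If you want a single stable series regardless of how many streams come and go, you can aggregate explicitly instead, for example (placeholder labels again):

    sum(rate({app="nginx"} | json | unwrap request_time [5m]))

which sums the per-second rate across all matching streams into one series.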