I’m planning a potential architecture for our monitoring & logging stack. Grafana, Prometheus & Alertmanager are a given; I’m still pondering Thanos & Loki.
About Loki, I’m wondering what the recommended architecture is for multiple heterogeneous environments. We have some bare metal at Hetzner, plus EKS, EC2 & serverless at AWS, … I found nothing in the docs, only bits of discussion here from 2020 hinting at ‘one central Loki, all agents shipping there’. I was also considering ‘one Grafana, multiple Loki data sources through VPNs / VPC peering’.
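To make the ‘one central Loki, all agents shipping there’ option a bit more concrete, here is a rough sketch of the JSON body an agent would send to Loki’s push endpoint (`/loki/api/v1/push`), with a label telling the environments apart. The function name `build_push_payload` and the label names `env` and `job` are made up for illustration; a real shipping agent builds this for you, and nothing here actually sends a request.

```python
import json
import time


def build_push_payload(env, job, lines, now_ns=None):
    """Build a JSON-serializable body for Loki's push endpoint.

    `env` and `job` are hypothetical label names chosen for this example.
    """
    ts = time.time_ns() if now_ns is None else now_ns
    return {
        "streams": [
            {
                # The label set identifies the stream; "env" separates environments.
                "stream": {"env": env, "job": job},
                # Each value is [timestamp in nanoseconds as a string, log line].
                "values": [[str(ts + i), line] for i, line in enumerate(lines)],
            }
        ]
    }


payload = build_push_payload(
    "hetzner-metal",
    "nginx",
    ["GET / 200", "GET /health 200"],
    now_ns=1_700_000_000_000_000_000,  # fixed timestamp so the output is repeatable
)
print(json.dumps(payload, indent=2))
```

With a label like this, a single Grafana data source can still filter per environment with a selector such as `{env="hetzner-metal"}`.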
The only dependency is an object store, which for me is S3. I plan on having one central Loki per “environment”, which for me means one for staging and one for production, just so I can test upgrades etc. in the staging cluster without breaking production.
In production we will probably ingest 10k-30k log lines per second once we are up and running, which does not seem to be an issue even for my proof-of-concept Loki setup with pretty low resources. That is for ingestion. Querying that data has been more challenging, but I’m slowly finding ways of optimizing that as well.
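For a sense of scale, here is a back-of-the-envelope estimate of what 10k-30k lines per second means in raw volume. The average line size of 200 bytes is purely an assumption for the example (measure your own logs), and the numbers are uncompressed, so what actually lands in S3 should be smaller.

```python
def ingest_estimate(lines_per_sec, avg_line_bytes):
    """Return (MB per second, GB per day) of uncompressed log volume."""
    bytes_per_sec = lines_per_sec * avg_line_bytes
    mb_per_sec = bytes_per_sec / 1_000_000
    gb_per_day = bytes_per_sec * 86_400 / 1_000_000_000
    return mb_per_sec, gb_per_day


AVG_LINE_BYTES = 200  # assumption for illustration, not a measured value

for rate in (10_000, 30_000):
    mb_s, gb_day = ingest_estimate(rate, AVG_LINE_BYTES)
    print(f"{rate} lines/s -> {mb_s:.1f} MB/s, {gb_day:.1f} GB/day (uncompressed)")
```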
Testing upgrades is indeed a concern. Here’s my plan for those, tell me if I missed something: since Loki upgrades are unlikely to affect the data (the logs themselves), I can duplicate Loki onto a lab cluster using Velero (our k8s resource backup & duplication tool), run the upgrade there, check that it looks OK, then apply it to production.
Noted for ingestion. About querying, what were your challenges, and your solutions, if you can explain them quickly? I intend to use the distributed Helm chart, which should allow us to tune the resources & scale out the bottlenecks.
My load testing created very few streams, so querying that test data was almost impossible because the responses were massive.
From what I have learned so far: try to find a good balance of streams for segmenting your data (not too many, but also definitely not too few). We also run Loki in microservices mode. Querying speeds up when you add querier nodes (not surprising). There are plenty of posts about improving query performance. Still figuring this out…
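To illustrate the stream balance point: in Loki, every unique combination of label names and values is its own stream, so you can get a feel for your segmentation by counting distinct label sets in a sample of your logs. This is only a small sketch; the label names (`env`, `app`) and the sample data are made up.

```python
from collections import Counter


def count_streams(entries):
    """Count log lines per stream, where a stream is one unique label set."""
    streams = Counter()
    for labels, _line in entries:
        # Sort the label pairs so the same labels in any order give the same key.
        key = tuple(sorted(labels.items()))
        streams[key] += 1
    return streams


# Hypothetical sample: (labels, log line) pairs.
sample = [
    ({"env": "prod", "app": "api"}, "request ok"),
    ({"env": "prod", "app": "api"}, "request failed"),
    ({"env": "prod", "app": "worker"}, "job done"),
    ({"env": "staging", "app": "api"}, "request ok"),
]

streams = count_streams(sample)
print(len(streams), "streams")
for key, n in streams.most_common():
    print(dict(key), n)
```

If nearly all lines end up in a handful of streams, queries have to scan huge streams (which is what my load test ran into). If a label like a request ID is added, the stream count explodes instead.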