Recommended architecture for multiple environments

rrrrrrr · February 14, 2022, 2:01pm

Hi there

I’m planning a potential architecture for our monitoring & logs stack. Grafana, Prometheus & AlertManager are a given, pondering Thanos & Loki atm.

About Loki I’m wondering what’s the recommended architecture for multiple heterogeneous environments. We’ve some metal at Hetzner, EKS & EC2 & serverless @ AWS, … I found nothing in the doc, only bits of discussion here from 2020 hinting at ‘one central Loki, all agents shipping there’. I was also considering ‘one Grafana, multiple Loki data sources through VPNs / VPC peering’.

What’s the wiser solution for 2022?
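To make the ‘one central Loki, all agents shipping there’ option a bit more concrete, here is a minimal sketch of what an agent in each environment would send. It only builds the request for Loki's HTTP push endpoint (`/loki/api/v1/push`) and does not send anything. The base URL, the tenant name and the `env` / `site` label names are assumptions for illustration, not something taken from the docs or from this thread; the `X-Scope-OrgID` header only matters if multi-tenancy is enabled.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON-serializable body in the shape Loki's push endpoint accepts."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


def build_push_request(base_url, tenant, labels, lines, now_ns=None):
    """Return (url, headers, body); actually sending it is left to the caller."""
    url = base_url.rstrip("/") + "/loki/api/v1/push"
    headers = {"Content-Type": "application/json"}
    if tenant:
        # Tenant header, only relevant when Loki runs with multi-tenancy enabled.
        headers["X-Scope-OrgID"] = tenant
    body = json.dumps(build_push_payload(labels, lines, now_ns))
    return url, headers, body


if __name__ == "__main__":
    # Hypothetical endpoint, tenant and labels, one set per environment.
    url, headers, body = build_push_request(
        "https://loki.example.internal",
        "hetzner-metal",
        {"env": "production", "site": "hetzner"},
        ["service started"],
    )
    print(url)
    print(headers)
    print(body)
```

In practice an agent such as Promtail does this for you; the point is only that every environment ships to the same endpoint and is told apart by labels (and optionally by tenant).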

b0b · February 14, 2022, 3:53pm

Hi,

any idea of your log volumes?

I am testing this at the moment: Examples | Grafana Labs

The only dependency is an object store, which for me is S3. I plan on having one central Loki per “environment”, which for me means one for staging and one for production, just so I can test upgrades etc. in the staging cluster without breaking production.

In production we will probably log 10k-30k logs per second once we are up and running, which does not seem to be an issue even for my proof-of-concept Loki setup with pretty low resources. This is for ingestion. Querying that data has been more challenging, but I am slowly finding ways of optimizing that as well.

rrrrrrr · February 15, 2022, 8:30am

Thanks for the feedback.

Testing the upgrade is indeed a topic. Here’s my plan about those, tell me if I missed something: since Loki upgrades are unlikely to affect the data (the logs themselves), I can duplicate Loki using Velero (our k8s resources backup & duplication tool) on a lab cluster, run the upgrade there, see if it seems ok, then apply on production.

Noted for ingestion. About querying, what were your challenges, and your solutions, if you can explain those quickly? I intend to use the distributed Helm chart, which should allow us to tune the resources & scale out the bottlenecks.

b0b · February 15, 2022, 8:53am

My load testing created very few streams, so querying that test data was almost impossible because the responses were massive.

From what I have learned so far: try to find a good balance of streams for segmenting your data (not too many, but also definitely not too few). We also run Loki in microservices mode. Querying speeds up by adding querier nodes (not surprising). There are plenty of posts about improving query performance. Still figuring this out…
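A rough way to check the stream balance before load testing: in Loki a stream is one unique combination of label names and values, so you can count the distinct label sets in a sample of entries. The sketch below does just that in plain Python. The function names and the sample labels are made up for illustration, and it says nothing about what a good number of streams is for a given volume.

```python
from collections import Counter


def stream_key(labels):
    """A stream is identified by its full label set, independent of key order."""
    return tuple(sorted(labels.items()))


def summarize_streams(entries):
    """Count distinct streams in an iterable of (labels, line) pairs."""
    counts = Counter(stream_key(labels) for labels, _line in entries)
    total = sum(counts.values())
    largest = max(counts.values()) / total if total else 0.0
    return {
        "streams": len(counts),
        "entries": total,
        # Share of all entries that land in the single busiest stream.
        "largest_stream_share": largest,
    }


if __name__ == "__main__":
    # Hypothetical sample: three entries share one label set, one differs.
    sample = [
        ({"app": "api", "env": "prod"}, "GET /health"),
        ({"env": "prod", "app": "api"}, "GET /users"),
        ({"app": "web", "env": "prod"}, "render home"),
        ({"app": "api", "env": "prod"}, "GET /orders"),
    ]
    print(summarize_streams(sample))
```

If almost everything ends up in one or two streams, that matches the load test situation described above; if the stream count grows with every request (for example because of a label holding a request ID), the labels are probably too fine-grained.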

system · February 15, 2023, 8:53am

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.
