The Grafana stack seems to be heavily focused on alarms and metrics as a starting point. First, you get some metrics event, then you drill down into logs and traces to figure out what happened.
Why did I get that impression? Loki does not index log content; its labels seem to be used for sharding (streams). The recommendation is to use only source identifiers as labels, so it is only efficient to search by when and where. Tempo indexes only by trace ID, so the only efficient search is by a trace ID that you got from somewhere else (like metrics).
That is all well and good unless you want to find functional rather than non-functional issues, as you cannot easily search by customer ID or a similar field. For example, if a customer complains about a transaction they made yesterday at 13:44, how would you find it? Maybe backend search can help, but it is not production-grade.
Am I getting this right? What can I do about functional issues?
Hi @zeljko! There are multiple options for trace discovery with Tempo.
The main one is Tempo search. As you point out, backend search is not very mature yet, though I wouldn’t say it’s not production-grade. We use it in production and in our Grafana Cloud Traces offering. While its performance may not be as good as we’d like in some cases, it works well and we’re actively working on improving it. One of our main projects right now is a new columnar format that will speed up search and provide great improvements.
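To make the search option concrete for the "customer at 13:44" case, here is a rough sketch that builds a request URL for Tempo's search endpoint (`/api/search`, with `tags` passed as logfmt and `start`/`end` as Unix seconds). Treat everything else as an assumption: the `http://tempo:3200` address, the `customer.id` attribute name (your services would have to set it on their spans), and the helper name `build_tempo_search_url` are made up for illustration, and whether the query performs well depends on your Tempo version and search configuration.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_tempo_search_url(base_url, tags, start, end, limit=20):
    """Build a Tempo search URL (hypothetical helper, not part of any library).

    tags  -- dict of span/process attributes to filter on
    start -- timezone-aware datetime, beginning of the search window
    end   -- timezone-aware datetime, end of the search window
    """
    # The `tags` parameter is logfmt: key=value pairs separated by spaces.
    # Values containing spaces are wrapped in double quotes.
    pairs = []
    for key, value in tags.items():
        value = str(value)
        if " " in value:
            value = f'"{value}"'
        pairs.append(f"{key}={value}")

    params = {
        "tags": " ".join(pairs),
        "start": int(start.timestamp()),  # Unix epoch seconds
        "end": int(end.timestamp()),
        "limit": limit,
    }
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


# Example: a ten-minute window around the reported time (assumed to be UTC).
url = build_tempo_search_url(
    "http://tempo:3200",            # assumed address of the Tempo HTTP API
    {"customer.id": "12345"},       # assumed span attribute name and value
    datetime(2022, 1, 1, 13, 40, tzinfo=timezone.utc),
    datetime(2022, 1, 1, 13, 50, tzinfo=timezone.utc),
)
print(url)
```

Keeping the time window narrow matters here: since the backend has to scan trace data rather than consult an index, a smaller window means less data to read.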
That said, there are other options for trace discovery, such as derived fields and exemplars. Derived fields consist of adding the trace ID to logs and linking to the trace datasource from there. Exemplars allow you to jump from metrics to traces. Internally we use both daily and they work pretty well.
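The idea behind derived fields can be sketched in a few lines: filter log lines by a business field, then pull the trace ID out of each match with a regex, which is roughly what a derived field does before linking to the trace datasource. This is only an illustration, not how Grafana implements it. The `traceID=<hex>` log format, the `customer_id=` field, and the function names are all assumptions.

```python
import re
from typing import Iterable, List, Optional

# Assumed log format: the application writes "traceID=<hex>" into each line.
TRACE_ID_PATTERN = re.compile(r"traceID=([0-9a-fA-F]{16,32})\b")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace ID embedded in a log line, or None if there is none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


def find_trace_ids(log_lines: Iterable[str], needle: str) -> List[str]:
    """Return the distinct trace IDs of all lines containing `needle`, in order."""
    trace_ids: List[str] = []
    for line in log_lines:
        if needle not in line:
            continue
        trace_id = extract_trace_id(line)
        if trace_id is not None and trace_id not in trace_ids:
            trace_ids.append(trace_id)
    return trace_ids


# Made-up sample lines, standing in for the result of a narrow log query.
sample_logs = [
    "13:44:02 level=info customer_id=12345 msg=\"payment started\" traceID=0af7651916cd43dd8448eb211c80319c",
    "13:44:02 level=info customer_id=99999 msg=\"payment started\" traceID=1111111111111111aaaaaaaaaaaaaaaa",
    "13:44:03 level=error customer_id=12345 msg=\"payment declined\" traceID=0af7651916cd43dd8448eb211c80319c",
    "13:44:05 level=info msg=\"healthcheck ok\"",
]

customer_traces = find_trace_ids(sample_logs, "customer_id=12345")
print(customer_traces)
```

In practice the filtering step would be a log query limited to the relevant source and time range, and the trace ID found this way is what you would open in Tempo.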