I am taking a look at Grafana Tempo, and I see on the landing page that “Not indexing the traces makes it possible to store orders of magnitude more trace data for the same cost”.
This makes me wonder how TraceQL is executing its queries. Not having indexes is great for high write throughput, but does that mean that performing read queries requires a lot of data scanning? What kind of query execution plan does TraceQL use?
Hi, great question. This is a super deep topic, and the design of Tempo is definitely a balance between write-time and read-time trade-offs. The main concept is that Tempo stores traces in columnar Parquet files, which it can scan very quickly. See the original design proposal for a good introduction. The Parquet files have a schema suited to querying traces on fields like duration, status, and attributes of different data types. There are dedicated columns for some common attributes like http.status_code, so when executing a TraceQL query like { span.http.status_code >= 500 } Tempo only needs to access that one column, which is very efficient. There’s a lot more going on, and we keep evolving the Parquet schema each release, but this covers the main concepts.
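Roughly speaking, attributes that don’t get a dedicated column land in generic key/value attribute columns instead, so a query on a made-up attribute like

{ span.checkout.step = "payment" }

still runs as a columnar scan, just over broader columns than the dedicated-column fast path.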
I guess this is not much different from most columnar OLAP databases: use table statistics to prune out shards before scanning, then use additional logic to join the results in memory. I find it very interesting that the Parquet format itself provides a lot of that functionality. This is quite nice, and I like it.
The usual problem I see with this type of solution is that as the data size increases, the in-memory result join gets more expensive: each separate query predicate matches a growing amount of data on its own, while the resulting intersection stays comparatively small.
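To make that concrete with a made-up query that joins two spansets:

{ span.http.status_code >= 500 } && { span.db.system = "mysql" }

Each side can be evaluated cheaply against its own columns, but both sets of matching spans keep growing with data volume even when the traces satisfying both stay rare, so the cost of intersecting them grows too.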
I do not really have any other questions, but I want to offer a possible future design suggestion: provide an optional custom IdGenerator for the language SDKs that embeds a hash of the service name (or tenant ID) and a timestamp into the SpanId/TraceId, and then have each tenant’s data written to separate folders. This would help the Parquet files stay more focused on each service, for easy aggregation, while still allowing easy distributed tracing, since the ID itself contains information about where the data lives.
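Here is a rough Go sketch of the idea, built on the IDGenerator interface from the OpenTelemetry Go SDK (go.opentelemetry.io/otel/sdk/trace); the byte layout, hash choice, and names are just illustrative:

```go
package tracing

import (
	"context"
	"crypto/rand"
	"encoding/binary"
	"hash/fnv"
	"time"

	"go.opentelemetry.io/otel/trace"
)

// serviceAwareIDGenerator (illustrative name) builds trace IDs whose first
// 4 bytes are a Unix timestamp and next 4 bytes are a hash of the service
// name, leaving the remaining 8 bytes random.
type serviceAwareIDGenerator struct {
	serviceHash uint32
}

func newServiceAwareIDGenerator(serviceName string) *serviceAwareIDGenerator {
	h := fnv.New32a()
	h.Write([]byte(serviceName))
	return &serviceAwareIDGenerator{serviceHash: h.Sum32()}
}

// NewIDs implements sdktrace.IDGenerator.
func (g *serviceAwareIDGenerator) NewIDs(ctx context.Context) (trace.TraceID, trace.SpanID) {
	var tid trace.TraceID
	binary.BigEndian.PutUint32(tid[0:4], uint32(time.Now().Unix())) // when the trace started
	binary.BigEndian.PutUint32(tid[4:8], g.serviceHash)             // which service/tenant owns it
	rand.Read(tid[8:])                                              // randomness for uniqueness
	return tid, g.NewSpanID(ctx, tid)
}

// NewSpanID implements sdktrace.IDGenerator; span IDs stay fully random here.
func (g *serviceAwareIDGenerator) NewSpanID(_ context.Context, _ trace.TraceID) trace.SpanID {
	var sid trace.SpanID
	rand.Read(sid[:])
	return sid
}
```

You would plug it in with sdktrace.NewTracerProvider(sdktrace.WithIDGenerator(newServiceAwareIDGenerator("checkout-service"))), and the backend could then route writes by the embedded hash.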
I found it interesting when I learned about the AWS OpenTelemetry X-Ray IdGenerator: it encodes the time of the trace in the first bytes of the trace ID, which makes time-based lookups in the database easy.
Looking forward to more updates on this product. Y’all do a great job.
Provide an optional custom IdGenerator for the language SDKs that embeds a hash of the service name (or tenant ID) and a timestamp into the SpanId/TraceId.
It’s awesome that you suggest this. We actually considered this very early on, when the backend was still proto-based, to improve search times. However, some of our earliest users were using things like Istio to generate trace IDs and so did not have the option, and we abandoned the idea. Currently, our Parquet files are sorted by trace ID for quick trace lookup, but sorting by timestamp has some very nice properties. If the timestamp were part of the trace ID, you’d get both together!
This would help the Parquet files stay more focused on each service, for easy aggregation, while still allowing easy distributed tracing, since the ID itself contains information about where the data lives.
A key design tenet of Tempo is to keep each trace entirely in one spot. This allows for structural queries like the following (the names are made up for illustration):
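{ resource.service.name = "frontend" } >> { span.db.system = "mysql" }

The >> operator matches descendant spans anywhere below the left-hand spans, so answering this requires having the whole trace together.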
If we sharded data based on anything that split traces we’d lose the ability to efficiently do this. Personally, I think these kinds of queries are Tempo’s killer feature.
I have considered allowing a custom label configuration that shards beneath the tenant. Say you were aware that traces never crossed a specific label boundary like resource.environment. Then you could safely create “sub-tenants” where the traces are sharded by the configured label, and if a user includes resource.environment in their query, you only have to look at that subset of their traces.
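For instance, with sub-tenants sharded on resource.environment, a hypothetical query like

{ resource.environment = "prod" && span.http.status_code >= 500 }

could be answered from the prod shard’s blocks alone, skipping every other environment’s data entirely.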
There are ways to extend this for labels that have different values in a trace, but it gets more complicated and less reliable.
Dunno. We’re doing our best! Everything is a trade-off, right? Thank you for the deep thoughts on the database. We love to see comments like this from the community.