I am trying to run tempo-distributed in Kubernetes but I am getting the following error messages:
"Failed to resolve tempo-distributed-gossip-ring: lookup tempo-distributed-gossip-ring on 10.96.0.10:53: no such host" │
"Failed to resolve tempo-distributed-gossip-ring: lookup tempo-distributed-gossip-ring on 10.96.0.10:53: server misbehaving"
msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
I have been reading and the reason is that the distributor and ingester can’t communicate. They need to be in the same gossip ring. There is an issue related to this in this forum, but they are using etcd for the KV store, which is not my case.
Is this gossip ring new to Tempo? I don’t remember having this issue in the past.
The question is:
How can I create a gossip ring for Tempo? Is there an article showing the steps, because I am at a loss?
The gossip ring is not new at all; the implementation hasn’t changed recently AFAIK. It’s backed by memberlist and ensures all components can find each other, which allows them to load balance requests and shard work between each other.
This page has more information about the different rings: Consistent Hash Ring | Grafana Labs
How are you deploying Tempo? From the generated names I’m guessing you are using the Helm chart and deploying to Kubernetes?
If so, the gossip ring is backed by a headless Kubernetes service (it’s probably called tempo-distributed-gossip-ring). Is this service listed with kubectl get svc?
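If it is there, the output should look roughly like this (an illustration only, names and age will differ in your cluster; the CLUSTER-IP of None is what makes the service headless and 7946 is the usual memberlist port):

$ kubectl get svc
NAME                            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
tempo-distributed-gossip-ring   ClusterIP   None         <none>        7946/TCP   3d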
So your config and the services look alright. The headless tempo-distributed-gossip-ring service is used for the gossip ring.
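For reference, the chart wires the components to that service through the memberlist section of the Tempo config; a minimal sketch of the relevant part (your generated config may differ slightly):

memberlist:
  # DNS name of the headless service; it resolves to every pod that should join the ring
  join_members:
    - tempo-distributed-gossip-ring
  # default gossip port
  bind_port: 7946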
That last log line:
msg="joined memberlist cluster" reached_nodes=4
seems to indicate that this component eventually managed to join memberlist. Does the pusher failed to consume trace data error still appear?
To get an overview of memberlist you can check out the /memberlist page or one of the ring endpoints. /memberlist should list the other members, i.e. all distributors, ingesters and queriers.
See API documentation | Grafana Labs
To visit this page I usually set up a port-forward:
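Something along these lines (service name and port are just an example, adjust them to your release; depending on your Tempo version the HTTP port is 3100 or 3200):

# forward the query-frontend HTTP port to localhost
kubectl port-forward svc/tempo-distributed-query-frontend 3100:3100

and then open http://localhost:3100/memberlist in a browser.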
Yeah, I did what you suggested and the memberlist looks right. It eventually settles and it works now, thanks. We had a network issue that was causing problems querying the traces, but the ring is fine.
By the way, I know that Loki has an nginx gateway / ingress controller for basic auth as part of the deployment. Do we have something similar for Tempo?
We don’t right now. We have an open issue to document the use of a gateway and this user shared their setup already: Document how to deploy Tempo to ingest traces from multiple clusters · Issue #977 · grafana/tempo · GitHub
If you get a good setup feel free to share your experience as well! In Grafana Cloud we use a custom gateway that is closely tied into our auth infrastructure, so it doesn’t make sense to open-source it.
It is good to have this kind of document. I can see it is at an early stage, as it only covers routing with an ingress but still needs basic auth. I started to work on it; at first glance it seemed not that difficult using this nginx ingress controller setup, but it turns out that Grafana has trouble reaching the tempo-distributed-query-frontend service.
In Kubernetes I deployed an nginx ingress controller with the following configuration:
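The relevant part is roughly this (an illustrative sketch; the secret name, backend port and ingress class are placeholders for my actual values):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tempo-query-frontend
  annotations:
    kubernetes.io/ingress.class: nginx
    # standard ingress-nginx basic auth annotations
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: tempo-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
spec:
  rules:
    - host: tempo.mycompany.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: tempo-distributed-query-frontend
                port:
                  number: 3100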
Once you enter credentials it lets you in; I’m not sure if getting a 404 from tempo-distributed-query-frontend is okay though.
curl -v tempo.mycompany.dev -u "user:password"
* Trying 52.##.##.##...
* TCP_NODELAY set
* Connected to tempo.mycompany.dev (52.##.##.##) port 80 (#0)
* Server auth using Basic with user 'user'
> GET / HTTP/1.1
> Host: tempo.mycompany.dev
> Authorization: Basic MDUyZTdkODQtYjA2ZC00OGFjLWJhMzctZGE4YTM0MmQ4NGM3OmVReTZmNjR0Z==
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Date: Sun, 03 Oct 2021 19:43:58 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 19
< Connection: keep-alive
< X-Content-Type-Options: nosniff
<
404 page not found
* Connection #0 to host tempo.mycompany.dev left intact
* Closing connection 0
However, when I create a datasource in Grafana and point it to the nginx ingress controller, the test passes, but when I query for a trace I get the same 404 that I get with the curl command.
You said you are using a custom gateway for Tempo. Are you using nginx? Is it possible to share part of the configuration? Especially how to configure it to grant access to the tempo-distributed-query-frontend service.
Yeah, that’s expected: there is nothing at /. I recommend using /api/echo to verify the query-frontend is reachable. To query a trace, use /api/traces/<traceID>.
Unless you have configured http_api_prefix; in that case it will be /<prefix>/api/echo.
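For example, using the same host and credentials as your curl above (only the paths change):

curl -u "user:password" http://tempo.mycompany.dev/api/echo
curl -u "user:password" http://tempo.mycompany.dev/api/traces/<traceID>

/api/echo should simply answer with echo if the request reaches the query-frontend.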
This error is weird: Tempo usually answers with 404 trace not found. I think something else is returning this 404.
It’s a custom Go app, not based on nginx or anything else. It accepts requests on a limited number of paths, verifies authentication, sets the X-Scope-OrgID header (only needed if you run Tempo multi-tenant) and passes the request to either the distributor or the query-frontend (depending on the path).
Our gateway only allows:
GET /api/echo
GET /api/traces/{traceID}
POST /opentelemetry.proto.collector.trace.v1.TraceService/Export (OTLP gRPC).
We also have Envoy running between our gateway and the distributors to ensure load balancing is gRPC-aware. The default Go load balancer does round robin, which isn’t great for gRPC streams. This is explained in a bit more detail here: gRPC Load Balancing | gRPC
This will only be necessary if you are working with gRPC, of course.