Tempo Vulture fails to query Tempo 90–95% of the time

Hello, I tried to deploy Tempo and Tempo Vulture on Kubernetes, but a high number of Tempo Vulture queries are failing with 404, even though the ingester seems OK.

Here is a bit of context:
Custom Helm chart values:

storage:
  trace:
    backend: gcs
    gcs:
      bucket_name: grafana-tempo-storage
traces:
  jaeger:
    # -- Enable Tempo to ingest Jaeger GRPC traces
    grpc: true
    # -- Enable Tempo to ingest Jaeger Thrift Binary traces
    thriftBinary: false
    # -- Enable Tempo to ingest Jaeger Thrift Compact traces
    thriftCompact: false
    # -- Enable Tempo to ingest Jaeger Thrift HTTP traces
    thriftHttp: false
  # -- Enable Tempo to ingest Zipkin traces
  zipkin: false
  otlp:
    # -- Enable Tempo to ingest OpenTelemetry HTTP traces
    http: false
    # -- Enable Tempo to ingest OpenTelemetry GRPC traces
    grpc: true
  # -- Enable Tempo to ingest OpenCensus traces
  opencensus: false

For Tempo Vulture:

tempoAddress:
  # -- the url towards your Tempo distributor, e.g. http://distributor
  push: http://tempo-tempo-distributed-distributor
  # -- the url towards your Tempo query-frontend, e.g. http://query-frontend:3100
  query: http://tempo-tempo-distributed-query-frontend:3100
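
To rule out Vulture itself, I can also hit the query-frontend directly for one of the trace IDs Vulture logs. A rough sketch from a throwaway curl pod in the same namespace; <trace-id> is a placeholder for an ID taken from the logs below:

kubectl run tempo-debug --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sv http://tempo-tempo-distributed-query-frontend:3100/api/traces/<trace-id>

A 200 with a JSON body means the trace is retrievable; a 404 matches what Vulture reports.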

Ingester metrics (screenshot not reproduced here; the ingester looks healthy):

Tempo Vulture Logs:

E0827 09:37:13.720721       1 main.go:217] tempo url http://tempo-tempo-distributed-query-frontend:3100/api/traces/7f77b5ead87c970307da776e48f53d01
E0827 09:37:43.720688       1 main.go:217] tempo url http://tempo-tempo-distributed-query-frontend:3100/api/traces/312c83960a99aba1078f439f1339691f
E0827 09:37:43.733165       1 main.go:220] trace not found 312c83960a99aba1078f439f1339691f
E0827 09:37:43.733193       1 main.go:115] error querying Tempo trace not found
E0827 09:38:13.720435       1 main.go:217] tempo url http://tempo-tempo-distributed-query-frontend:3100/api/traces/2ec0d3ff939e977d5f3d81178d499b0b
E0827 09:38:13.738969       1 main.go:220] trace not found 2ec0d3ff939e977d5f3d81178d499b0b
E0827 09:38:13.738997       1 main.go:115] error querying Tempo trace not found

Query Frontend Logs:

level=info ts=2021-08-27T09:36:43.72913336Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=5c085d48f9b70c94 url=/api/traces/6a4128219711c2d03bd036a55c15bf4a duration=2.174074ms response_size=0 status=404
level=info ts=2021-08-27T09:37:13.748972787Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=6ff7dfa7785bc705 url=/api/traces/7f77b5ead87c970307da776e48f53d01 duration=19.823453ms response_size=31175 status=200
level=info ts=2021-08-27T09:37:43.730438653Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=7dfd9078215a3d0a url=/api/traces/312c83960a99aba1078f439f1339691f duration=4.114527ms response_size=0 status=404
level=info ts=2021-08-27T09:38:13.736302197Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=1f2c2de172362a17 url=/api/traces/2ec0d3ff939e977d5f3d81178d499b0b duration=3.890153ms response_size=0 status=404
level=info ts=2021-08-27T09:38:43.731232202Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=430b8396d46bc714 url=/api/traces/1d1d191169694e534c925012e2057f8b duration=2.919309ms response_size=0 status=404

But once in a while, the query does succeed:

level=info ts=2021-08-27T07:59:43.74124059Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=273151edb55a0b2a url=/api/traces/15004a4f0a6d3ff47b504d2aad6719ea duration=12.496063ms response_size=4221 status=200
level=info ts=2021-08-27T08:21:13.745419617Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=58d9faad6d056415 url=/api/traces/40c7db016c97f94c6b3440cc84d1276b duration=16.035543ms response_size=19685 status=200
level=info ts=2021-08-27T08:21:43.730745961Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=723bbd08e52005e8 url=/api/traces/740c1be16a79090b6fd7658a9d6cc9eb duration=3.221735ms response_size=0 status=404
level=info ts=2021-08-27T08:23:13.733750734Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=4369729f9260c327 url=/api/traces/20041be668d2413169a69085b0af1dbe duration=3.215038ms response_size=0 status=404
level=info ts=2021-08-27T08:29:13.732850912Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=64d61ed72a7e4ee5 url=/api/traces/4ed8edabad630fbb70acc146420076f4 duration=4.500166ms response_size=0 status=404
level=info ts=2021-08-27T08:46:13.731772424Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=5326badbda65e07a url=/api/traces/30a3b3cd751cbc6f3bd64eaf11028f74 duration=3.12009ms response_size=0 status=404
level=info ts=2021-08-27T08:48:43.73743803Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=6217c14e4d43bc40 url=/api/traces/2c95635565f805fe26c24d03880def8a duration=11.208868ms response_size=12860 status=200
level=info ts=2021-08-27T08:52:13.732720859Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=6200951b87e460ef url=/api/traces/086cecfed1ce09b622f022967fa5cf2f duration=3.246412ms response_size=0 status=404
level=info ts=2021-08-27T09:01:44.836208631Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=636faee5ae8ac280 url=/api/traces/66ccbf1cff7118945ecd5f0f2a4e8758 duration=15.751986ms response_size=19375 status=200
level=info ts=2021-08-27T09:10:43.784799736Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=7a08465341a3ecb1 url=/api/traces/716e343c8d14ee9011712534887cae51 duration=57.297725ms response_size=37153 status=200
level=info ts=2021-08-27T09:11:43.837460553Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=75ea476fe1f90c53 url=/api/traces/6f36ca7e6561cc7138bae5eaa601a4d6 duration=109.769497ms response_size=11489 status=200
level=info ts=2021-08-27T09:14:13.731200855Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=3d23f042ae6fcdb3 url=/api/traces/33d9be3f677c8b1d066cdfc75ac999b4 duration=3.028284ms response_size=0 status=404
level=info ts=2021-08-27T09:21:43.732868039Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=17f6c643bf11a2d7 url=/api/traces/6207fca5ac776c64020011603cf47c4a duration=3.197556ms response_size=0 status=404
level=info ts=2021-08-27T09:37:13.748972787Z caller=frontend.go:108 tenant=single-tenant method=GET traceID=6ff7dfa7785bc705 url=/api/traces/7f77b5ead87c970307da776e48f53d01 duration=19.823453ms response_size=31175 status=200

I have no clue where the issue could be; if you can provide some investigation tips, it would be appreciated!

Your logs look different from ours; your vulture may be a bit out of date. Are you logging any lines for writing traces? I see a lot of these:

T=2021-08-27T15:13:09.525Z L=INFO M="sending trace" org_id=1 write_trace_id=64dcd96e2bac3068549e83f862b9d172 seed=1630077180
T=2021-08-27T15:13:19.526Z L=INFO M="sending trace" org_id=1 write_trace_id=6606fff5aa56988222802b7fd5b21901 seed=1630077190
T=2021-08-27T15:13:29.526Z L=INFO M="sending trace" org_id=1 write_trace_id=54b067ddeda8c7c2708801005d279488 seed=1630077200
T=2021-08-27T15:13:39.525Z L=INFO M="sending trace" org_id=1 write_trace_id=58ba146613e0eda32dcd2b88d65f240f seed=1630077210

in our logs. We just cut 1.1. Perhaps try the vulture from that release?
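
If you are installing it through the tempo-distributed chart, pinning the vulture image in your values should be enough to pick up the newer build. A rough sketch only; the exact keys depend on your chart version, so check its values.yaml:

vulture:
  image:
    repository: grafana/tempo-vulture
    tag: 1.1.0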

Here are our CLI params:

      -prometheus-listen-address=:8080
      -tempo-push-url=http://distributor
      -tempo-query-url=http://query-frontend:3100
      -logtostderr=true
      -tempo-org-id=1
      -tempo-read-backoff-duration=5s
      -tempo-write-backoff-duration=10s

Hello @joeelliott, indeed I used the Helm chart that deploys tempo-vulture:0.7.

I switched to v1.1 and finally managed to find the error: the compactor was missing a permission on the bucket (storage.objects.get). I did not see it at first in the logs, and I wasn't expecting the compactor to block the whole service, but it appears it can.
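
For anyone hitting the same thing, this is roughly how I fixed it. A sketch only; the service-account name is a placeholder and the exact role you grant depends on your setup:

# roles/storage.objectAdmin includes storage.objects.get, plus the object
# create/delete/list permissions the compactor also uses on the bucket.
gsutil iam ch \
  serviceAccount:tempo-compactor@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://grafana-tempo-storage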

Thanks for your help,
