I have taken the example distributed docker-compose setup and converted it over to Docker Swarm; however, I am seeing continual network problems. This is the Swarm stack file I am using:
version: '3.8'
services:
  distributor:
    image: grafana/tempo:1.4.1
    hostname: distributor
    command: "-target=distributor -config.file=/etc/tempo.yml"
    # No ports exposed, the multiple NICs mess up Tempo
    #ports:
    #  - "3100:3100"
    #  - "4317:4317"
    volumes:
      - /opt/tempo/configs/tempo.yml:/etc/tempo.yml:ro
  ingester:
    image: grafana/tempo:1.4.1
    hostname: ingester-{{.Task.Slot}}
command: "-target=distributor -config.file=/etc/tempo.yml"
    volumes:
      - /opt/tempo/configs/tempo.yml:/etc/tempo.yml:ro
    deploy:
      placement:
        max_replicas_per_node: 1
      replicas: 3
  query-frontend:
    image: grafana/tempo:1.4.1
    hostname: query-frontend
    command: "-target=query-frontend -config.file=/etc/tempo.yml"
    volumes:
      - /opt/tempo/configs/tempo.yml:/etc/tempo.yml:ro
    deploy:
      replicas: 1
  querier:
    image: grafana/tempo:1.4.1
    hostname: querier
    command: "-target=querier -config.file=/etc/tempo.yml"
    volumes:
      - /opt/tempo/configs/tempo.yml:/etc/tempo.yml:ro
    deploy:
      replicas: 1
  compactor:
    image: grafana/tempo:1.4.1
    hostname: compactor
    command: "-target=compactor -config.file=/etc/tempo.yml"
    volumes:
      - /opt/tempo/configs/tempo.yml:/etc/tempo.yml:ro
    deploy:
      replicas: 1
  metrics_generator:
    image: grafana/tempo:1.4.1
    hostname: metrics_generator
    command: "-target=metrics-generator -config.file=/etc/tempo.yml"
    volumes:
      - /opt/tempo/configs/tempo.yml:/etc/tempo.yml:ro
    deploy:
      replicas: 1
networks:
  default:
    name: some-network
    driver: overlay
    attachable: true
    # If this is not internal, the multiple NICs mess up Tempo
    internal: true
If the network is not declared as internal, I see errors like this in the logs:
level=warn ts=2022-08-22T08:20:36.41597158Z caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed" addr=172.18.0.9:7946 err="dial tcp 172.18.0.9:7946: i/o timeout"
If I expose any port, I see errors like this:
level=warn ts=2022-08-22T08:31:10.297712956Z caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed" addr=172.18.0.7:7946 err="dial tcp 172.18.0.7:7946: connect: network is unreachable"
The IPs being logged appear to belong to the docker_gwbridge network.
The issue seems to come down to the containers having multiple NICs in these scenarios, but no amount of fiddling with http_listen_address, grpc_listen_address, instance_interface_names, interface_names, advertise_addr, or bind_addr to "lock" each service to eth0 has fixed it (roughly what I have been trying is sketched below, after the log). The settings do have an effect, though: when I pin everything to eth0, including the advertise/bind IP, I get errors like this:
level=warn ts=2022-08-22T08:32:28.387377885Z caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.0.0.9:7946 err="dial tcp 10.0.0.9:7946: i/o timeout"
level=warn ts=2022-08-22T08:32:28.394078007Z caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed" addr=172.18.0.9:7946 err="dial tcp 172.18.0.9:7946: i/o timeout"
Note the mix of ingress and docker_gwbridge IPs.
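For reference, the kind of overrides I have been experimenting with in tempo.yml look roughly like this; the addresses are placeholders and I have tried many combinations, so this is only meant to show the shape of what I mean by "locking" things to eth0:
server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
memberlist:
  bind_port: 7946
  # bind/advertise only on the overlay interface (eth0 inside the task)
  bind_addr:
    - <eth0 IP of the task>
  advertise_addr: <eth0 IP of the task>
distributor:
  ring:
    instance_interface_names:
      - eth0
ingester:
  lifecycler:
    interface_names:
      - eth0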
Can anyone suggest how to resolve these warnings so that Tempo uses only the correct NIC and IPs, or point me to the part of the documentation where I can find an answer? My problem looks similar to issue 927, but I didn't find a solution there.
An internal network with no exposed ports isn't of much use, of course, and the only workaround I can think of is putting something like HAProxy in front of the stack (a rough sketch is at the end of this post). As it stands the stack cannot consume spans at all, with the distributor logging errors like this (which I take to mean it cannot see any ingesters in the ring):
level=error ts=2022-08-22T09:12:57.061880819Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"
And the ingesters:
ts=2022-08-22T09:49:56.071293583Z caller=memberlist_logger.go:74 level=error msg="Push/Pull with distributor-96cf3228 failed: dial tcp 10.0.0.102:7946: connect: connection refused"
Host OS is CentOS 7.9.2009
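For completeness, the HAProxy idea would just be an extra service in the stack file above, attached to the same overlay but publishing the ports, with haproxy.cfg forwarding traffic on to the distributor. I haven't built this yet, so the following is only a sketch; the service name, image tag, and config path are illustrative:
  edge-proxy:
    image: haproxy:2.6
    # The proxy doesn't run memberlist, so the extra NICs it gets from
    # publishing ports shouldn't matter for it.
    ports:
      - "3100:3100"
      - "4317:4317"
    volumes:
      # haproxy.cfg would forward 3100 -> distributor:3100 and 4317 -> distributor:4317
      - /opt/tempo/configs/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    deploy:
      replicas: 1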