Shipping specific metrics with certain labelled values to Grafana Cloud

Hello!

Been trying to figure this out for days and we’re frankly getting nowhere. Context - we’re using Grafana Cloud to monitor a bunch of EKS clusters. All clusters run v15.5.3 of the prometheus-community/prometheus Helm chart, and we define a remote_write block inside Prometheus’ config that lets us ship certain metrics, with certain label values, to Grafana Cloud. For instance:

  1. We’d ship kube_node_status_condition as is.
  2. We’d like to ship kube_deployment_status_replicas_ready, but only when deployment=coredns, and exported_namespace=kube-system. Similarly, to monitor the state of Prometheus on the cluster itself, we’d only like to ship this metric when deployment=prometheus-server and exported_namespace=monitoring.
  3. Much in the same vein - we’d like to only ship those instances of container_memory_usage_bytes when container=~prometheus-server.

The list goes on - there are other metrics we’d like to ship selectively. The problem is making it selective - we’ve tried a bunch of approaches, but all we’ve managed to do is ship the metrics we want with all of their series, not just the label combinations we want to keep.
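
To make the intent concrete, the series we want to keep would look roughly like this in PromQL selector terms (label names such as exported_namespace and deployment are simply what kube-state-metrics exposes on our clusters):

  kube_node_status_condition
  kube_deployment_status_replicas_ready{exported_namespace="kube-system", deployment="coredns"}
  kube_deployment_status_replicas_ready{exported_namespace="monitoring", deployment="prometheus-server"}
  container_memory_usage_bytes{container=~"prometheus-server"}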

Here’s what our existing remote_write block looks like:

serverFiles:
  prometheus.yml:
    remote_write:
      - basic_auth:
          password: XXXXX
          username: XXXXXX
        remote_timeout: 120s
        url: https://XXXXXXX
        write_relabel_configs:
          - action: keep
            regex: >-
              kube_node_status_condition|kube_deployment_status_replicas_ready|kube_daemonset_status_desired_number_scheduled|kube_statefulset_status_replicas_available|kube_pod_status_ready|container_cpu_usage_seconds_total|container_memory_usage_bytes|kube_pod_container_resource_requests|kube_pod_container_resource_limits|kubelet_volume_stats_used_bytes|kubelet_volume_stats_capacity_bytes
            source_labels:
              - __name__

Here are some versions of what we’ve tried inside write_relabel_configs:

          - action: keep
            regex: kube_deployment_status_replicas_ready;kube-system;coredns
            source_labels:
              - __name__
              - exported_namespace
              - deployment
          - action: keep
            regex: kube_deployment_status_replicas_ready;monitoring;prometheus-server
            source_labels:
              - __name__
              - exported_namespace
              - deployment
          - action: keep
            regex: kube_node_status_condition
            source_labels:
              - "__name__"
          - action: keep
            regex: kube_deployment_status_replicas_ready{exported_namespace="^kube-system$", deployment="^coredns$"}
            source_labels:
              - "__name__"
              - "exported_namespace"
              - "deployment"
          - action: keep
            regex: kube_deployment_status_replicas_ready{exported_namespace="^monitoring$", deployment="^prometheus-server"}
            source_labels:
              - "__name__"
              - "exported_namespace"
              - "deployment"
          - action: keep
            regex: kube_deployment_status_replicas_ready.*
            source_labels:
              - __name__
          - action: drop
            regex: .+
            source_labels:
              - exported_namespace
              - deployment
          - action: keep
            regex: kube_deployment_status_replicas_ready.*
            source_labels:
              - exported_namespace
              - deployment
              - __name__
          - action: keep
            regex: ^(kube-system|monitoring)$
            source_labels:
              - exported_namespace
          - action: keep
            regex: ^(coredns|prometheus-server)$
            source_labels:
              - deployment
          - action: labelmap
            regex: __name__|exported_namespace|deployment

None of these have worked - the result with all three versions is the same: no metrics are pushed out to Grafana at all. Incidentally, we also don’t see any errors in the cluster’s Prometheus server logs - if anything, the logs indicate that writes are succeeding.

Would definitely love some feedback/help on this - thanks so much in advance!

Hello!

To help debug, you will want to look at the prometheus-server container logs.
Within those logs you should see a remote_name value.

kubectl logs prometheus-server-*-* prometheus-server | grep remote_name

This remote_name value can be cross-referenced as a label on the prometheus_remote_storage_bytes_total metric within the Prometheus instance.

You can port-forward to your cluster’s Prometheus instance like below.

kubectl port-forward service/prometheus-server -n prometheus 8088:80

And query it using curl.

curl -fs --data-urlencode 'query=prometheus_remote_storage_bytes_total' http://localhost:8088/api/v1/query
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"prometheus_remote_storage_bytes_total","instance":"localhost:9090","job":"prometheus","remote_name":"XXXXXX","url":"https://prometheus-prod-10-prod-us-central-0.grafana.net/api/prom/push"},"value":[1677037797.435,"2893153"]}]}}

Or via the Prometheus web UI on the port-forwarded address.

If there is a positive value, then remote_write is successfully shipping that set of metrics.
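
To confirm that data keeps flowing over time (rather than the counter just being non-zero from an earlier push), you can also query the rate of that same counter through the port-forward - a quick sketch, assuming the 8088 forward from above:

curl -fs --data-urlencode 'query=rate(prometheus_remote_storage_bytes_total[5m])' http://localhost:8088/api/v1/query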

From the Grafana Cloud side, you can check your metrics ingestion with the Billing and Usage dashboard within your Grafana Cloud instance.

Adjustments to the incoming active series will be reflected on the Metrics Active Series panel.

Hope that helps you debug your remote_write configuration.

Hi Peter!

Thanks so much for the extremely detailed debug walkthrough here. Unfortunately it seems like Prometheus isn’t shipping anything.

In the original post, I’d mentioned a bunch of write_relabel_configs we’d tried; here’s a fourth variation I’ve since tried:

        write_relabel_configs:
          - action: keep
            regex: "kube_node_status_condition|kube_deployment_status_replicas_ready|kube_daemonset_status_desired_number_scheduled|kube_statefulset_status_replicas_available"
            source_labels:
              - __name__
          - source_labels: [exported_namespace, deployment]
            regex: "kube-system|coredns"
            action: keep
            separator: ";"
          - source_labels: [exported_namespace, deployment]
            regex: "monitoring|prometheus-server"
            action: keep
            separator: ";"
          - source_labels: [exported_namespace, daemonset]
            regex: "kube-system|node-local-dns"
            action: keep
            separator: ";"
          - source_labels: [exported_namespace, statefulset]
            regex: "porter-agent-system|porter-agent-loki"
            action: keep
            separator: ";"

So even this variation isn’t working. What always works is a single blanket rule that allows certain metrics with all of their labels:

        write_relabel_configs:
          - action: keep
            regex: >-
              kube_node_status_condition|kube_deployment_status_replicas_ready|kube_daemonset_status_desired_number_scheduled|kube_statefulset_status_replicas_available|kube_pod_status_ready|container_cpu_usage_seconds_total|container_memory_usage_bytes|kube_pod_container_resource_requests|kube_pod_container_resource_limits|kubelet_volume_stats_used_bytes|kubelet_volume_stats_capacity_bytes
            source_labels:
              - __name__

A lot of the documentation I’ve come across - plus tutorials from platforms like New Relic and others - seems to indicate that it’s possible to ship a specific set of metrics only when certain labels contain specific values, but I’m starting to wonder if that’s truly the case. If it isn’t - is there some kind of mechanism that allows us to dump/delete metrics before a specific window/date on Grafana Cloud? I’d reckon even that would go a long way towards not having a bloated metrics store.

Thanks once again!

Just verified - when I use the blanket rule, prometheus_remote_storage_bytes_total starts showing actual numbers, and I can see data for those metrics pop up - just with all of their label values.

        write_relabel_configs:
          - action: keep
            regex: >-
              kube_node_status_condition|kube_deployment_status_replicas_ready|kube_daemonset_status_desired_number_scheduled|kube_statefulset_status_replicas_available|kube_pod_status_ready|container_cpu_usage_seconds_total|container_memory_usage_bytes|kube_pod_container_resource_requests|kube_pod_container_resource_limits|kubelet_volume_stats_used_bytes|kubelet_volume_stats_capacity_bytes
            source_labels:
              - __name__

This makes me think that the remote write part works just fine, and that it’s definitely something I’m doing wrong with the write_relabel_configs.
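
For what it’s worth, the next thing I plan to try after re-reading the relabel_config docs: since a sample has to match every keep rule to survive (successive keep rules are effectively ANDed), collapse all of the conditions into a single keep rule whose regex enumerates the allowed name/namespace/workload combinations against the joined source labels. Untested sketch with only a few of the combinations shown - the label names and the prometheus-server container match are assumptions based on our setup:

        write_relabel_configs:
          - action: keep
            source_labels: [__name__, exported_namespace, deployment, container]
            separator: ";"
            regex: >-
              kube_node_status_condition;.*;.*;.*|kube_deployment_status_replicas_ready;kube-system;coredns;.*|kube_deployment_status_replicas_ready;monitoring;prometheus-server;.*|container_memory_usage_bytes;.*;.*;prometheus-server

If that works, the idea would be to extend the regex with one alternation branch per metric/label combination we care about.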