Vonage-status-panel not showing status changes, alias, or value

Hi,
I’m using Grafana 9.3.2 on Oracle Linux Server 8.7.

I am attempting to display a server’s up status over time. Depending on how many times up has returned 0 in the past 15 minutes, the panel should show green, yellow, or red, and it should also show the up count. At present I am just trying to get this to work for a couple of servers. Once I can prove the concept, I hope to use a template to show a panel of servers.

The vonage status plugin seemed to be a good fit for this task. I created three queries:

A: sum_over_time(up{instance="hostname:9200"}[1m])
B: sum_over_time(up{instance="hostname:9200"}[5m])
C: sum_over_time(up{instance="hostname:9200"}[15m])

Thresholds:
A: Warning 2 Critical 0
B: Warning 4 Critical 2
C: Warning 7 Critical 5

For all queries:
Aggregation is Last
Handler Type is Number Threshold
Display Alias is Always True
Display Value is When Alias is Displayed

I don’t have any servers with an in-between state, only those that have been reporting up or down for very long periods of time. No matter which server I put in the query, the status is displayed as green and the value and alias are never displayed. I can verify the queries are returning the expected results by viewing them in Table mode. Servers reporting up show as “1” and servers that are not reporting up show as “0”.

I expected the status box for a server that was “up” to display as green and to show the alias and value. The documentation is not clear about which alias and value it would show, so I expected to discover that by trial and error. I did not expect it to show nothing when every query said to always display the alias.
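To make the intended behavior concrete, here is a minimal Python sketch of the classification being described. The function name and the comparison rules (values at or below a threshold trigger that state) are assumptions for illustration only; they are not taken from the Status Panel plugin.

```python
def classify_status(value, warning, critical):
    """Classify a 'higher is healthier' value such as sum_over_time(up[...]).

    Assumption: values at or below `critical` are red, values at or below
    `warning` are yellow, everything else is green. This only spells out
    the intent; it is not the plugin's implementation.
    """
    if value <= critical:
        return "red"
    if value <= warning:
        return "yellow"
    return "green"


# Thresholds for query C above: Warning 7, Critical 5.
for value in (15, 6, 0):
    print(value, classify_status(value, warning=7, critical=5))
```

Under these assumed rules, a server that has been up for the whole window lands in green and one that has been down the whole time lands in red.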

I reduced the queries to just the following to make it easier to debug:

A: sum_over_time(up{instance="hostname:9200"}[15m])

It still returned the same results, i.e., the color was always green and no alias or value was displayed.

This is the Grafana log from the moment I hit “Run Queries” to when it stopped writing (edited to remove a few hopefully extraneous lines):

logger=auth t=2023-01-20T17:24:47.815566724Z level=debug msg="seen token" tokenId=3 userId=1 userAgent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
logger=datasources t=2023-01-20T17:24:47.82804432Z level=debug msg="Querying for data source via SQL store" id=3 orgId=1
logger=query_data t=2023-01-20T17:24:47.828949686Z level=debug msg="Processing metrics query" query="unsupported value type"
logger=tsdb.prometheus t=2023-01-20T17:24:47.829916535Z level=debug msg="Sending query" start=2023-01-20T17:09:47.511Z end=2023-01-20T17:24:47.511Z step=15s query="sum_over_time(up{instance="hostname:9200"}[15m])"
logger=context userId=1 orgId=1 uname=admin t=2023-01-20T17:24:48.037539615Z level=debug msg="Updating last user_seen_at" user_id=1
logger=tsdb.prometheus t=2023-01-20T17:24:48.04307898Z level=debug msg="Sending resource query" URL=api/v1/labels
logger=context userId=1 orgId=1 uname=admin t=2023-01-20T17:24:48.205019479Z level=debug msg="Updating last user_seen_at" user_id=1
logger=tsdb.prometheus t=2023-01-20T17:24:48.206462678Z level=debug msg="Sending resource query" URL="api/v1/label/name/values?start=1674234589&end=1674235489"

The only odd thing in the log is the “unsupported value type” message, but that seems to happen prior to sending the actual query.

Any help would be appreciated. Thanks.

Welcome

I think you will get better traction for an answer if you trim the wall of log output to be a bit more succinct.

Thanks for the feedback. Yeah, I agree there is a lot there. I will try to remove some lines that I hope are less relevant.

Hi @brymatm,

Short answer:
You should set Aggregation to Sum inside the Status panel for each query (hide all queries except one and loop through each of them). After hiding each query you have to close and reopen the Options menu in the Status panel (you don’t need to exit the panel). Also, please check that the thresholds are different for each query (in the Status panel), as you described.

Simulation:
I set up a telegraf agent to monitor port 8282 on localhost with the net_response plugin. Data is collected every 20s. If the port is open/listening I get result_code = 0, and when it is closed I get result_code = 2. I am using grafana v 9.3.2 and influxdb 2.6.1, with the bucket test mapped to DBRP (so that I can use Influx QL instead of Flux).

Telegraf configuration with net_response plugin
[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8086"]
  token = "<Your token with write permission on bucket test>"
  organization = "Org"
  bucket = "test"
  [outputs.influxdb_v2.tagpass]
    config = ["system"]


[[inputs.net_response]]
  protocol = "tcp"
  address = "localhost:8282"
  [inputs.net_response.tags]
    config = "system"

In Grafana I created Status panel with 3 queries:

Grafana queries
  • A query (aliased as Port_8282_1m)
SELECT sum("result_code") FROM "net_response" WHERE time <= now() AND time >= now()-65s GROUP BY time($__interval), "port", "server", "protocol" fill(none)
  • B query (aliased as Port_8282_5m)
SELECT sum("result_code") FROM "net_response" WHERE time <= now() AND time >= now()-305s GROUP BY time($__interval), "port", "server", "protocol" fill(none)
  • C query (aliased as Port_8282_15m)
SELECT sum("result_code") FROM "net_response" WHERE time <= now() AND time >= now()-905s GROUP BY time($__interval), "port", "server", "protocol" fill(none)

Note that in each query I added an extra 5s (65s instead of 1m). The reason is to avoid value flapping when the oldest data point drops out of the time interval (without it, the value would go from 6 to 4 and back to 6 for the A query).
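As a rough check on the numbers involved, the following Python sketch estimates the largest sum each query can return while the port stays closed, assuming one data point every 20s with result_code = 2. The function name is hypothetical, and the exact count of points in a window depends on how timestamps align, so treat this as an estimate only.

```python
def max_window_sum(window_seconds, interval_seconds=20, failure_code=2):
    """Estimate sum("result_code") over a window while the port is closed.

    Assumes one data point per collection interval, each carrying
    `failure_code`; timestamp alignment can add one more point.
    """
    samples = window_seconds // interval_seconds
    return samples * failure_code


for name, window in (("A", 65), ("B", 305), ("C", 905)):
    print(name, max_window_sum(window))
```

For the A query this gives 6, which matches the value mentioned above, and the estimates for B and C give a sense of the range in which thresholds for those queries have to sit.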

Then I configured the Status panel for each query with different thresholds. The thresholds are:

A: Warning 2, Critical 4
B: Warning 10, Critical 20
C: Warning 40, Critical 60
Grafana: Status panel configuration for each query

A query:

B query:

C query:

Note that for each query (A, B, C) the other two queries are temporarily hidden in order to show/configure the thresholds for that specific query. After each hiding you need to close and reopen the Options menu, which is on the upper right side of the Status panel.

Result:
With that configuration I managed to get state/color changes when thresholds are met for the 3 queries, which look at different time ranges (-1m, -5m, -15m).

Status panel: Pictures of state changes

Port_8282_1m goes from critical to warning:

Port_8282_1m goes from warning to ok:

Port_8282_5m goes from critical to warning:

Port_8282_5m goes from warning to ok:

Port_8282_15m goes from critical to warning:

Port_8282_15m goes from warning to ok:

Note: The position of the tooltip shows that the queries indeed look only at the specified time range (-1m, -5m, -15m). Also notice that the alias names (Port_8282_*) change order based on state (warning is always last). Also, don’t be bothered by the wrong title of the right panel (Status panel) - I forgot to change it :slight_smile: .

Bonus:
For tracking state changes over time I suggest using the State Timeline panel, which is preinstalled and easier to use. Example of State Timeline for the same dataset:

State Timeline

State Timeline panel:

State Timeline configuration (value mapping 0 to OK and color green, 2 to CRIT and color red):

Note: The query is tagged by port, so if you monitored more ports you would have more rows in State Timeline.

Hope that helps :slight_smile:

Best regards,
ldrascic

I am extraordinarily grateful for the time and effort you put in toward responding to my query. A couple things I neglected to mention: One is that we’re using Prometheus 2.37.5 as the data source for Grafana. The other is that the eventual goal is an at-a-glance status dashboard with panels for all the systems we’re monitoring. If the panel is green, everything is good. If yellow, there has been some lack of response over the past X minutes. If red, the system has been down over that time period.

The State Timeline you mentioned would not work for the status dashboard concept, but I do appreciate the suggestion.

It looks like the telegraf net_response SELECT sum("result_code") query is somewhat the reverse of the Prometheus sum_over_time(up) query that I’m running: “up” returns 1 for up and 0 for down, so a higher sum is healthier, whereas a higher result_code sum means more failures.
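The difference in direction can be illustrated with a small Python sketch. The sample list and function name are made up for illustration. In PromQL, a comparable failure count could presumably be written as count_over_time(up[15m]) - sum_over_time(up[15m]), though that has not been tested against the Status panel.

```python
def summarize_up(samples):
    """Return (sum of up samples, number of failed scrapes).

    `samples` is a list of 0/1 values, as the up metric would report
    for each scrape inside the window.
    """
    total_up = sum(samples)           # what sum_over_time(up[...]) adds up
    failed = len(samples) - total_up  # scrapes where up was 0
    return total_up, failed


# Hypothetical window with five scrapes, two of which failed.
print(summarize_up([1, 1, 0, 1, 0]))
```

Counting failures instead of successes would make the value behave like the result_code sum, where larger numbers are worse.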

What perplexes me is that I get results from the query, as shown by the table view, but the panel does not reflect that, even after changing the aggregation to Sum. Thanks also for including the images in your response. I didn’t think to add images and didn’t know about the ability to “hide the details”. Here are images of my test query that show the panel view and the table view:

Status Panel with Settings

Status Panel Table

Does this shed any more light on the problem? Thanks!

Unfortunately, I think that the Prometheus data source is not supported by the Status Panel plugin. I am not a Prometheus user, but I think that Prometheus has no function similar to ALIAS BY in Influx QL.

From Grafana plugin page - Vonage Status Panel:

Supported Data Sources

Currently the plugin was tested with influxDB and Graphite. Support for other data sources could be added by demand

However, maybe you can do something similar with default Stat panel. Here is a presentation:

Configuration:

1. Dashboard variable configuration

In order to repeat the panel for each host, you need to create a dashboard variable with a query that returns all hosts which send that specific metric (in your case, instance availability).

Go to Dashboard settings (cog icon near time) → Variables → New variable

Query:

net_response_result_code{}

Without regex this would return:

{__name__="net_response_result_code", config="netresponse", host="fedora", instance="192.168.101.133:9125", job="Telegraf", port="8282", protocol="tcp", result="success", result_type="success", server="localhost"}
{__name__="net_response_result_code", config="netresponse", host="monitoringserver", instance="localhost:9125", job="Telegraf", port="8282", protocol="tcp", result="connection_failed", result_type="connection_failed", server="localhost"}

Regex to extract host only:

/host="(?<hostvalue>[^"]+)"/
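To see what this regex extracts, here is a small Python sketch that applies the same pattern to the two series shown above (label sets trimmed for brevity). Note that Python spells named groups as (?P<name>...) rather than the (?<name>...) form used in the Grafana variable editor, so the pattern is adapted accordingly.

```python
import re

# Same pattern as the Grafana regex, using Python's named-group syntax.
HOST_RE = re.compile(r'host="(?P<hostvalue>[^"]+)"')

# The two series from the variable query, with some labels trimmed.
series = [
    '{__name__="net_response_result_code", host="fedora", '
    'instance="192.168.101.133:9125", server="localhost"}',
    '{__name__="net_response_result_code", host="monitoringserver", '
    'instance="localhost:9125", server="localhost"}',
]

hosts = [HOST_RE.search(s).group("hostvalue") for s in series]
print(hosts)
```

Only the value of the host label is captured; the "localhost" values in the instance and server labels are not matched because the pattern requires host=" immediately before the value.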

Now go back to the dashboard, set your variable to the All option, and save the dashboard with the option “Save current variables values as dashboard default”. This will ensure that your variable is always set to All (important for the repeating panel).

Grafana dashboard variable save

Now go back to Dashboard settings (cog icon near time) → Variables → <variable_name> and set “Show on dashboard” to “Nothing” to prevent the variable’s value from being changed. Save the dashboard.

2. Panel configuration

Create a query in the panel, set Panel options → Repeat options → Repeat by value → your_variable_name, and adjust Max per row to the desired value. Note: The query needs to be filtered by your_variable_name (e.g. host="$host") so that each panel shows 1 status for 1 value of the variable (e.g. host).

Query:

sum_over_time(net_response_result_code{host="$host"}[5m])
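Conceptually, repeating the panel means the query is evaluated once per variable value. The Python sketch below mimics that with string.Template, using the two hosts from the variable example. It is only an illustration of the substitution, not how Grafana interpolates variables internally.

```python
from string import Template

# The panel query, with $host standing in for the dashboard variable.
QUERY = Template('sum_over_time(net_response_result_code{host="$host"}[5m])')

# Values the dashboard variable would return (taken from the example above).
hosts = ["fedora", "monitoringserver"]

# One query per repeated panel.
queries = [QUERY.substitute(host=h) for h in hosts]
for q in queries:
    print(q)
```

Each repeated panel ends up with its own filtered query, which is why the filter on the variable is required.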

Set panel title to dashboard variable and set repeat option:

Set text mode to value:

Define Value mappings to strings OK/Warning/Critical (and coloring) for specific ranges:

Optional: Data link

Optional: you can set a data link so that clicking a panel redirects you to a URL (e.g. to another dashboard which shows details about that host. Note: that dashboard should use a dashboard variable too). Because I am passing Server=${host:queryparam}, the destination dashboard will open for that specific host. Data link:

http://192.168.101.131:3000/d/fkwAvHo4k/linux-hosts?orgId=1&var-datasource=test&var-Server=${host:queryparam}&var-cpu_cores=All

Define value mappings:

Note: If you want to edit anything on the panel (query, max per row, resize…), edit only the first panel, then save and reload the dashboard.

And that’s it :slight_smile:

Best regards,
ldrascic