Azure IoT Edge module metrics in action

We are familiar with the metrics that Azure IoT Hub offers. The Azure cloud tells us, e.g., how many messages are received or how many devices are connected.

With Azure IoT Edge, however, you had to collect your own metrics in the past.

Because IoT Edge modules are Docker containers and therefore sandboxed, you had to rely on (third-party) logic to capture host metrics. Metrics about the edge agent and the edge hub themselves were not available.

Until now.

With the most recent versions of the IoT Edge runtime, agent, and hub, we have access to Edge metrics.

Both the Agent and Hub module expose the metrics over HTTP endpoints:

Within the Moby runtime, each of the two modules exposes port 9600. Outside the runtime, we have to map each module to its own host port, because two containers cannot bind the same host port.

Let’s see what this looks like and how we can harvest the metrics in a custom container.

Container Create options

It starts with assigning host ports to the two modules. Go to the runtime settings and set the container create options:

Observe two different port numbers.

Here are the individual settings.

Edge Agent options

I assigned port 9601:

{
  "ExposedPorts": {
    "9600/tcp": {}
  },
  "HostConfig": {
    "PortBindings": {
      "9600/tcp": [
        {
          "HostPort": "9601"
        }
      ]
    }
  }
}

So, if I open this port on the host in a browser, the edgeAgent answers with its metrics as plain text.

Note: Give the edgeAgent some time to gather the telemetry.

Quite a few metrics are exposed.
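The same output can be fetched from a script instead of a browser. The following is a minimal Python sketch under a few assumptions that are not stated above: the metrics are served under the `/metrics` path, plain HTTP is used, and the device is reachable under the given host name. The helper names `metrics_url` and `fetch_metrics` are hypothetical.

```python
from urllib.request import urlopen


def metrics_url(host: str, host_port: int, path: str = "/metrics") -> str:
    """Build the URL of a metrics endpoint that is published on the host.

    The "/metrics" path is an assumption; adjust it if your runtime differs.
    """
    return f"http://{host}:{host_port}{path}"


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics text from the endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the port assignment above, `fetch_metrics(metrics_url("localhost", 9601))` would return the edgeAgent text when run on the device itself.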

Edge Hub options

Because the edgeHub has some container create options already by default, we have to merge the metrics settings.

I assigned port 9602:

{
  "ExposedPorts": {
    "9600/tcp": {}
  },
  "HostConfig": {
    "PortBindings": {
      "443/tcp": [
        {
          "HostPort": "443"
        }
      ],
      "5671/tcp": [
        {
          "HostPort": "5671"
        }
      ],
      "8883/tcp": [
        {
          "HostPort": "8883"
        }
      ],
      "9600/tcp": [
        {
          "HostPort": "9602"
        }
      ]
    }
  }
}
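If the deployment is generated by a script, the merge can be automated. The sketch below is one possible approach, not part of the IoT Edge tooling: `merge_create_options` and `metrics_options` are hypothetical helpers, and the default edgeHub bindings are copied from the settings shown above.

```python
import copy
import json


def merge_create_options(base: dict, extra: dict) -> dict:
    """Recursively merge `extra` into a copy of `base`; nested dicts are combined."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_create_options(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def metrics_options(host_port: int) -> dict:
    """Create options that publish the internal metrics port 9600 on `host_port`."""
    return {
        "ExposedPorts": {"9600/tcp": {}},
        "HostConfig": {
            "PortBindings": {"9600/tcp": [{"HostPort": str(host_port)}]}
        },
    }


# Default edgeHub port bindings, as shown in the settings above.
edge_hub_defaults = {
    "HostConfig": {
        "PortBindings": {
            "443/tcp": [{"HostPort": "443"}],
            "5671/tcp": [{"HostPort": "5671"}],
            "8883/tcp": [{"HostPort": "8883"}],
        }
    }
}

edge_hub_options = merge_create_options(edge_hub_defaults, metrics_options(9602))
print(json.dumps(edge_hub_options, indent=2))
```

The defaults are left untouched; the merged result contains the three existing bindings plus the new one for port 9600.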

This results in similar browser output for the edgeHub.

Note: at first, you will get some security warnings. This is probably due to how the edgeHub is secured with a certificate. Please accept the certificate (via the advanced settings of the browser).

Here you will see the metrics of the edgeHub. The edgeHub and the edgeAgent use the same format: the Prometheus format.

For those familiar with the Grafana visualization tooling: metrics in this format can be visualized in Grafana, typically through a Prometheus data source.
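In this format, a sample is written as a metric name, an optional set of labels between braces, and a value. As a rough illustration, the Python sketch below splits one such line into its parts. It is deliberately simplified (escaped characters in label values are kept as-is, and timestamps are ignored), and `parse_sample` is a hypothetical helper, not an existing library call.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Split one sample line into (metric name, labels dict, float value).

    Returns None for comments (# HELP, # TYPE), blank lines and anything
    else that does not look like a sample.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    name, raw_labels, raw_value = match.groups()
    try:
        value = float(raw_value)
    except ValueError:
        return None
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, value


# One line taken from the edgeAgent output of my test environment.
line = (
    'edgeAgent_iotedged_uptime_seconds{iothub="edgedemo-ih.azure-devices.net",'
    'edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",'
    'ms_telemetry="True"} 4106'
)
print(parse_sample(line))
```

Lines that are not samples, such as the `# HELP` and `# TYPE` comments, simply yield `None`.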

Here is an example of the output I gathered in my test environment (running both a Heartbeat module and a custom module we look at later on):

Edge Agent output
Response: # HELP edgeAgent_available_disk_space_bytes Amount of space left on the disk
# TYPE edgeAgent_available_disk_space_bytes gauge
edgeAgent_available_disk_space_bytes{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",disk_name="sda2",disk_filesystem="ext4",disk_filetype="HDD",ms_telemetry="True"} 30285934592
edgeAgent_available_disk_space_bytes{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",disk_name="sda1",disk_filesystem="vfat",disk_filetype="HDD",ms_telemetry="True"} 529387520
# HELP edgeAgent_total_time_running_correctly_seconds The amount of time the module was specified in the deployment and was in the running state
# TYPE edgeAgent_total_time_running_correctly_seconds gauge
edgeAgent_total_time_running_correctly_seconds{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",module_name="hb",ms_telemetry="False"} 1067.9547444
# HELP edgeAgent_used_cpu_percent Percent of cpu used by all processes
# TYPE edgeAgent_used_cpu_percent summary
edgeAgent_used_cpu_percent_sum{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",module_name="hb",ms_telemetry="False"} 0.11444853568283364
edgeAgent_used_cpu_percent_count{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",module_name="hb",ms_telemetry="False"} 3
edgeAgent_used_cpu_percent{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",module_name="hb",ms_telemetry="False",quantile="0.1"} 0.01647850855745721
# HELP edgeAgent_iotedged_uptime_seconds How long iotedged has been running
# TYPE edgeAgent_iotedged_uptime_seconds gauge
edgeAgent_iotedged_uptime_seconds{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",ms_telemetry="True"} 4106
# HELP edgeAgent_total_time_expected_running_seconds The amount of time the module was specified in the deployment
# TYPE edgeAgent_total_time_expected_running_seconds gauge
edgeAgent_total_time_expected_running_seconds{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",module_name="hb",ms_telemetry="False"} 1067.9547444
edgeAgent_total_time_expected_running_seconds{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="b3ac9b36-be4e-4b1c-af6a-ef23362d4fb7",module_name="edgeAgent",ms_telemetry="True"} 1067.9573255
[SKIP]

Note: due to the size of the message, I skipped a large part of it.

Edge hub output
Response: # HELP edgehub_reported_properties_total Reported properties update calls
# TYPE edgehub_reported_properties_total counter
edgehub_reported_properties_total{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",target="edge_hub",id="uno2372g/metrics",ms_telemetry="True"} 3
edgehub_reported_properties_total{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",target="upstream",id="uno2372g/metrics",ms_telemetry="True"} 3
edgehub_reported_properties_total{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",target="upstream",id="uno2372g/$edgeHub",ms_telemetry="True"} 6
# HELP edgehub_reported_properties_update_duration_seconds Time taken to update reported properties
# TYPE edgehub_reported_properties_update_duration_seconds summary
edgehub_reported_properties_update_duration_seconds_sum{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",target="edge_hub",id="uno2372g/metrics"} 1.2287810000000001
edgehub_reported_properties_update_duration_seconds_count{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",target="edge_hub",id="uno2372g/metrics"} 3
edgehub_reported_properties_update_duration_seconds{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",target="edge_hub",id="uno2372g/metrics",quantile="0.1"} 0.2503908
edgehub_reported_properties_update_duration_seconds{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",target="edge_hub",id="uno2372g/metrics",quantile="0.5"} 0.2503908
edgehub_reported_properties_update_duration_seconds{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",target="edge_hub",id="uno2372g/metrics",quantile="0.9"} 0.2503908
[skipped]

Note: due to the size of the message, I skipped a large part of it.
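Metrics of the type summary, like the update duration above, come with a `_sum` and a `_count` series next to the quantiles. Dividing the two gives the average per observation. A small worked example in Python, using the values from the edgeHub output above (`summary_average` is a hypothetical helper):

```python
def summary_average(sum_value: float, count_value: float) -> float:
    """Mean of all observations in a summary: <metric>_sum / <metric>_count."""
    if count_value == 0:
        return float("nan")
    return sum_value / count_value


# Values copied from edgehub_reported_properties_update_duration_seconds above.
update_sum = 1.2287810000000001
update_count = 3

average_seconds = summary_average(update_sum, update_count)
print(f"average update duration: {average_seconds:.4f} s")
```

For these three updates, the average comes out at roughly 0.41 seconds per call.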

Custom metrics module

As seen above, the metrics are also available to other IoT Edge modules, which have to call the internal HTTP endpoints. Each of the two modules exposes its own endpoint, so no port collision occurs inside the runtime.

The metrics are not distributed by the internal routing between the modules. This is a pity because it could be interesting to take action on changing metrics locally.

So we need to build a bridge.

Therefore, to expose the metrics in the internal routing, I wrote this custom metrics module. The Docker container is available here.

This module exposes the following ‘telemetry’ based on the metrics seen above:

{
  "Timestamp":"2021-01-02T14:10:49.7778998Z",
  "Uptime":68,
  "Disks":{
    "sda2":{
      "Size":60761956.0,
      "Used":31185848.0,
      "Free":29576108.0
    },
    "sda1":{
      "Size":523248.0,
      "Used":6268.0,
      "Free":516980.0
    }
  },
  "Modules":{
    "hb":{
      "Cpu099":0.02,
      "MessagesSentCount":210,
      "MessagesRecievedCount":210
    },
    "edgeAgent":{
      "Cpu099":0.16,
      "MessagesSentCount":0,
      "MessagesRecievedCount":0
    },
    "host":{
      "Cpu099":1.06,
      "MessagesSentCount":0,
      "MessagesRecievedCount":0
    },
    "metrics":{
      "Cpu099":0.04,
      "MessagesSentCount":0,
      "MessagesRecievedCount":0
    },
    "edgeHub":{
      "Cpu099":0.12,
      "MessagesSentCount":0,
      "MessagesRecievedCount":0
    }
  },
  "MessagesSentCount":210,
  "MessagesRecievedCount":210
}

Here we see, e.g., information about the disks on the host system and the CPU usage of the modules.
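Once this telemetry flows through the internal routing, another module can act on it. The Python sketch below shows two simple calculations on a trimmed copy of the message above: the used percentage per disk and the module with the highest `Cpu099` value. The helper names are hypothetical. As a side note, the `Free` values multiplied by 1024 equal the `edgeAgent_available_disk_space_bytes` values shown earlier, so the disk sizes appear to be expressed in KiB; the percentage does not depend on the unit.

```python
import json


def disk_usage_percent(telemetry: dict) -> dict:
    """Return the used percentage per disk, computed as Used / Size * 100."""
    usage = {}
    for name, disk in telemetry.get("Disks", {}).items():
        size = disk.get("Size", 0)
        if size > 0:
            usage[name] = 100.0 * disk["Used"] / size
    return usage


def busiest_module(telemetry: dict) -> str:
    """Name of the entry under "Modules" with the highest Cpu099 value."""
    modules = telemetry["Modules"]
    return max(modules, key=lambda name: modules[name]["Cpu099"])


# Trimmed copy of the message shown above.
message = json.loads("""
{
  "Disks": {
    "sda2": {"Size": 60761956.0, "Used": 31185848.0, "Free": 29576108.0},
    "sda1": {"Size": 523248.0, "Used": 6268.0, "Free": 516980.0}
  },
  "Modules": {
    "hb": {"Cpu099": 0.02},
    "edgeAgent": {"Cpu099": 0.16},
    "host": {"Cpu099": 1.06},
    "metrics": {"Cpu099": 0.04},
    "edgeHub": {"Cpu099": 0.12}
  }
}
""")

for disk, percent in disk_usage_percent(message).items():
    print(f"{disk}: {percent:.1f}% used")
print("busiest:", busiest_module(message))
```

A real module would compare these numbers against a threshold and, for example, send an alert message when a disk runs full.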

This is just a small portion of the actual metrics seen in the original messages. The problem I faced was that I could not find a usable C# library for parsing Prometheus messages.

So I needed something else…

RegEx to the rescue

If we look at the edgeHub Prometheus output, this is the part that shows how many messages one module (the heartbeat module) sent to the cloud (upstream) through the Edge Hub:

# HELP edgehub_messages_sent_total Messages sent from edge hub
# TYPE edgehub_messages_sent_total counter
edgehub_messages_sent_total{iothub="edgedemo-ih.azure-devices.net",edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",from="uno2372g/hb",to="upstream",from_route_output="output1",to_route_input="",priority="2000000000",ms_telemetry="True"} 210

I wrote a few RegEx lines to extract the data from the messages.

This is the expression that extracts the number of upstream messages, for multiple modules:

edgehub_messages_sent_total{[\.a-zA-Z""=_,0-9\/-]*,from="([a-zA-Z0-9\/]*)",to="upstream"[\.a-zA-Z""=_,0-9\/-]*} (\d*)

This can be tested with online tooling, such as a regex tester website.

Run against the sample above, this RegEx captures both the sending module (uno2372g/hb) and the correct number of messages (210).
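The expression can also be tried out in a few lines of Python. This is only a sketch for experimenting, not the code of the module: the literal braces are escaped, the doubled quotes (which look like a leftover of C# string escaping) are reduced to single quotes, and `upstream_message_counts` is a hypothetical helper.

```python
import re

# The expression from above, lightly adapted for Python (see the remarks above).
UPSTREAM_RE = re.compile(
    r'edgehub_messages_sent_total\{[.a-zA-Z"=_,0-9/-]*'
    r',from="([a-zA-Z0-9/]*)",to="upstream"'
    r'[.a-zA-Z"=_,0-9/-]*\} (\d*)'
)


def upstream_message_counts(metrics_text: str) -> dict:
    """Map each sender (device/module) to its number of upstream messages."""
    counts = {}
    for sender, count in UPSTREAM_RE.findall(metrics_text):
        if count:
            # Add up the samples in case one sender appears on several lines.
            counts[sender] = counts.get(sender, 0) + int(count)
    return counts


# The edgeHub sample shown above.
sample = (
    '# HELP edgehub_messages_sent_total Messages sent from edge hub\n'
    '# TYPE edgehub_messages_sent_total counter\n'
    'edgehub_messages_sent_total{iothub="edgedemo-ih.azure-devices.net",'
    'edge_device="uno2372g",instance_number="89f5f392-d9d0-4e00-b8ef-cd845ff42e52",'
    'from="uno2372g/hb",to="upstream",from_route_output="output1",'
    'to_route_input="",priority="2000000000",ms_telemetry="True"} 210\n'
)

print(upstream_message_counts(sample))
```

Note that the capture group for the sender only allows letters, digits, and slashes, so module names containing other characters, such as an underscore or a hyphen, would not be matched by this expression.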

Contribution to open source

The module is just covering the basics.

It could be improved by supporting, e.g., child device metrics. So I invite you to submit pull requests.

Conclusion

We have seen that Azure IoT Edge metrics are now supported by the Edge Hub and Edge Agent modules.

Out of the box, these metrics are made available over HTTP endpoints.

By using the custom metrics module, we are able to integrate the metrics into the IoT Edge (non-functional) logic using the internal routing.