Azure Data Explorer connector for IoT Hub with real-time ingest

Azure Data Explorer (ADX) is a fully managed, high-performance, big data analytics platform that makes it easy to analyze high volumes of data in near real-time. The Azure Data Explorer toolbox gives you an end-to-end solution for data ingestion, query, visualization (including third-party dashboards), and management.

ADX can ingest both data from (persisted) storage and data provided as a stream:

Ingested data is managed and optimized by the underlying cluster so it can be queried with a powerful query language and made available as views in (third-party) platforms like Power BI, Grafana, etc.

Data can also be ingested directly from Azure IoT Hub, as a stream for real-time analysis.

Let’s check out how this works.

What about Time Series Insights?

For those who have experience with Azure Time Series Insights (TSI), this all sounds familiar.

The TSI service also ingests (and internally transforms) data from streams (like IoT Hub) and makes it possible to query that data, mostly using the visual editor. Views can then be shared in Power BI.

ADX can be seen as a superset of TSI, mainly due to the recent addition of support for ingesting data as real-time streams.

In my previous blog post about Azure Time Series Insights, I introduced TSI as a powerful dashboard to slice and dice your IoT time-series data.

Unfortunately, TSI will not be serviced after March 2025:

Don’t panic! This is still three years from now and a migration path is already offered (all notifications reference that same page, which explains how to migrate). So, plan your migration after reading this post 🙂

According to that same documentation, there are lots of advantages to switching over to ADX:

And I agree that ADX has good credentials when it comes to managing lots of data from different sources using a rich and powerful query language.

For example, with ADX you can separate different message formats into multiple tables, and data can also be deleted. These features are hard to find in TSI.

Unfortunately, there are disadvantages too:

  • No rich interactive visual tooling (time sliders, swimming lanes, graphs, time-shifting, pointers, etc.)
  • Getting up to speed with the Kusto Query Language (KQL) will take some time
  • Different pricing mechanisms due to the usage of a compute cluster

Let’s see how this will evolve in the near future. At this point, you need to invest some time to learn how ADX handles time series, but thanks to the elaborate query language, you will end up with more advanced capabilities.

ADX is a first-class citizen of Azure and is now the basis of multiple Azure solutions; just look for the typical Kusto Query Language. For example, Azure Monitor and Azure Application Insights are tightly integrated with ADX logic.

Note: Kusto is an ode to Jacques Cousteau, the famous explorer. With KQL you can explore an ocean of data 🙂

Let’s check out how we can start using ADX to explore IoT Data.

IoT Hub connection

Only recently, ADX added support for ingesting data from streams in real time:

Notice the real-time apps and dashboard to the right.

ADX supports three ways of ingestion in the Azure portal interface:

  1. Event Hub
  2. IoT Hub
  3. Blobs, written to blob containers (accessible via Event Grid events)

Note: there are many more ways to ingest data, such as custom code solutions, native interaction via connectors, and various other tools:

Interestingly, this is all still batch ingestion by default. To allow real-time streaming, we have to enable that capability on the cluster.

Today, we focus on IoT Hub stream ingestion. For that IoT Hub streaming solution, there are a number of steps to perform:

  1. Enable streaming on the ADX environment (or create an ADX cluster with that option)
  2. Check the fields of the incoming message data. Which values do you want to ingest?
  3. Create an ADX database/table and define the columns for fields you want to ingest
  4. Check the system properties. Which IoT Hub properties do you want to ingest?
  5. Create a mapping for incoming messages together with the system properties
  6. Create an IoT Hub streaming connection
  7. See incoming messages
  8. Do some querying and reporting

Let’s check out how to perform these steps and have live data ingested in (near) real-time.

Note: Please be aware that with ADX, you need specific AAD roles to create, alter, read, write, etc. Having a Contributor role (or Administrator) helps a lot during this exercise.

1 Enable streaming on the ADX environment (or create an ADX cluster with that option)

First, we create an ADX cluster using the Azure portal.

Note: Here, I start from scratch, without any ADX environment. If you have a cluster already, just read along; you only have to change one property afterward to get on par.

On the first page, you have to select the resource group and region. Give the ADX cluster a unique name:

We also have to specify the compute for the cluster. This is not that hard: just select an SKU that matches your needs. You can resize it later on.

Here, I select one of the cheapest DEV/Test SKUs:

I also limited the number of availability zones (optional).

Next, the scale of your cluster has to be set on the second page:

By default, both manual scale and one instance are set. Optimized ‘autoscale’ is a very interesting solution once you start to ingest and query more production data. If you do not scale in time, your solution could be throttled.

For now, the manual scale option and an instance count of one are good enough for me, so I will get to experience the impact of heavy usage firsthand.

On the next Configuration page, we enable and disable capabilities:

By default, ADX works with batches of data to limit resource usage and related costs!

Remember, ADX will transform incoming data into the most efficient format for querying later on. This is normally done in batches. The batching depends on a number of parameters: by default, five minutes of ingestion time, 1 GB of data ingested, or 1000 items ingested. If the stream of data into each table is high (over 4 GB per hour), really consider using batch ingestion!
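These batching limits can be tuned per database (or per table) with the ingestion batching policy. As a minimal sketch (the values, and the ‘streamingdata’ database name created later in this post, are just examples), shortening the batching window could look like this:

.alter database streamingdata policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:00:30", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'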

Note: streaming will put some computing pressure on the cluster, including all databases inside it. Check the documentation to see whether streaming will work in your case, given the expected number of messages. These properties can still be changed after the cluster has been created.

Enabling streaming is great for IoT solutions where a latency of less than a second is required but the load is not that high!

Regarding the other configuration settings, Security, Network, Diagnostics, and Tags, we accept the default settings.

Review and create the new cluster.

Note: creating the cluster will take some time.

If you had created a cluster already, check the configuration and enable streaming ingestion if needed:

2 Check the data fields of the incoming message. Which values do you want to ingest?

Azure Data Explorer works with databases and tables.

It is recommended (if not mandatory) to create a separate table for each type of incoming message (e.g. humidity, pressure, heartbeat).

Later on, we will be asked by the streaming connection to specify the database, the table, and the table columns which have to match the message type structure. So, let’s check the incoming message format.

Here, we will use this Raspberry Pi simulator as an IoT Hub device, based on the Azure IoT Device SDK:

The message sent by this device contains four fields as part of the JSON message:

  1. ‘messageId’
  2. ‘deviceId’
  3. ‘temperature’
  4. ‘humidity’

It also supports an application/user property named ‘temperatureAlert’.
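For reference, a representative message body could look like this (the values are made up for illustration; the ‘temperatureAlert’ property travels outside this body as a user property):

{
  "messageId": 12,
  "deviceId": "Raspberry Pi Web Client",
  "temperature": 27.1,
  "humidity": 64.3
}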

Note: Application/user properties are not ingested by ADX at this moment. To prove this, we take this column into account anyway.

Now that we know the message format, we are ready to create a database and table definition.

This simulation device works well with an IoT Hub device connection. So, give it the device’s primary symmetric key connection string and check if the IoT Hub is receiving the messages.

Also, add an extra consumer group on the ‘Event Hub compatible endpoint’ of the IoT Hub (I use ‘adx’) because we want to send all incoming messages to ADX later on:

3 Create an ADX database/table and define the columns for fields you want to ingest

By now, the cluster should be created:

See that the state is set to ‘Running’. Optionally, check out settings like the scale and the configuration.

First, we need to add a database before we can ingest data into a table (inside that database).

Find the menu Databases:

Select Add database:

Here, I create the ‘streamingdata’ database and disable the retention. This way, all ingested data will exist (and grow) forever. I set the cache (for extra fast in-memory querying) to 31 days:

To get the best performance, you need to keep the timespan of your query data limited to the cache period.

Note: If you want to limit costs, play with the retention time of both database (storage) and cache (memory). You can still change this afterward but data older than the retention time will be removed/destroyed forever!
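If you prefer to set or change these policies from the query editor afterward, there are management commands for that too. A minimal sketch, assuming the ‘streamingdata’ database used here (ten years of retention and 31 days of hot cache as example values):

.alter-merge database streamingdata policy retention softdelete = 3650d

.alter database streamingdata policy caching hot = 31d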

Press the Create button.

The database is now created:

Select the database and a new Azure Portal pane will show the details and database-related menu.

Adding tables will be done using the Query editor, inside this database.

So, within the Database context, click the ‘Query’ menu item:

The Query editor will be shown. Notice the database already is selected, both in the list of databases and as the scope for the query to build.

Just for fun, try these two queries:

.show cluster 

.show cluster databases 

The first shows the number of instances within your cluster (including the SKU size) and related information.

The second one shows information about the databases inside your cluster.

Now, execute the following command to add a table inside your database:

.create table Telemetry (
    messageId: int,
    deviceId: string,
    temperature: decimal,
    humidity: decimal,
    temperatureAlert: string,
    IotHubDeviceId: string,
    IotHubEnqueuedTime: datetime
)

Note: If we look at the columns, we see the usual suspects, which reference the incoming message fields (messageId, deviceId, temperature, humidity). There are also extra columns which are not part of the incoming telemetry. We will learn about those two IoT Hub-related columns soon.

Executing the command will result in the creation of the ‘Telemetry’ table:

So, the table is created. What about these two extra columns?

4 Check the system properties. Which IoT Hub properties do you want to ingest?

We leave the query editor for a moment and move back to the Database pane.

In there, the menu ‘Data connections’ is shown. There, data connections can be added, including IoT Hub connections:

We want to add this IoT Hub data connection:

In this connection, we have to fill in a unique name and select the right (already existing) IoT Hub. We also need to specify which IoT Hub-related system properties must be added:

Update: this screen is a bit outdated (the UX has been modernized a bit) but the functionality shown is still the same.

The two IoT Hub properties I selected match the two extra columns I already added to the ‘Telemetry’ table definition.

Note: as seen above, I selected the ‘adx’ consumer group I added to the IoT Hub earlier, so it is now selectable here.

Note 2: Here, the original IoT Hub iothubowner policy is used. It’s better to create another policy with ‘registry read’ and ‘service’ rights first. That way, you keep track of policies being used outside the IoT Hub, and you can rotate specific keys once in a while.

Wait, we are not done yet; we still need to add more information!

How does the connection know where to drop incoming data and how to place the right data fields into the table columns?

The connection will be made against that ‘Telemetry’ table and, not coincidentally, we have to fill in a table name:

So, it’s clear we need to fill in the correct table name which is ‘Telemetry’.

Note: there is a switch called ‘allow routing the data to other databases’. Incoming messages can then carry user properties describing the table name, data format, and mapping name. This is called dynamic routing. In this post, we go for the original static routing by filling in the other fields.

We also define that the incoming messages are formatted as JSON. This is quite obvious for IoT devices using the Azure IoT Device SDK; most of the examples work with JSON. If not, check out the other formats. If none of them work for you, you need to add an extra Azure Function or Azure Stream Analytics logic between the IoT Hub and ADX to transform (or deflate) messages into a suitable format.

Note: There are more than ten other supported formats to choose from:

We are still missing one more thing: describing the mapping between the incoming message fields and the table columns.

So, last but not least, why do we have to supply a Mapping name?

5 Create a mapping for incoming messages together with the system properties

Warning: Do not save or leave the page yet. At this point, we are still constructing that connection between some data source and a table. Do not cancel the creation dialog…

How does ADX know which value from ANY possible message format maps onto which column of the table?

For that, we need to define an ADX table mapping…

We can add one or more mappings to a table. Each mapping has a name and describes for each column the path of the field inside the incoming message, based on the format. So, the mapping of a JSON message will differ from the mapping of a CSV file/message.

For an incoming JSON message, such a mapping looks like this:

.alter table Telemetry policy streamingingestion enable

.create table Telemetry ingestion json mapping "JsonTelemetryMapping"
    '['
        '{"Column": "messageId", "Properties": {"Path": "$.messageId"}},'
        '{"Column": "deviceId", "Properties": {"Path": "$.deviceId"}},'
        '{"Column": "temperature", "Properties": {"Path": "$.temperature"}},'
        '{"Column": "humidity", "Properties": {"Path": "$.humidity"}},'
        '{"Column": "temperatureAlert", "Properties": {"Path": "$.Properties.temperatureAlert"}},'
        '{ "column" : "IotHubDeviceId", "Properties":{"Path":"$.iothub-connection-device-id"}},'
        '{"Column": "IotHubEnqueuedTime", "Properties": {"Path": "$.iothub-enqueuedtime"}}'
    ']'

Actually, these are two separate commands!

We first enable streaming for this individual table. Then, we describe the mapping and add it to the table.

Regarding the policy, we tell ADX that we want streaming abilities for this table. Optionally, we can even provide some information about the expected data (e.g. hintAllocatedRate) to optimize the streaming mechanism.
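For reference, the same policy can also be set with an explicit JSON body that includes such a hint. This is a sketch only; the rate value is arbitrary and the exact property names should be verified against the documentation:

.alter table Telemetry policy streamingingestion '{"IsEnabled": true, "HintAllocatedRate": 2.1}'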

The second command is related to the mapping.

As you can see, using the mapping, all columns are mapped using their individual path.

Note: the user property ‘temperatureAlert’ will never work! I tried it in several ways but, in the end, ADX does not ingest application/user properties. So, you can leave out:

'{"Column": "temperatureAlert", "Properties": {"Path": "$.Properties.temperatureAlert"}},'

Note: This is also the case with IoT Hub message enrichments! These are added as user properties. For now, it is recommended to add an Azure Function or make use of Azure Stream Analytics between the IoT Hub and ADX to bring message properties over to the message body.

I named this mapping ‘JsonTelemetryMapping’, which is the value to fill in in the data connection’s Mapping name field. Unfortunately, the connection cannot be created BEFORE the mapping is saved:

So, you can either execute both commands using a separate browser tab pointing to the query editor, or leave this connection creation page, execute the commands, and retry the connection creation afterward.

6 Create an IoT Hub streaming connection

I suggest opening a second browser page and navigating to the query pane.

There, execute both the policy command and the new mapping command:

Now, we finally can create the connection:

Note: this takes a few seconds.

Now, the connection is created:

If we are lucky, we even see the first metrics regarding processed messages using the link to the health section of the connection:

Still, these are metrics. We want to see the arrival of actual messages.

7 See incoming messages

Again, go back to the query pane and select the count of the Telemetry table.

Using KQL, this is quite simple:

Telemetry
| count

Press the Run button. The answer shows the number of rows:

Note: Do you see the query is highlighted when you put the cursor in it? In that case, press shift-enter to execute the query. This is a minor editor feature that saves time over and over again.

If you repeat this a number of times, you will see the data is updated as a live stream!

Yes, you are looking at live, real-time data in ADX.

Note: If this count is not updated at a similar rate as the device is generating data, either the configuration went wrong somewhere or the ingestion rate is too high! If the streaming ingestion is too much (think about that 5 minutes, 1000 messages, 1 GB of data limitation), it will fall back to using batches again (probably with a five-minute interval).
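If you want to check whether (batched) ingestion ran into errors, a management command like the following can help; for streaming-specific issues, the data connection’s metrics are the better place to look:

.show ingestion failures
| order by FailedOn desc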

Note: The simulation we use can be throttled inside your browser (if the simulation’s tab page is not in focus, the browser may decide to pause JavaScript execution inside it). Check your browser for this too.

8 Do some querying and reporting

Finally, we can start querying!

Although there are a number of great tutorials regarding the Kusto Query Language (see the Conclusion section below), let’s try some queries to experience some of the power of KQL.

Here, I summarize the number of incoming messages per device:
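The query itself is shown in the screenshot; a minimal version could look like this:

Telemetry
| summarize count() by deviceId // message count per device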

Then, this is how to show the number of messages per device, with the condition ‘temperature is higher than thirty’:
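Again as a sketch, that query could be written as:

Telemetry
| where temperature > 30 // keep only the warmer messages
| summarize count() by deviceId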

Here, the editor shows me a chart of how many messages fall into groups of two degrees (20-21, 22-23, etc.):
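A query along those lines, using bin() to group the temperatures and rendering a column chart:

Telemetry
| summarize count() by bin(temperature, 2) // buckets of two degrees
| render columnchart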

So, the editor has some nice charts too. There is even an option to build dashboards on top of this using the ADX Web UI!

Here, I filter on humidity value and enqueue time:
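For example (the filter values are arbitrary):

Telemetry
| where humidity > 60 and IotHubEnqueuedTime > datetime(2022-02-20)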

Filter on only recent messages (up to 50 minutes ago):
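Using the ago() function:

Telemetry
| where IotHubEnqueuedTime > ago(50m) // only the last 50 minutes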

Adding some calculated fields:
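For instance, deriving a Fahrenheit value from the ingested Celsius temperature (an illustrative calculated field, not necessarily the one in the screenshot):

Telemetry
| extend temperatureF = temperature * 9 / 5 + 32
| project deviceId, IotHubEnqueuedTime, temperature, temperatureF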

Render a time chart based on the IoT Hub enqueue time with different values:
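A sketch of such a query, averaging per minute and rendering a time chart:

Telemetry
| summarize avg(temperature), avg(humidity) by bin(IotHubEnqueuedTime, 1m)
| render timechart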

There are many more ways to query your data. And the visualization of incoming data can be done using third-party tools like Power BI, Excel, Grafana, Tableau, Kibana, Redash, Sisense, or solutions supporting ODBC (like Qlik) or JDBC.

If you want to put your dashboards under version control, check out the export and import (replace) options:

Costs. Free?

ADX runs on a cluster of computing resources, probably VMs. This means that, depending on the computing power needed, you need to either scale up or out (or vice versa). For a development environment (as seen in this demo), I used the advised resources:

This resource runs for less than 25 Euro cents per hour for one machine (at the time of writing). Other resources available are:

So, this development machine runs for around 150 euros a month (extra costs regarding storage and data transport can occur).

These are pay-as-you-go prices. You can save from 14% up to 57% if you commit yourself to a reservation, for up to three years.

For a complete pricing indication, check the calculator.

If you are just starting to learn about Azure Data Explorer or doing some experiments, check out this opportunity to roll out a FREE ADX cluster!

This will bring you to your free Azure Data Explorer cluster. More information about the limitations and a feature comparison with a regular cluster can be found here. It will be available for (at least) a year, without any SLA. Unfortunately, this free version lacks support for streaming ingestion at this moment.

Update: The free ADX now supports streaming data too.

If you like your free ADX and find yourself limited, the upgrade to a regular Azure Cluster is coming soon:

Update: This update feature is now available.

Bonus: Explorer tool

Until now, you have seen the browser experience querying the databases.

Check out this Kusto Explorer you can download and install locally!

It has many more features and guidance compared to the (slow) Azure portal query editor:

For instance, you can lock tab pages to a database, so there is less risk in executing a test command on production.

It also lets you query your query history, as a menu option:

.show queries
| where StartedOn > ago(3d)
| order by StartedOn desc
| project User, StartedOn, Text, Database, Duration, State, FailureReason, ClientActivityId

It should even work with Microsoft Fabric KQL Databases.

Conclusion

Azure Data Explorer is a promising and worthy Azure resource inheriting the legacy of Azure Time Series Insights.

At this moment, it lacks the ease of navigating through the vast amounts of incoming data; you have to build your own tooling and queries for that. On the other hand, the feature set is much broader and richer and you are more in control of the data.

The KQL language can help you a lot in that direction. Investing some time to learn this language is always a good idea due to its broad usage in other Azure resources (like Azure Monitor or Application Insights).

You can start for free with the free cluster and practice your ADX skills. If you want to learn more about ADX, start with this free MS Learn collection of ADX-related modules.

Microsoft also offers this free course via Pluralsight (redeem your benefit here). In this four-hour video course, you get good insights regarding data ingestion, querying, visualizing, and monitoring ADX, including the architecture behind it. There are also two extra courses regarding KQL (Kusto Query Language (KQL) from scratch and Microsoft Azure Data Explorer – Advanced KQL). You only need to register an account; no credit card information is needed.

Check out my second blog about this topic where we look at a more real-world solution supporting multiple types of messages for multiple tables using dynamic routing.

This post is the first part of this series:

See also this video:
