Testing Azure IoTHub Manual failover

The Azure IoTHub is the center of all the IoT efforts of Microsoft. Over the last couple of years (or even months) we have seen a lot of innovations from that side.

The latest addition is Manual failover, which is now in preview.

This makes it possible to move a complete IoTHub (including, e.g., all of its devices and routes) to the ‘sister region’. For example, an IoTHub living in West US 2 will move to West Central US. And you can move it back too.

The manual failover is a good starting point for having a more resilient IoTHub. It’s not perfect: there is a chance that unread messages or data may be lost. Failover is hard.

But it’s a perfect way to test the ‘automatic’ failover which Microsoft provides when something happens with the region your IoTHub is living in.

I wanted to test this failover. And I wanted to build a client-side solution so I would not lose any messages.

Let’s see how it can be tested.

I had to perform a number of steps:

  1. Create an IoT Hub in the right, supported region
  2. Create some persistence to check the number of messages lost
  3. Route messages to an Azure Function in the right way
  4. Write a client which queues messages that are not accepted by the IoT Hub
  5. Test the complete solution

And I also watched the video on the Internet Of Things Show.

So let’s start.

Create a supported IoT Hub

I live in Europe so normally I would use West Europe as the region I put resources in.

But at this moment: “Manual failover is currently in public preview and is not available in the following Azure regions: East US, West US, North Europe, West Europe, Brazil South, and South Central US.”

So I created an IoT Hub in West US 2. Its failover location is West Central US.

How to persist messages

I wanted to check if any message would get lost. So I had the idea to send messages with a counter and the messages had to be persisted in SQL Azure.

So I created an SQL Azure Server, a Database, and a Table.

And I wrote a simple query to see if any message was lost. It returns every counter for which the next counter (counter + 1) is not in the table. The most recent counter is always listed too, because its successor has not arrived yet:

select tc1.counter
  from TableCounter tc1
 where 0 =
           (
             select count(tc2.counter)
               from TableCounter tc2
              where tc2.counter = tc1.counter + 1
           )
 order by tc1.counter
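
To make clear what this query reports, here is a minimal C# sketch of the same check done in memory. The class and method names (GapCheck, CountersWithoutSuccessor) are made up for this illustration and are not part of the demo. It also assumes every counter is stored only once; the SQL query would list a duplicated counter once per row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: an in-memory equivalent of the SQL gap query.
public static class GapCheck
{
    // Returns, in ascending order, every counter whose successor
    // (counter + 1) is not present in the input.
    public static List<int> CountersWithoutSuccessor(IEnumerable<int> counters)
    {
        var present = new HashSet<int>(counters);

        return present
            .Where(c => !present.Contains(c + 1))
            .OrderBy(c => c)
            .ToList();
    }
}
```

For the counters 1, 2, 3, 5, 6 and 9 this returns 3, 6 and 9. So 3 and 6 mark the start of a gap (message 4 is missing, and so are 7 and 8), while 9 is simply the latest message received. Note that the result shows where a gap begins, not how many messages are missing in it.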

Then I added an Azure Function which listens to the IoT Hub and persists the messages in SQL Azure. I used the External Table output for that. It is a simple way to write data to a table, but take care not to overwhelm the SQL Server. Otherwise, you will be throttled:

#r "Microsoft.Azure.ApiHub.Sdk"
#r "Newtonsoft.Json"

using System;
using Microsoft.Azure.ApiHub;
using Newtonsoft.Json;

public static async Task Run(string myIoTHubMessage, TraceWriter log, ITable<Telemetry> outputTable)
{
  log.Info($"C# IoT Hub trigger function processed a message: {myIoTHubMessage} at {DateTime.Now}");

  dynamic json = JsonConvert.DeserializeObject(myIoTHubMessage);

  var telemetry = new Telemetry
  {
    counter = json.counter,
    timeStamp = json.timeStamp
  };

  // insert
  await outputTable.CreateEntityAsync(telemetry);

  log.Info($"persisted {json.counter} at {json.timeStamp}");
}

public class Telemetry
{
  public int counter {get; set;}
  public DateTime timeStamp {get; set;}
}

But how are the messages received from the IoT Hub?

Route messages to an Azure Function in the right way

Within Azure, an IoTHub can be referenced like an EventHub. This is what is used when an Azure Function is connected to an IoT Hub directly. But the video already mentioned that these references do not survive the failover. I tried it, and these are the two connection strings I got (note the different namespace in the endpoint):

Endpoint=sb://iothub-ns-manualfail-623176-f126223f6f.servicebus.windows.net/;SharedAccessKeyName=iothubowner;SharedAccessKey=uAP3cwiIg5gZZMOy3sbSk9RTDz/q02GV1407TuV0GIQ=;EntityPath=manualfailover-ih
Endpoint=sb://iothub-ns-manualfail-623176-b479c80cf8.servicebus.windows.net/;SharedAccessKeyName=iothubowner;SharedAccessKey=uAP3cwiIg5gZZMOy3sbSk9RTDz/q02GV1407TuV0GIQ=;EntityPath=manualfailover-ih

The first connection was created before the failover. The second one I had to create afterward to receive the data which was coming in.

So in the end, I used an EventHub in between the IoT Hub and the Azure Function, and I added a route to it within the IoT Hub.

On the cloud side, everything is connected. Messages are coming in, routed to the EventHub, distributed to the Azure Function and persisted in SQL Azure.

Queue messages on the client side which are not accepted by the Azure IoT Hub

The IoT Hub device client is capable of retrying to send a message when it is not accepted by the IoTHub at first. You can configure the policy with the method ‘_deviceClient.SetRetryPolicy(IRetryPolicy retryPolicy)’.

There is a standard implementation (it’s even default behavior) for exponential backoff. This is for individual messages.
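
For readers who are not familiar with exponential backoff, here is a minimal sketch of the idea: the waiting time before each new attempt doubles until it hits a maximum. This is not the code of the Device Client SDK, which has its own policy class and parameters. The class BackoffSketch and its method are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical illustration of exponential backoff; not the SDK's implementation.
public static class BackoffSketch
{
    // Delay before retry number 'attempt' (0-based):
    // min * 2^attempt, capped at max.
    public static TimeSpan DelayFor(int attempt, TimeSpan min, TimeSpan max)
    {
        if (attempt < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(attempt));
        }

        double factor = Math.Pow(2, attempt);
        double milliseconds = Math.Min(min.TotalMilliseconds * factor, max.TotalMilliseconds);

        return TimeSpan.FromMilliseconds(milliseconds);
    }
}
```

With a minimum of 100 milliseconds and a maximum of 10 seconds, the delays are 100, 200, 400 and 800 milliseconds for the first four attempts, and from the eighth attempt on the delay stays at 10 seconds. Real implementations usually add some randomness so that many devices do not retry at the same moment.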

But I wanted to implement my own solution which involves a queue. So in my IoT Device client, I used the ‘NoRetry’ policy, which is also available. Normally, this means that a message which is not accepted by the IoT Hub is lost.

So I came up with this queue solution:

using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
using Microsoft.Azure.Devices.Client;
using Newtonsoft.Json;
 
internal class AzureIoTHub
{
    private const string deviceConnectionString = "[primary or secondary device connectionstring]";
 
    private DeviceClient _deviceClient = null;
 
    public AzureIoTHub()
    {
        CreateClient();
        CreateQueue();
    }
 
    public MessageQueue MessageQueue { get; private set; }
 
    private void CreateClient()
    {
        if (_deviceClient == null)
        {
            // create Azure IoT Hub client from embedded connection string
            _deviceClient = DeviceClient.CreateFromConnectionString(deviceConnectionString, TransportType.Mqtt);
            _deviceClient.SetRetryPolicy(new NoRetry());
        }
    }
 
    private void CreateQueue()
    {
        if (MessageQueue == null)
        {
            MessageQueue = new MessageQueue(_deviceClient);
        }
    }
 
    public void SendDeviceToCloudMessage(Telemetry telemetry)
    {
        var messageString = JsonConvert.SerializeObject(telemetry);
        var message = new Message(Encoding.ASCII.GetBytes(messageString));
        MessageQueue.Enqueue(message);
    }
}
 
internal class MessageQueue
{
    private Queue<Message> _messageQueue;
 
    private Timer _timer;
 
    private DeviceClient _deviceClient;
 
    private int _retryCount = 0;
 
    public MessageQueue(DeviceClient deviceClient)
    {
        _deviceClient = deviceClient;
 
        _messageQueue = new Queue<Message>();
 
        TimerCallback timerDelegate = new TimerCallback(Dequeue);
        _timer = new Timer(timerDelegate, null, 0, 1000);
    }
 
    public void Enqueue(Message message)
    {
        lock (_messageQueue)
        {
            _messageQueue.Enqueue(message);
        }
    }
 
    public void Dequeue(Object state)
    {
        lock (_messageQueue)
        {
            if (_messageQueue.Count == 0)
            {
                return;
            }
 
            var message = _messageQueue.Peek();
 
            try
            {
                _deviceClient.SendEventAsync(message).Wait();
 
                var messageJustSentIsNowDequeued = _messageQueue.Dequeue();
            }
            catch
            {
                // Ignore any type exception; TODO Should make better decision; Should notify something.
 
                _retryCount++;
            }
        }
    }
 
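    public int Count()
    {
        // Queue<T> is not thread-safe; take the same lock as Enqueue and Dequeue.
        lock (_messageQueue)
        {
            return _messageQueue.Count;
        }
    }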
}
 
internal class Telemetry
{
    public int counter { get; set; }
 
    public DateTime timeStamp { get; set; }
}

So we have this AzureIoTHub class which is provided a message (an instance of the Telemetry class).

It does not send the message directly to the IoT Hub. No, it hands over the message to an instance of the MessageQueue class!

And this MessageQueue class puts the message in an actual queue.

Meanwhile, a timer is running every second and tries to send a message. It first ‘peeks’ for the earliest message, it tries to send it, and if it succeeds, the message is removed from the queue. If it fails, the message stays in the queue.
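
The important detail is the order: peek first, send, and only dequeue after the send succeeded. The sketch below shows just that pattern without the Device Client SDK, so it can be tried out in isolation. The class PeekThenDequeueQueue and the ‘send’ delegate are hypothetical names for this illustration; the delegate stands in for the call to the IoT Hub and is assumed to throw an exception when the message is not accepted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, SDK-free stand-in for the MessageQueue class above.
public class PeekThenDequeueQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly object _gate = new object();

    // Assumption: this delegate throws when the item is not accepted.
    private readonly Action<T> _send;

    public PeekThenDequeueQueue(Action<T> send)
    {
        _send = send ?? throw new ArgumentNullException(nameof(send));
    }

    public int Count
    {
        get { lock (_gate) { return _items.Count; } }
    }

    public void Enqueue(T item)
    {
        lock (_gate) { _items.Enqueue(item); }
    }

    // One timer tick: try to send the oldest item.
    // Returns true if it was sent and removed, false otherwise.
    public bool TrySendOldest()
    {
        lock (_gate)
        {
            if (_items.Count == 0)
            {
                return false;
            }

            T oldest = _items.Peek();

            try
            {
                _send(oldest);
            }
            catch (Exception)
            {
                // Not accepted: the item stays at the head of the queue.
                return false;
            }

            // Remove the item only after a successful send.
            _items.Dequeue();
            return true;
        }
    }
}
```

Because the item is only removed after a successful send, a failing IoT Hub never costs a message, and the messages keep their original order. The price is that one message which can never be sent would block all messages behind it.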

Note: this is a possible implementation. It is for demo purposes only. It lacks, e.g., client-side persistence and proper exception handling. Feel free to extend from here.

Add these classes (AzureIoTHub, MessageQueue and Telemetry) to your UWP client app and see how messages are queued once the IoT Hub stops accepting them.

Note: I had to update my NuGet packages. Earlier versions of the Device Client are not resilient to a failover!

Testing the complete solution

So I put my code in a Device Client UWP app and I gave it a device connection string.

public sealed partial class MainPage : Page
{
    private AzureIoTHub _azureIoTHub;
    private int _counter = 0;
    private DispatcherTimer _timer;
 
    public MainPage()
    {
        this.InitializeComponent();
 
        _azureIoTHub = new AzureIoTHub();
 
        _timer = new DispatcherTimer
        {
            Interval = new TimeSpan(0, 0, 30)
        };
        _timer.Tick += _timer_Tick;
    }
 
    private void _timer_Tick(object sender, object e)
    {
        btnSend_Click(null, null);
    }
 
    private void btnSend_Click(object sender, RoutedEventArgs e)
    {
        var count = _azureIoTHub.MessageQueue.Count();
 
        _counter++;
 
        var telemetry = new Telemetry { counter = _counter, timeStamp = DateTime.UtcNow };
 
        _azureIoTHub.SendDeviceToCloudMessage(telemetry);
 
        tbTimer.Text = $"Message {_counter} sent at {DateTime.Now} ({count} messages queued)";
    }
 
    private void btnStart_Click(object sender, RoutedEventArgs e)
    {
        tbTimer.Text = "Timer started";
        _timer.Start();
    }
 
    private void btnStop_Click(object sender, RoutedEventArgs e)
    {
        _timer.Stop();
        tbTimer.Text = "Timer stopped";
    }
}

I have three buttons in my app. One is to send a simple message. The other two are for starting and stopping a timer.

Note: the interval is quite long (30 seconds). This is to prevent throttling in the Azure Function when messages arrive.

I have sent some messages using the Send button:

If I press this button fast enough, the number of messages in the queue grows. But the messages are all sent to the SQL Server.

I then started the timer which sends messages and I performed a failover:

I had to confirm this failover by providing the name of the IoT Hub:

Note: I can only do a failover about twice a day on the same IoT Hub. To test this feature more often, I just created a new IoT Hub and quickly created a device and a route on it. This took me less time than waiting for the next failover opportunity.

So the failover starts:

And the messages are queued:

Then I was notified the failover succeeded:

The locations are now switched:

Note: The failover itself takes something like 5 minutes. But on the client, it can take up to 10 minutes before messages are accepted again.

On the client device, the queue was emptied:

All messages were passed on to the Azure Function:

And the SQL Azure query told us all messages were persisted:

Note: Message number 39 just arrived while switching screens.

Conclusion

We have performed an actual failover! Feels Great 🙂

And with this queueing solution, we have not lost any message. That feels nice too.

But there are a number of things you have to keep track of:

  • Use the right region, one which supports failover (will be fixed later on)
  • Use an EventHub for passing data on to an Azure Function. Do not use ‘internal’ endpoints which are not supported by the failover
  • Use recent Device Client NuGet packages which support a failover (I used Microsoft.Azure.Devices.Client version 1.17.1)
  • If you use the Visual Studio IoT Hub Connected Service, please update all NuGet packages afterward. Check out the minimal target version of your project if you get any exception during updating
  • Write your own retry policy so all messages are queued on the client during the failover

Also, check out a solution for throttling within Azure Functions. In this demo, I relied on the ‘slow’ queue so messages could be written using the ‘External Table’ output. You could also decide to batch all messages once the failover is completed and then use a Stored Procedure.

We are now able to test an actual failover, in case Microsoft decides to do it for us.
