Restarting Storm EventHub Topology on a new cluster
Azure EventHub is a popular, highly scalable data streaming platform. More about Azure EventHub can be found here:
/en-us/azure/event-hubs/event-hubs-what-is-event-hubs
It is very common to use Storm together with EventHub as a platform for downstream event processing.
To do this, a topology uses an EventHub Spout, which receives events from the EventHub. The current implementation of the spout can be found here:
https://github.com/apache/storm/tree/master/external/storm-eventhubs
A closer look at the code shows that it is advisable to run as many spout instances as there are EventHub partitions. This makes full use of the parallelism that Storm provides and lets the topology consume events from all the partitions at the same time.
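To make the parallelism point concrete, here is a minimal Python sketch of how partitions could be spread over spout tasks. It is an illustration only: the function names `assign_partitions` and `assignment_table` are hypothetical, and the round-robin rule is an assumption about how partitions are dealt out, not a copy of the spout's code.

```python
def assign_partitions(task_index, total_tasks, partition_count):
    """Return the partition ids that one spout task would read.

    Assumption: partitions are dealt out round-robin, so task i gets
    partitions i, i + total_tasks, i + 2 * total_tasks, and so on.
    """
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    if not 0 <= task_index < total_tasks:
        raise ValueError("task_index must be in [0, total_tasks)")
    return list(range(task_index, partition_count, total_tasks))


def assignment_table(total_tasks, partition_count):
    """Map every task index to the list of partitions it would own."""
    return {
        task: assign_partitions(task, total_tasks, partition_count)
        for task in range(total_tasks)
    }
```

Under this assumption, 4 tasks on 4 partitions give each task exactly one partition, 2 tasks on 4 partitions make each task read two partitions in turn, and 6 tasks on 4 partitions leave two tasks with nothing to read.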
Each event in a partition is associated with an offset. Once Storm consumes an event from a particular partition, it stores the bookkeeping metadata under a parent zookeeper node (zknode) called /eventhubspout.
The number of child zookeeper nodes will be equal to the number of partitions in the EventHub. This is because, as events are consumed from each partition, the offset of the last consumed event is stored in the zknode that corresponds to that partition.
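The bookkeeping described above can be pictured with a small in-memory model. This is only a sketch: the class `OffsetStore` and its methods are hypothetical, and the flat layout `/eventhubspout/<partition id>` is a simplifying assumption (the real hierarchy may contain intermediate nodes and store more than a bare offset).

```python
ROOT = "/eventhubspout"  # parent zknode named in the text


class OffsetStore:
    """In-memory stand-in for the zknodes that hold per-partition offsets."""

    def __init__(self, partition_count):
        # One child node per partition; None means nothing is stored yet.
        self.nodes = {f"{ROOT}/{p}": None for p in range(partition_count)}

    def checkpoint(self, partition_id, offset):
        """Record the offset of the last event consumed from a partition."""
        self.nodes[f"{ROOT}/{partition_id}"] = str(offset)

    def resume_offset(self, partition_id):
        """Return the stored offset, or None if nothing is stored."""
        return self.nodes[f"{ROOT}/{partition_id}"]
```

A restarted consumer would call `resume_offset` for each of its partitions and continue from the stored value, which is why carrying these nodes over to a new cluster preserves the topology's position.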
With this background in mind, let us imagine an upgrade scenario. An EventHub Storm topology is set up, messages are being consumed seamlessly, and everything is going well. Now an upgrade requirement comes in and you have to move to a newer version of the cluster. However, you do not want to lose track of the events that have been processed to date, and you want to continue from where you left off.
As you may have guessed, the magic bookkeeping lives in the zookeeper nodes. If the zknodes are copied over from the previous cluster to the new one and the Storm topology keeps the same name, it is possible to start consuming events from where you left off.
There is a zkdatatool which does precisely this for you. The jars and scripts can be found here:
https://github.com/hdinsight/hdinsight-storm-examples/tree/master/tools/zkdatatool-1.0
First, log in to one of your zookeeper nodes (the IP address can be found in the hosts section in Ambari) and create a folder in your home directory. Copy zkdatatool-1.0 into that folder, and make sure the jars are copied correctly into the lib folder. This step is common to both clusters: it needs to be executed on your current cluster (where the topology was running) and also on the new cluster you are migrating to.
Now there are three parts to carry out. In all three parts, replace the cluster version 2.5.1.0-56 with the version of your own cluster.
Part 1) Export: This step takes the zknode data and copies it to the local file system. You can validate this by navigating to the local file system and checking that the binary data is the same as in the zknode. The zknode here is named eventhubspout by default. This step needs to be executed on the current cluster, where the Storm topology has been running.
You should now see that a zkdata folder has been created. This folder holds all the EventHub bookkeeping that was exported. Copy this zkdata folder to any one of the zookeeper nodes in the new cluster.
Part 2) Delete: This step needs to be executed on the new cluster you are migrating to. It simply ensures that if a zknode named eventhubspout already exists, it is deleted.
Part 3) Import: This step needs to be executed on the new cluster you are migrating to. Copy the exported data from Part 1 to the local directory of your zookeeper node and run the import step. After this step completes successfully, you should find a zknode named eventhubspout.
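The effect of the three parts can be sketched in Python against in-memory stand-ins for the two clusters. This is not the zkdatatool and does not reproduce its commands or file format: the function names, the dictionary-based tree, and the `zkdata.json` file are all assumptions made for illustration.

```python
import json
import os
import tempfile


def _under(path, root):
    """True if `path` is `root` itself or one of its descendants."""
    return path == root or path.startswith(root + "/")


def export_nodes(tree, root, out_dir):
    """Part 1: write every node under `root` to a local file; return its path."""
    os.makedirs(out_dir, exist_ok=True)
    subtree = {p: v for p, v in tree.items() if _under(p, root)}
    out_path = os.path.join(out_dir, "zkdata.json")  # hypothetical file name
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(subtree, fh, sort_keys=True)
    return out_path


def delete_nodes(tree, root):
    """Part 2: remove `root` and everything below it, if present."""
    for path in [p for p in tree if _under(p, root)]:
        del tree[path]


def import_nodes(tree, in_path):
    """Part 3: load the exported nodes into the target tree."""
    with open(in_path, encoding="utf-8") as fh:
        tree.update(json.load(fh))


def migrate(old_tree, new_tree, root="/eventhubspout"):
    """Run export on the old tree, then delete and import on the new tree."""
    with tempfile.TemporaryDirectory() as work_dir:
        exported = export_nodes(old_tree, root, os.path.join(work_dir, "zkdata"))
        delete_nodes(new_tree, root)
        import_nodes(new_tree, exported)
    return new_tree
```

The delete step matters in this model because an import on its own would leave stale children behind: any node that exists on the new cluster but not in the export would survive next to the migrated offsets.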
After this step, use a zookeeper client to verify that the zookeeper hierarchy has been created and the offsets are populated.
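One way to script this check is to take a snapshot of the hierarchy on each cluster and compare the two. The sketch below is an assumption-laden illustration: `snapshot`, `diff_snapshots` and `FakeClient` are hypothetical names, and the client is any object offering `get_children(path)` and `get(path)` returning `(data, stat)`, which is the shape kazoo's `KazooClient` uses. `FakeClient` is an in-memory stand-in so the sketch runs without a cluster.

```python
_MISSING = object()  # marks a node that is absent from a snapshot


def snapshot(client, path):
    """Recursively collect {node path: data} for `path` and its descendants."""
    data, _stat = client.get(path)
    result = {path: data}
    for child in client.get_children(path):
        result.update(snapshot(client, f"{path.rstrip('/')}/{child}"))
    return result


def diff_snapshots(expected, actual):
    """Return the sorted node paths that are missing, extra, or different."""
    paths = set(expected) | set(actual)
    return sorted(
        p for p in paths
        if expected.get(p, _MISSING) != actual.get(p, _MISSING)
    )


class FakeClient:
    """In-memory stand-in for a zookeeper client (for illustration only)."""

    def __init__(self, nodes):
        self.nodes = nodes  # {absolute path: bytes}

    def get(self, path):
        return self.nodes[path], None  # (data, stat); stat is unused here

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        names = {p[len(prefix):] for p in self.nodes if p.startswith(prefix)}
        return sorted(n for n in names if "/" not in n)


old_cluster = FakeClient({"/eventhubspout": b"", "/eventhubspout/0": b"100"})
new_cluster = FakeClient({"/eventhubspout": b"", "/eventhubspout/0": b"100"})
mismatches = diff_snapshots(
    snapshot(old_cluster, "/eventhubspout"),
    snapshot(new_cluster, "/eventhubspout"),
)
```

An empty `mismatches` list means every node and its data arrived unchanged. Because the walk is recursive, the check does not depend on how many levels sit between /eventhubspout and the per-partition nodes.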
This completes the migration of the bookkeeping information. Start your Storm topology with the same name as before, and it should seamlessly resume consuming events from where you left off!