How is headnode failure handled in HDInsight?

31610895 20 Reputation points
2023-08-11T06:20:42.81+00:00

Hi,

Since HDInsight has a fixed number of head nodes (two), I'm curious how head node failures are handled. Is the failed headnode removed and a new headnode added to the cluster, or does the failed headnode remain in the cluster?

Thanks,

Akshit Mehta

Azure HDInsight
An Azure managed cluster service for open-source analytics.

1 answer

  1. QuantumCache 20,186 Reputation points
    2023-08-11T16:15:13.6733333+00:00

    Hello @31610895, welcome to the Q&A forum.

    HDInsight clusters have two headnodes in active and standby modes, respectively.


    In the event of a headnode failure, the standby headnode takes over as the active headnode. This is achieved with Apache ZooKeeper, a coordination service for distributed applications (I have quoted the relevant documentation in the "More reading" section below).

    ZooKeeper conducts the active headnode election. HDInsight also provisions a few background Java processes, which coordinate the failover procedure for HDInsight HA services: the master failover controller, the slave failover controller, the master-ha-service, and the slave-ha-service.
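
To make the election step a bit more concrete, here is a toy, in-memory sketch of the "lowest sequence number wins" pattern that is commonly used for leader election with ZooKeeper. It does not talk to a real ZooKeeper ensemble and is not HDInsight's actual implementation; the class `ToyElection` and the host names `hn0` and `hn1` are made up for illustration.

```python
class ToyElection:
    """In-memory stand-in for an election based on sequential, ephemeral entries."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # host name -> sequence number

    def join(self, host):
        # A real ZooKeeper client would create an ephemeral sequential znode here.
        self._candidates[host] = self._next_seq
        self._next_seq += 1

    def leave(self, host):
        # Models the ephemeral entry disappearing when the host's session ends.
        self._candidates.pop(host, None)

    def active(self):
        # The candidate holding the lowest sequence number is treated as active.
        if not self._candidates:
            return None
        return min(self._candidates, key=self._candidates.get)


election = ToyElection()
election.join("hn0")
election.join("hn1")
first_active = election.active()   # hn0 joined first, so it is active

election.leave("hn0")              # hn0 fails and its entry disappears
after_failure = election.active()  # hn1 is now the lowest entry

election.join("hn0")               # hn0 comes back with a higher number
after_rejoin = election.active()   # hn1 stays active, hn0 is the standby
```

In this sketch a headnode that returns after a failure rejoins behind the current active one, which matches the behavior described in this answer: it stays in the cluster as the standby.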

    When the standby headnode takes over as the active headnode, it starts all HDInsight HA services on it and stops these services on the other headnode. The failed headnode remains in the cluster, but it is no longer used as an active headnode. Instead, it becomes the standby headnode.

    It's important to note that HDInsight HA services should only run on the active headnode, and will be automatically restarted when necessary. Since individual HA services don't have their own health monitor, failover can't be triggered at the level of the individual service. Failover is ensured at the node level and not at the service level.
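
The difference between a service restart and a node-level failover can be sketched with a small simulation. This is only a toy model under my own assumptions, not HDInsight code: `Headnode`, `Cluster`, `reconcile`, and the service names in `HA_SERVICES` are hypothetical placeholders.

```python
from dataclasses import dataclass, field

# Placeholder names; the real HA services are not listed in this answer.
HA_SERVICES = ("ha-service-1", "ha-service-2")


@dataclass
class Headnode:
    name: str
    healthy: bool = True
    running: set = field(default_factory=set)


class Cluster:
    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.active.running.update(HA_SERVICES)

    def reconcile(self):
        """One pass of a hypothetical controller loop."""
        if not self.active.healthy and self.standby.healthy:
            # Node-level failover: the roles swap and the failed node
            # stays in the cluster as the standby.
            self.active, self.standby = self.standby, self.active
        # HA services must not run on the standby headnode.
        self.standby.running.clear()
        if self.active.healthy:
            # Restart anything that is missing on the active headnode.
            self.active.running.update(HA_SERVICES)


hn0, hn1 = Headnode("hn0"), Headnode("hn1")
cluster = Cluster(active=hn0, standby=hn1)

hn0.running.discard("ha-service-1")  # a single service stops...
cluster.reconcile()                  # ...and is restarted; no failover
active_after_crash = cluster.active.name

hn0.healthy = False                  # the whole node fails
cluster.reconcile()                  # now the roles swap
active_after_node_failure = cluster.active.name
```

Note that a single stopped service only leads to a restart on the same node, while losing the node itself swaps the active and standby roles.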

    More reading:

    High availability infrastructure

    Architecture

    Each HDInsight cluster has two headnodes in active and standby modes, respectively. The HDInsight HA services run on headnodes only. These services should always be running on the active headnode, and stopped and put in maintenance mode on the standby headnode.

    To maintain the correct states of HA services and provide a fast failover, HDInsight utilizes Apache ZooKeeper, which is a coordination service for distributed applications, to conduct active headnode election. HDInsight also provisions a few background Java processes, which coordinate the failover procedure for HDInsight HA services. These services are the following: the master failover controller, the slave failover controller, the master-ha-service, and the slave-ha-service.

    If the response is helpful, please click "Accept Answer" and click "Yes" so that we can close this thread.

    2 people found this answer helpful.