Role Instance Restarts Due to OS Upgrades

09/19/2012

Update March 7, 2013

Added to the Q&A section --- Q: How long will the upgrade take? How long will my VM be down?

Update October 17, 2014

Added information about Guest Agent updates. Thanks to my colleague Anurag Sharma for this idea.

------------

Roughly once per month Microsoft releases a new Guest OS version for Windows Azure PaaS VMs. The exact schedule varies, and the historical trend can be seen at https://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx. During this rollout the Windows Azure Fabric Controller makes two passes through all of the datacenters, one for the Host OS and one for the Guest OS. There is also a periodic update of the Azure guest agent that runs inside your VM.

Host OS. The first pass upgrades the Host OS. Upgrading a host reboots the instances running on it, and the fabric controller ensures that only instances from one upgrade domain at a time are rebooted. During this reboot, your role instances will go through the standard shutdown process and the RoleEnvironment.OnStop event will be raised to give you a chance to gracefully shut down the instance. It can take several days for the fabric to coordinate the Host OS upgrade across all of the different hosted services and upgrade domains within a datacenter. It is not uncommon for different instances of your deployment to be updated several hours apart from each other.
Guest OS. Once the Host OS has finished upgrading across the datacenter, the Guest OS is upgraded for services that are configured to use automatic Guest OS versions. This upgrade proceeds using the standard upgrade domain rules for your service. Your VM will be rebooted and the Windows Partition (the D drive) will be reimaged with the upgraded OS. The Guest OS update is much faster than the Host OS update because the fabric only has to coordinate the update within your hosted service and your upgrade domains. Its duration for your service largely depends on how many instances you have, how many upgrade domains you have, and how long your service takes to shut down (Stopping/OnStop events) and start up (startup tasks and OnStart event).
Guest Agent. The Azure guest agent is updated on a roughly monthly basis. When the guest agent is updated, the host process running your role (typically WaWorkerHost or WaWebHost) is gracefully shut down, then the guest agent updates itself, then the host process starts again. See https://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the guest agent process and how it interacts with your service.

Mark Russinovich has a great blog post which describes the Host OS upgrade process - https://blogs.technet.com/b/markrussinovich/archive/2012/08/22/3515679.aspx.

Note that this article is focused on PaaS scenarios, but the Host OS update process applies to IaaS Persistent VMs as well. For more information about IaaS VM restarts see https://blogs.msdn.com/b/windows_azure_technical_support_wats_team/archive/2013/11/27/windows-azure-iaas-host-os-update-demystified.aspx.

Impact to Your Service

As long as each of your roles has 2 or more instances, your service will not experience downtime, because the fabric only updates one upgrade domain at a time. The blog at https://blog.toddysm.com/2010/04/upgrade-domains-and-fault-domains-in-windows-azure.html has a great explanation of upgrade and fault domains, and why having 2 instances of a role is required to meet the 99.95% uptime SLA.
Approximately every month, expect your instances to reboot once for the Host OS update. If you have automatic guest OS updates, expect your instances to reboot again. These reboots are typically several hours apart, but this time frame can change depending on the makeup of different services within a datacenter.
Your role needs to adhere to the rules around host OS updates. In particular, instances should reach the Ready state within 30 minutes of starting the Startup tasks. For more information about this limitation see https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-update-azure-service#how-an-upgrade-proceeds.
Your role instances should be able to handle a Reboot, a Reimage, and a Recycle. The Host OS upgrade will cause a Reboot of your instance, and the Guest OS upgrade will cause the equivalent of a Reimage of your instance. See the common issues below for more information.

Common Issues

See https://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the processes which are running and the location of log files which can be used to troubleshoot.

The most common problem is roles not reaching the Ready state after the OS upgrades. The most common root cause for this problem is a startup task or code in the OnStart or Run function not running correctly. There are 2 common categories of this root cause:
1. A failure of the code to run twice. The Host OS reboot causes your startup tasks to run again. If a startup task executes a command that returns an error when run a second time (e.g. using ‘appcmd set config’ to add a section fails the second time with the error “New add object missing required attributes. Cannot add duplicate collection entry of type…”), then the startup task will fail and cause your role instance to begin recycling. To troubleshoot this type of failure, RDP into the VM, look in the Event Logs for errors, and look in WaHostBootstrapper.log for startup task failures. During your normal development and testing process you should proactively initiate a Reboot of your role instances from the Windows Azure Management portal to make sure that your service works correctly in this scenario. A common fix for startup task failures is to add 'exit /b 0' to the end of your startup task. See https://msdn.microsoft.com/en-us/library/windowsazure/hh124132.aspx for more information on why this is needed.
2. A failure of the code to run after the Windows partition is reimaged. During the Guest OS portion of the update, the Windows Partition is reimaged. The Windows Partition is typically where program installations and registry changes are stored, and during the reimage those changes will be lost. If the startup code assumes that the change still exists (e.g. the startup task makes a registry change and then stores a record of that change on the C: or E: drive so that the code isn’t run twice), then the role instance may fail to work properly. During your normal development and testing process you should proactively initiate a Reimage of your role instances from the Windows Azure Management portal to make sure that your service works correctly in this scenario.
If your startup code takes longer than 30 minutes to complete then you may have multiple role instances taken out of service at the same time. This is most common when a startup task installs a program or feature, downloads cache data, or downloads website information. See the Host OS update rules in the ‘Impact to Your Service’ section above for more information about this.
Occasionally the Windows Azure Platform will fail to restart the host or guest OS after an update. Overall this is a rare scenario and the platform is constantly improving to eliminate these types of failures. If you are in this scenario then your symptoms will typically be a ‘Waiting for Host’ message in the portal that does not change after at least 30 minutes, and the inability to RDP into the role instance. In this scenario there is little you can do short of deleting the deployment to recover this instance. If you open a support incident (https://www.windowsazure.com/en-us/support/contact/) the support team can manually recover that instance. Note: If you are able to RDP into the role instance then the problem is almost always due to a failure in the startup code as described in common issue #1 above.
During the OS upgrades one or more of your instances will be unavailable at any given time, which reduces the capacity of your service. For example, suppose you have 2 instances of a web role and both instances typically run at 75% CPU. During the OS upgrade one instance will be rebooted, so all traffic will be directed to the remaining instance. That load exceeds the capacity of a single instance, and your service availability will be impacted. You should ensure that your service has sufficient excess capacity to absorb a fraction X of the instances being unavailable, where X is 1/<number of upgrade domains> (e.g. with 2 upgrade domains you will lose 50% of your capacity, and with 5 upgrade domains you will lose 20% of your capacity).
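The 1/<number of upgrade domains> rule can be turned into a quick headroom check. The following is only a sketch: the class and method names are made up for illustration, and it assumes that instances are spread evenly across upgrade domains and that the load balancer spreads traffic evenly across the instances that remain.

```csharp
using System;

// Hypothetical helper for illustration; not part of any Azure SDK.
public static class UpgradeCapacity
{
    // Fraction of instances that are offline while one upgrade domain is updated.
    public static double FractionUnavailable(int upgradeDomains)
    {
        if (upgradeDomains < 1)
            throw new ArgumentOutOfRangeException("upgradeDomains");
        return 1.0 / upgradeDomains;
    }

    // Utilization (0..1 scale) of the remaining instances while one upgrade
    // domain is offline, given the normal per-instance utilization.
    // A result above 1.0 means the remaining instances are overloaded.
    public static double UtilizationDuringUpgrade(double normalUtilization, int upgradeDomains)
    {
        if (upgradeDomains < 2)
            throw new ArgumentOutOfRangeException("upgradeDomains",
                "With a single upgrade domain every instance is offline at once.");
        return normalUtilization * upgradeDomains / (upgradeDomains - 1);
    }
}
```

For the example above, UtilizationDuringUpgrade(0.75, 2) returns 1.5, meaning the single remaining instance would need 150% of its capacity. With 5 upgrade domains the same call returns 0.9375, which fits but leaves very little margin.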
If your website takes several minutes to warm up (either standard IIS/ASP.NET warmup of precompilation and module loading, or warming up a cache or other app-specific tasks) then your clients may experience an outage or random timeouts. After a role instance restarts and your OnStart code completes, the instance is put back in the load balancer rotation and begins receiving incoming requests. If your website is still warming up then all of those incoming requests will queue up and time out. If you only have 2 instances of your web role then IN_0, which is still warming up, will be taking 100% of the incoming requests while IN_1 is being restarted for the Guest OS update. This can lead to a complete outage of your service until your website is finished warming up on both instances. It is recommended to keep your instance in OnStart, which keeps it in the Busy state where it won't receive incoming requests from the load balancer, until your warmup is complete. You can use the following code to accomplish this:

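  using System.Net;
  using System.Net.Sockets;
  using Microsoft.WindowsAzure.ServiceRuntime;

  public class WebRole : RoleEntryPoint {
    public override bool OnStart() {
      // For information on handling configuration changes
      // see the MSDN topic at https://go.microsoft.com/fwlink/?LinkId=166357.

      // Find this instance's IPv4 address.
      string ip = null;
      IPHostEntry ipEntry = Dns.GetHostEntry(Dns.GetHostName());
      foreach (IPAddress ipaddress in ipEntry.AddressList) {
        if (ipaddress.AddressFamily == AddressFamily.InterNetwork) {
          ip = ipaddress.ToString();
        }
      }

      if (ip != null) {
        // Request the local site so it warms up while the instance is still Busy.
        // Use the scheme and port of your own endpoint. An HTTPS request made
        // by IP address will typically fail certificate name validation.
        string urlToPing = "http://" + ip;
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToPing);
        using (WebResponse resp = req.GetResponse()) { }
      }
      return base.OnStart();
    }
  }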

Detection and Notification

Notification

At this time the Windows Azure platform does not offer proactive notifications when an OS upgrade is happening. The Windows Azure development team is working on this functionality so that service administrators can better plan for upgrades and possible service impact. Your role instances will receive a RoleEnvironment.Stopping event prior to being shut down and you can use that event to gracefully terminate any work that the role instance is doing or notify an administrator that an instance is shutting down.
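As a rough illustration of using the Stopping event for notification, the sketch below subscribes in OnStart and writes a trace message when the instance is about to shut down. It assumes the Microsoft.WindowsAzure.ServiceRuntime assembly from the Azure SDK is referenced. What you do inside the handler (email, queue message, and so on) is up to you; the trace call here is only a stand-in.

```csharp
using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Raised before the instance is shut down, including for OS upgrades.
        RoleEnvironment.Stopping += OnStopping;
        return base.OnStart();
    }

    private static void OnStopping(object sender, RoleEnvironmentStoppingEventArgs e)
    {
        // Stand-in for your own notification logic (assumption: tracing is
        // configured so that this message reaches somewhere you monitor).
        Trace.TraceWarning("Role instance {0} is stopping.",
            RoleEnvironment.CurrentRoleInstance.Id);
    }
}
```

Note that the event does not say why the instance is stopping, so the same handler also runs for deployments, manual reboots, and scale-down operations.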

In the meantime you can subscribe to the Windows Azure OS Updates RSS feed at https://sxp.microsoft.com/feeds/3.0/msdntn/WindowsAzureOSUpdates. This feed should be updated the same day that the OS updates start being rolled out to the datacenter. This typically does not give advance notification, but it does help identify when the updates are happening. As noted above in the Host OS and Guest OS descriptions, the update process can take several days to complete, so there may be one or more days between when the RSS feed is updated and when your hosted service begins updating.

The Guest OS list at https://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx and the OS version selection dropdown in the management portal are typically updated after the Guest OS rollout has completed, so you should not use the latest entry in these lists as an indication of when the OS updates are in progress.

Detection

At this time there is no direct way to detect a Host OS upgrade, but you can see the evidence of the reboot within the logs on the VM:

- Search System event logs for event source USER32, event ID 1074, with message “The process D:\Packages\GuestAgent\WaAppAgent.exe (RD00155D50206D) has initiated the shutdown of computer RD00155D50206D on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown”. This indicates that the Windows Azure fabric’s guest agent (WaAppAgent.exe) initiated a shutdown of the VM.
- Look in the AppAgentRuntime.log.old files for a message saying “Context_Start” with a Context=“StopContainer()”
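If you would rather search the System event log from code than by hand, something along these lines should work when run on the role instance. It is a sketch that assumes the .NET Framework's System.Diagnostics.EventLog class and an account that is allowed to read the System log. The class name is made up, and the match on "WaAppAgent.exe" simply mirrors the message text quoted above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical console helper; run it on the role instance itself.
public static class ShutdownEventFinder
{
    public static void Main()
    {
        using (var systemLog = new EventLog("System"))
        {
            foreach (EventLogEntry entry in systemLog.Entries)
            {
                // The low 16 bits of InstanceId are the event ID shown in Event Viewer.
                bool isShutdownEvent =
                    string.Equals(entry.Source, "USER32", StringComparison.OrdinalIgnoreCase)
                    && (entry.InstanceId & 0xFFFF) == 1074;

                // Keep only shutdowns initiated by the Azure guest agent.
                if (isShutdownEvent && entry.Message.Contains("WaAppAgent.exe"))
                {
                    Console.WriteLine("{0:u} {1}", entry.TimeGenerated, entry.Message);
                }
            }
        }
    }
}
```

Each line printed is one guest-agent-initiated shutdown, which you can line up against the time of an unexpected restart.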

Frequently Asked Questions

Q: How can I opt out of the OS updates?

A: You cannot opt out of the Host OS updates because Microsoft must maintain updated and patched host OSes within the datacenter. You can opt out of the Guest OS update by specifying a fixed version of the Guest OS, but note that your service will then no longer receive security patches and may be vulnerable. See https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-how-to-configure-portal#manage-guest-os-version.

Q: How do I force the reboots to be done only during non-business hours?

A: There is no way to control when an individual instance or service will be upgraded for the Host OS. The upgrade is started on all Azure datacenters across the world at approximately the same time, and the fabric works continuously on upgrading each datacenter. This process takes several days due to the complexity of making sure upgrade domain rules are followed for all cloud services, and there is no way to control or determine when a specific instance will be impacted. To control the Guest OS update you can specify a fixed Guest OS version and then update it whenever you are ready.

Q: I installed something on the VM and now the VM has rebooted and the software I installed is gone, why?

A: Connecting to an Azure PaaS VM via RDP and making changes or installing software is unsupported. At any point in time the VM may be completely rebuilt and any changes you make will be lost. This can happen if the hardware fails and we have to start up a new VM on new hardware. It also happens during the Guest OS update when the Windows Partition is rebuilt. If you need to install software or make changes to the VM you must create a startup task and do the work from there. This ensures that your configuration is applied again whenever the VM is recreated.

Q: Can one of the updates in the new Guest OS version break my service?

A: The updates that are installed in the new Guest OS version are publicly available, thoroughly tested hotfixes that are also being deployed to servers around the world via Windows Update, so the chance of negative impact to your service is extremely small. However, the root of the question goes back to how you manage OS patches in your on-premises services: do you install directly on the production servers and assume it will work, or do you have a staging environment where you test the patches first? You can follow the same pattern in Azure. If you want to have a staging environment to test patches prior to production then you should configure your production service to use a fixed version OS string in the .cscfg file. Then when a new guest OS is available you can deploy your service into the staging slot using the newest guest OS version. After you have validated that the service works correctly on the latest guest OS you can then either do a VIP swap, or do an in-place upgrade of your production service to use the latest OS.

Q: How long will the upgrade take? How long will my VM be down?

A: There is a common misconception that the more patches being applied, the longer the update will take. This is based on the belief that the upgrade works like a Windows Update upgrade on your local desktop machine, where a bunch of patches are copied to Windows and installed with subsequent reboots, but this is not how upgrading works in Azure.

When a new OS version is being released in Azure, the OS team takes the latest image, applies the patches, and then saves a new VHD with this new base image. This base image is then copied to a repository in Azure. When the fabric is instructed to do an OS upgrade it first makes a copy pass where it copies this new base image VHD to the hard disks on each server in the datacenter that is going to be upgraded. Once this copy process is finished the fabric begins the upgrade process, following the normal upgrade domain rules. When a guest is going to be updated the fabric does a graceful shutdown of the OS and then starts a new VM using the new base image. The time it takes to upgrade a given VM for a Guest OS is roughly the time it takes to do a graceful Windows shutdown plus the time it takes to start Windows.

The timing for a Host OS update is a little different. When a Host is being upgraded it first sends the shutdown message to each Guest OS running on that Host. Each Guest OS is then given the standard OnStop and Windows Shutdown time to finish shutting down. Once every Guest OS is shut down, the Host OS does a graceful shutdown and goes through its normal shutdown procedure. Once the Host OS is shut down, the Host is rebooted using the new OS image. Once the Host is up and running it starts each of the Guest OSes. Typically this Host OS update process takes 15 to 20 minutes, but it can vary depending on how many other Guests are on that Host and how long they take.

Having said that, there will always be exceptions if there is a failure on a particular node and the Azure fabric determines that the Guests on that node need to be moved to a different node.

Q: How do I gracefully handle the OS shutdown?

A: When the OS is being updated the Azure Fabric will perform a graceful shutdown of your role instance. This means that your ASP.NET code will receive the Application_End event, and the Azure service runtime will raise the Stopping and OnStop events. Your code will have 5 minutes to finish cleanup work in OnStop before the process is shut down. After your Azure host process is shut down, Windows will go through a normal graceful shutdown, including raising the standard OnStop and related events for Windows Services. For more information about gracefully handling a shutdown of your instance see https://azure.microsoft.com/en-us/blog/the-right-way-to-handle-azure-onstop-events/, https://msdn.microsoft.com/en-us/library/hh180152.aspx and https://msdn.microsoft.com/en-us/library/windowsazure/microsoft.windowsazure.serviceruntime.roleentrypoint.onstop.aspx.
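One possible shape for a worker role that honors the 5 minute OnStop window is sketched below. It assumes the Microsoft.WindowsAzure.ServiceRuntime assembly and .NET 4 or later. ProcessOneUnitOfWork is a hypothetical placeholder for your own work, and the 4 minute wait is just an example value chosen to stay under the limit.

```csharp
using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    private readonly CancellationTokenSource stopRequested = new CancellationTokenSource();
    private readonly ManualResetEventSlim runFinished = new ManualResetEventSlim(false);

    public override void Run()
    {
        try
        {
            // Stop picking up new work as soon as OnStop has been called.
            while (!stopRequested.IsCancellationRequested)
            {
                ProcessOneUnitOfWork();
            }
        }
        finally
        {
            runFinished.Set();
        }
    }

    public override void OnStop()
    {
        // Ask Run to finish, then wait for the current unit of work to complete.
        // The wait is capped below the 5 minute OnStop window.
        stopRequested.Cancel();
        runFinished.Wait(TimeSpan.FromMinutes(4));
        base.OnStop();
    }

    // Hypothetical placeholder; replace with your real work item processing.
    private void ProcessOneUnitOfWork()
    {
        Thread.Sleep(TimeSpan.FromSeconds(1));
    }
}
```

The idea is that each unit of work is short enough to finish inside the window. If a single unit can run longer than that, it needs to check the cancellation token itself and save its progress somewhere durable.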

Comments

Anonymous
September 20, 2012
This explains a lot of my recent headaches with our Azure web roles in an "infinite initialization" state. Thanks for the information - it is very helpful!
Anonymous
September 20, 2012
Thanks for detailed information :)
Anonymous
September 23, 2012
We have had this issue where roles fail after the OS updates. Reimaging always fixes it. Thanks, Matt Watson http://www.stackify.com
Anonymous
September 23, 2012
Matt, I would encourage you to open a support incident at www.windowsazure.com/.../contact next time this happens and the team can help you investigate why your role fails to start. The root cause is typically pretty easy to find and the fix is usually easy to implement, and this will make your service much more robust. Kevin
Anonymous
October 10, 2012
Do you have any idea why two reboots are necessary? Once the host has been rebooted why not immediately reimage all the VMs inside it and let them start? What's the need for the second reboot?
Anonymous
March 07, 2013
Dmitry, the 2 upgrade pass has been around since Azure started and I am not positive of the reasoning behind this design decision. My best guess is to try to isolate the Host OS upgrade in order to make it faster and get through the datacenter as quickly as possible. During the host OS upgrade of any specific server the fabric waits for a maximum of 15 minutes for each guest on that host to report Ready before it is able to move to the next upgrade domain for that service. During a Host OS update the Windows partition on the guest OS is preserved which can shorten the startup time for the hosted service running in that guest OS. During a Guest OS update the Windows partition is wiped out which means startup tasks that do installations will have to run again which will increase the amount of time it takes to get to the Ready state. See blogs.msdn.com/.../windows-azure-disk-partition-preservation.aspx for more info on the disk preservation scenarios.
Anonymous
November 18, 2013
There is an intermittent issue with the certificate path for our SSL web service that occurs at certain times, I am assuming, either when our cloud service on Azure reboots or is moved. This occurred on November 18, 2013, and previously on or about September 27, 2013. Using SSL Checker at www.sslshopper.com/ssl-checker.html it reports that the certificate is not trusted in all web browsers. When I add our domain to IIS site binding, the issue is resolved. Sometime after the September occurrence, I later removed the site binding setting (as it is a real issue that prevents using staging for testing) and we had no issues until last night, Nov 18. Again I had to add the site binding to resolve the issue (at 08:45 UTC). I have now removed the site binding setting at 13:30 UTC and the issue remains resolved. The real problem is that before I changed the site binding setting, requests to our web service could not be made. Salesforce.com only allows Apex callouts for GET and POST requests to SSL web services only for certain specified root certificates and only when the certificate path can be determined correctly by Salesforce. Callouts will result in a PKIX path building failed error when the path can not be determined. After adding the domain to IIS site binding, Salesforce has no problem. This appears then to be a Windows Azure issue where the certificate paths are not re-established promptly when certain changes are made to the server instance. It seems that having multiple role instances would not avoid this issue as our web service works, using soapUI for testing, but the certificate path for Salesforce is still not correct.
Anonymous
June 01, 2017
It's an old post but so, so helpful. Do you have a current link for this section: Your role needs to adhere to the rules around host OS updates, in particular instances should reach the Ready state within 30 minutes of starting the Startup tasks. For more information about this limitation see http://msdn.microsoft.com/en-us/library/hh543978. The link here seems not to take me to anything about 30 minutes.
Anonymous
December 05, 2017
Hi Don. Updated the 30 minute timeout link to https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-update-azure-service#how-an-upgrade-proceeds, thanks for the feedback.
