ARR Health Check sends too many requests too soon, defying the configured interval
This is an interesting one, so I wanted to blog about it. I had an Enterprise customer whose ARR server kept the web servers very busy playing health-check ping-pong by the second, even though the ARR health-check interval was configured for 30 seconds. As a result, the web servers were overwhelmed at times, and users were experiencing slowness.
If you are new to ARR Health Check, here is a quick primer. In Application Request Routing (ARR) you configure health settings by assigning a test page and setting the properties for URL testing or live traffic testing. The URL test can detect both that a server has become unhealthy and that it has become healthy again. The Live Traffic test leverages live requests: based on configurable conditions, it can mark a server as unhealthy. However, it cannot determine that an unhealthy server has become healthy again, because ARR does not forward live requests to servers that are currently unhealthy. Events are raised when either health test detects a change in server health. Properly planning health checks is important with any load balancer: they check the state of your servers so that a failed server is automatically taken out of rotation and then added back once it has recovered.
This particular customer had a server farm with two servers, and the health test was configured to make a request every 30 seconds. But when we pulled the request log from one of the servers, we found a bizarre pattern: every 30 seconds there was a burst of 15 health-check pings to the server, all at once.
2017-04-25 11:20:26 www.myLoadBalancedSite.net 200 0 0 270 90 0
.. …. …. ….
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0 ------------> 30 sec later from previous line
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:56 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:55 www.myLoadBalancedSite.net 200 0 0 270 90 0
2017-04-25 11:20:56 www.myLoadBalancedSite.net 200 0 0 270 90 0
… ….. …. …..
2017-04-25 11:21:25 www.myLoadBalancedSite.net 200 0 0 270 90 0 ------------->30 sec later from previous line
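To confirm a pattern like this without counting lines by hand, a small script can group the log entries by timestamp. The following is only a sketch: it assumes each log line begins with a date and a time field as in the excerpt above, and the function names, the one-second gap, and the threshold of 5 are illustrative choices, not anything ARR or IIS provides.

```python
from collections import Counter
from datetime import datetime


def count_requests_per_second(log_lines):
    """Count entries per timestamp, reading only the leading date and time fields."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip elided or malformed lines
        counts[stamp] += 1
    return counts


def find_bursts(counts, max_gap_seconds=1, threshold=5):
    """Merge timestamps at most max_gap_seconds apart and return
    (first_timestamp, total_requests) for groups with at least `threshold` requests."""
    bursts = []
    group_start, group_total, previous = None, 0, None
    for stamp in sorted(counts):
        if previous is not None and (stamp - previous).total_seconds() <= max_gap_seconds:
            group_total += counts[stamp]
        else:
            if group_start is not None and group_total >= threshold:
                bursts.append((group_start, group_total))
            group_start, group_total = stamp, counts[stamp]
        previous = stamp
    if group_start is not None and group_total >= threshold:
        bursts.append((group_start, group_total))
    return bursts
```

Because entries at 11:20:55 and 11:20:56 are one second apart, they are merged into a single group, so the excerpt above is reported as one burst of 15 requests starting at 11:20:55.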
Why the burst? Because ARR Health Check uses every w3wp.exe process on the ARR server to ping the health-check URL. The more w3wp.exe processes there are on the ARR server, the more pings each health-check cycle generates. In this case the customer had 15 w3wp.exe processes on the ARR server, so each web server received 15 pings in every 30-second interval.
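The arithmetic is simple enough to sketch. This assumes, as described above, that every worker process sends one ping per configured interval; the function name is hypothetical and only illustrates the calculation.

```python
def health_probe_load(worker_processes, interval_seconds):
    """Return (pings per interval, pings per minute) seen by one web server,
    assuming each ARR worker process pings once per configured interval."""
    per_interval = worker_processes
    per_minute = worker_processes * 60 / interval_seconds
    return per_interval, per_minute


# The customer's setup: 15 worker processes, 30-second interval.
customer = health_probe_load(15, 30)
# What a single worker process would produce with the same interval.
single = health_probe_load(1, 30)
```

For the customer's configuration this works out to 15 pings per interval, or 30 per minute, against 2 per minute for a single worker process.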
There are two possible solutions here:
1. Reduce the number of w3wp.exe processes running on the ARR server, that is, scale back the number of app pools or worker processes. (This is what my customer did.)
2. Increase the ping interval and schedule the w3wp.exe processes to start individually, some time apart. (This sounds good in theory, but I haven't tested it.) For example, if you configure the app pools to start 30 seconds apart and set the ping interval to 7.5 minutes, the pings should spread out to roughly one every 30 seconds:
App pool 1 starts at 00:00:00 and sends its pings at 00:00:00, 00:07:30, 00:15:00, …
App pool 2 starts at 00:00:30 and sends its pings at 00:00:30, 00:08:00, 00:15:30, …
App pool 3 starts at 00:01:00 and sends its pings at 00:01:00, 00:08:30, 00:16:00, …
…
App pool 15 starts at 00:07:00 and sends its pings at 00:07:00, 00:14:30, 00:22:00, …
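The staggered schedule can be generated rather than written out by hand. The sketch below is untested against ARR itself, like the idea it illustrates: it assumes each app pool sends its first ping the moment it starts and then repeats at a fixed interval, and it picks the interval as pool count times stagger (15 × 30 seconds = 7.5 minutes). The function names are hypothetical.

```python
from datetime import timedelta


def staggered_schedule(pool_count, stagger_seconds, rounds=3):
    """Return (interval, schedule), where schedule[i] lists the first `rounds`
    ping times of pool i + 1 as offsets from the first pool's start."""
    # Choosing interval = pool_count * stagger keeps the merged pings evenly spaced.
    interval = timedelta(seconds=pool_count * stagger_seconds)
    schedule = []
    for index in range(pool_count):
        start = timedelta(seconds=index * stagger_seconds)
        schedule.append([start + k * interval for k in range(rounds)])
    return interval, schedule


def fmt(offset):
    """Format a timedelta offset as HH:MM:SS."""
    total = int(offset.total_seconds())
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


if __name__ == "__main__":
    interval, schedule = staggered_schedule(pool_count=15, stagger_seconds=30)
    for number, times in enumerate(schedule, start=1):
        print(f"App pool {number}: " + ", ".join(fmt(t) for t in times))
```

With 15 pools started 30 seconds apart, the last pool starts at 00:07:00, and merging all pools' ping times gives one ping every 30 seconds instead of 15 at once.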