Bing Outage Exposes Testing in Production (TiP) Hazards
I have been a big proponent of shipping services into production more frequently and conducting more testing on the code once it is in production. Those of us who support this methodology tend to call it TiP (Testing in Production). Find links to previous blog posts on this subject below.
After the recent Bing outage (the evening of 12/3/2009), I find myself thinking about the hazards of TiP, so I thought I might post some lessons I have drawn from this production outage and what has been written about it so far. ZDNet posted a somewhat sarcastic blog entry titled "Microsoft is making progress on search: You noticed Bing's glitch." According to the official blog post by the Bing team (here), "The cause of the outage was a configuration change during some internal testing that had unfortunate and unintended consequences."
Internal testing that had unfortunate and unintended consequences
Despite this black mark, I still believe TiP is the right direction for services testing, but clearly there are some hazards and lessons we can extrapolate.
These two posts imply that the outage was widespread, noticed by a lot of individuals, and caused by an errant configuration change in support of a test. My assessment is that while there was clearly an attempt to run a test configuration in production, the test itself did not cause the outage. The problem arose when the test configuration change somehow went to all of production.
The core concept of TiP is to minimize risk while testing in production. To accept the risk of bringing less stable code into production in order to run tests, that code must be easily sandboxed. Whatever happened here was likely a configuration management mistake, not a testing error.
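To make the sandboxing idea concrete, here is a minimal sketch of how a deployment pipeline could refuse to let an experimental configuration change escape its test scope. The ConfigChange type, the apply_change/push_to_scope functions, and the "test-ring-01" scope names are all hypothetical; they are not how Bing's system works, just one way to express the guardrail.

```python
# Sketch: keep experimental config changes confined to sandboxed test rings.
# All names here are hypothetical illustrations, not a real deployment API.

from dataclasses import dataclass

PRODUCTION_SCOPE = "prod-all"
TEST_SCOPES = {"test-ring-01", "test-ring-02"}  # sandboxed slices of production

@dataclass
class ConfigChange:
    key: str
    value: str
    scope: str           # which machines/users the change targets
    is_experiment: bool  # flagged by the tester submitting the change

def push_to_scope(change: ConfigChange) -> None:
    """Stand-in for the real deployment system's push call."""
    print(f"Applying {change.key}={change.value} to {change.scope}")

def apply_change(change: ConfigChange) -> None:
    """Refuse to let an experimental change escape its sandbox."""
    if change.is_experiment and change.scope not in TEST_SCOPES:
        raise ValueError(
            f"Experimental change to '{change.key}' targets scope "
            f"'{change.scope}', which is not a sandboxed test ring."
        )
    push_to_scope(change)

# A test change aimed at a sandbox ring ships; one aimed at all of production is rejected.
apply_change(ConfigChange("ranker.variant", "B", "test-ring-01", True))
# apply_change(ConfigChange("ranker.variant", "B", PRODUCTION_SCOPE, True))  # raises ValueError
```

The point of the check is that a human mistake in the scope field gets caught by the tooling before it ships, which is exactly the kind of safety mechanism that appears to have been missing or bypassed here.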
I have an axiom for my operations team, and that is that all manual processes will eventually fail, unless they don’t.
I like Bing, but half-an-hour downtime is unacceptable these days. Do you guys not have failover systems?
Comment on Bing Blog by b.dylan.walker
The reality is that the Bing system is highly automated. The team has shared some information about their infrastructure, so I won't go into details here lest I share something not disclosed. An outage like this, where a test configuration change impacted production, is clearly a case of fast-moving automation.
In order to enable TiP and take more risk into production, the change management system of a service must be rock solid and fully automated. Clearly, from what has been shared, they have a state-of-the-art system. In fact, it is likely this state-of-the-art system that allowed the errant change to propagate so quickly and require a full rollback.
Therefore the gap must be in the safety mechanisms meant to prevent such a mistake, combined with how fast the mistake rolled out to all environments. Another factor in successful TiP is metering of change in production. This change simply moved too fast, and while the Bing system is highly automated, it still takes a long time to undo a change across so many servers.
My takeaway from this outage is that TiP does work, but you need solid change management:
1. A fully automated deployment system
2. Rock solid controls on change management approval
3. Every change must be a metered change, so when a mistake does happen it doesn't affect every server in production (see the sketch after this list).
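Here is a minimal sketch of what metered change means in practice: ship to a small batch of servers, let the change bake, check health, and roll back automatically if something looks wrong. The deploy_to, health_ok, and rollback helpers are hypothetical stand-ins for whatever a real deployment system exposes; the only point is that the blast radius of a bad change is capped at one batch.

```python
# Sketch: metered rollout with health checks and automatic rollback.
# Helper functions are hypothetical placeholders for a real deployment API.

import time

def deploy_to(server: str, change_id: str) -> None:
    print(f"deploying {change_id} to {server}")

def health_ok(servers: list[str]) -> bool:
    # In a real system: query error rates / latency for the updated servers.
    return True

def rollback(servers: list[str], change_id: str) -> None:
    for s in servers:
        print(f"rolling back {change_id} on {s}")

def metered_rollout(servers: list[str], change_id: str,
                    batch_size: int = 2, bake_seconds: int = 1) -> bool:
    updated: list[str] = []
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for server in batch:
            deploy_to(server, change_id)
        updated.extend(batch)
        time.sleep(bake_seconds)          # let the change "bake" before continuing
        if not health_ok(updated):
            rollback(updated, change_id)  # a mistake never reaches every server
            return False
    return True

metered_rollout([f"web{i:02d}" for i in range(1, 9)], "config-1234")
```

Even with batches this small, the whole fleet still converges quickly when the change is healthy; the metering only slows down the changes that should never have gone wide in the first place.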
Those are the lesson’s I’ve drawn so far. What do you think?
Other Blogs on TiP
· TIP-ING SERVICES TESTING BLOG #1: THE EXECUTIVE SUMMARY
· Ship your service into production and then start testing!
Images from a blog post on WhatWillWeUse.com - https://whatwillweuse.com/2009/12/03/hold-on-40-minutes-while-i-bing-that/
Comments
- Anonymous
December 05, 2009
Your points 1, 2 and 3 are spot on, Ken. And 3 can also be restated from "doesn't affect every server in production" to "doesn't affect every USER in production." Same thing, but the latter message keeps your eyes on the prize. I would add a 4th point.
- Do your TiP (at least your higher risk TiP) during off-peak hours.
- Anonymous
December 07, 2009
Seth, we exchanged emails on this internally and at first I agreed with your point, and actually still do, but I want to make a counterpoint that is more about the right mindset for deployment and change management. When a team approaches deployments and any sort of change as an off-peak activity, they tend to get into a mindset that allows them to cut corners in this area because the risk is "lower." As you know, I run an operations team along with dev, test, and PM. On the ops side I have several teams that insist on deploying over the weekend because that is off-peak and lower risk. The problem is they have never automated their deployments. Manual steps will always eventually fail. One more point I'd like to make on off-peak: that mindset also assumes you won't be wildly successful with 24x7 demand. If a service is successful, it will have worldwide demand all the time. So yes, if there is an off-peak window and the change is risky, it is OK to consider this option, but it is not the point to start from.