Firefighting with Broadswords

04/09/2010

Broadsword Techniques

Every good problem solver needs a good Broadsword technique in their arsenal. A Broadsword fix is a method that eliminates several potential causes in one fell swoop.

A good example of a Broadsword in the Exchange realm is what we dubbed “The Ninja Trick.” The Exchange 2000/2003 metabase file is easily corrupted by disk issues or anti-virus software. Once that corruption occurs, the fix is to fully uninstall IIS (removing the metabase) and then reinstall IIS and Exchange over the top. Depending on how readily available the customer’s installation media was, this process generally took under 90 minutes.

What we discovered is that the Ninja Trick solved all sorts of issues that weren’t necessarily metabase-related. That is because the process not only replaced the corrupted file, but also reset the core Exchange files, removed any hooks into Transport and reapplied service packs. This essentially made the technique a catch-all for a large number of common issues that customers were seeing.

Other basic examples of Broadswords:

- FFR (FDisk, Format and Reinstall)

- Migrating Users/Data off a problem server

- Re-installing (Starting Over)

- Replacing Hardware

Time vs Results

In today’s IT culture, time is at a premium. SLAs continue to demand greater uptime as High Availability solutions aim to make disaster recovery seamless. However, as these environments become more complicated, with more moving parts, they often become harder to fix when things go wrong.

Broadsword techniques are valuable in some situations but they often cause unexpected problems as well. The Ninja Trick solved many tough cases, but any customizations that were made to the server were removed when it was used. Anti-virus, webpages, disclaimer software and various other things needed to be reconfigured to get the server back into full production.

So the question becomes: “When is it appropriate to use a Broadsword?”

The answer is: when troubleshooting and fixing the issue will take longer than using the sword and cleaning up the collateral damage.
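
The rule above can be written as a simple comparison. The following is only a minimal sketch: the function name and the hour figures are hypothetical, and in practice every input is a rough estimate rather than a measured value.

```python
def should_use_broadsword(est_troubleshoot_hours, sword_hours, cleanup_hours):
    """Return True when the sweeping fix plus its cleanup is expected to be
    faster than continuing to troubleshoot and fix the issue directly."""
    return est_troubleshoot_hours > sword_hours + cleanup_hours


# Hypothetical estimates: the Ninja Trick took under 90 minutes (1.5 hours),
# plus an assumed 2 hours to reconfigure anti-virus, webpages and disclaimers.
print(should_use_broadsword(est_troubleshoot_hours=8, sword_hours=1.5, cleanup_hours=2))  # True
print(should_use_broadsword(est_troubleshoot_hours=2, sword_hours=1.5, cleanup_hours=2))  # False
```

The hard part is not the arithmetic but being honest about the first number, which tends to be underestimated while a problem still feels close to solved.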

A friend of mine recently ran into a major problem with a High Availability solution he had designed for his environment. The requirements of his solution were:

1. Migrate his 160 users from Exchange 2003 to Exchange 2007

2. Provide uninterrupted email service in the event of a server failure

3. Provide High Availability for a critical folder that shared financial data

4. No single point of failure

5. $40,000 budget including software and user CALs.

Our solution was:

1. 2 new HP DL180s with 500 GB of storage in RAID 10 and 8 GB of RAM

2. 2 copies of Windows 2008 STD and 2 copies of Exchange 2007 STD

3. CA XOSoft HA and Arcserve backup

4. 1 tape backup

5. Final price was just over $40,000

In this solution, XOSoft would be responsible for replicating the Exchange data and file share between the nodes. In the event of a server failure, the XOSoft software would fail over to the second machine, which contained all of the current data. Arcserve would provide backup. This solution met all of the requirements for the project.

However, after everything was built and testing began, he found 2 major issues: failover between the nodes took over 30 minutes, and one of the nodes sporadically rebooted itself without warning.

Thus began 3 weeks of constant troubleshooting that completely engrossed my friend and kept him from his regular duties for his company. For the most part, he did everything right during his troubleshooting. He tried to make the server generate warnings and errors. He contacted Microsoft and Computer Associates to open trouble tickets. He researched other known issues and dug through the servers with a fine-tooth comb.

About 3 days after the problems started, he called me to get my take on the situation. Since XOSoft had only recently released a version that was supported on Windows 2008, I recommended a Broadsword: rebuilding the whole solution on Windows 2003. Unfortunately, he was dead set on his original solution and continued to fight it for another 2 weeks.

I went by to help him and found that the failover problem was being caused by TCP/IP settings on the Domain Controller and that Arcserve was crashing the passive node due to an incompatibility with Windows 2008.

This case study is a perfect example of the Time vs. Results problem that many engineers and IT professionals are subject to.

My friend dedicated countless hours of troubleshooting and frustration to this problem. He was constantly harassed by his management and fell far behind on his duties because his ego wouldn’t allow him to start over with a better-tested OS version.

People who are good troubleshooters often find it difficult to STOP working a problem until they find the solution. They take it personally and refuse to accept that the root cause may be out of their control.

In most cases, this is a good thing! It is that unwillingness to admit defeat that drives them to learn new things and find unique solutions to otherwise unsolvable tasks. In fact, when hiring a technical person, that is the single most important trait that a company will look for.

Ego is what makes a good troubleshooter.

Being able to let go of one’s ego is what makes a GREAT troubleshooter.

Take the following scenario:

There are 4 Edge servers in a DMZ. It takes one of these servers 4 times longer to deliver mail than the other 3. You log into the server and there is nothing obviously wrong. You turn up logging, enable performance monitoring and monitor the server for an hour. You find some events that may or may not be related and spend 2 hours researching them before realizing they are expected errors. Now, you start comparing this server to the others to see what is different and again find nothing obvious.

At this point you have probably spent over 3 hours troubleshooting an Edge server that can easily be rebuilt on the same (or new) hardware and re-integrated into your environment. It is time to stop troubleshooting this issue and use a broadsword.

- There is no guarantee that spending more time troubleshooting will provide any new information.

- The broadsword will probably take under 2 hours to complete.

- There is minimal collateral damage in this case.

- If the issue returns on the same hardware you have essentially ruled out software as the problem.

- If the issue returns on different hardware you have ruled out hardware as the problem.

So by using a broadsword technique of rebuilding the server, you have either fixed the issue or cut the number of potential causes in half. This is most likely far more than could have been accomplished by additional troubleshooting in the original state.
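
The reasoning in the list above amounts to a small decision table. Here is a minimal sketch of it; the function name and the returned strings are hypothetical labels for illustration, not part of any tool.

```python
def interpret_rebuild(issue_returned, same_hardware):
    """Summarize what a server rebuild tells you, following the rules above."""
    if not issue_returned:
        return "fixed"
    if same_hardware:
        # Fresh software on the same box still fails: software is ruled out.
        return "software ruled out"
    # The issue followed the server onto a different box: hardware is ruled out.
    return "hardware ruled out"


print(interpret_rebuild(issue_returned=False, same_hardware=True))   # fixed
print(interpret_rebuild(issue_returned=True, same_hardware=True))    # software ruled out
print(interpret_rebuild(issue_returned=True, same_hardware=False))   # hardware ruled out
```

Every branch either ends the problem or narrows it, which is why the rebuild is worth doing even when it does not fix anything.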

Downsides

There are other potential downsides to using a Broadsword. The main one is that it makes definitive Root Cause Analysis (RCA) far more difficult, if not impossible. This is usually the case when multiple changes are made at the same time. In some situations it is more important to get a fix and get it fast, while in others the need to understand why the issue occurred in the first place takes precedence.

So whether or not RCA is required should always factor into the decision to make a sweeping change.
