Bug Psychology

Fixing bugs is hard.

For the purposes of this posting, I’m talking about those really “crisp” bugs -- those flaws which are entirely due to a failure on the developer’s part to correctly implement some mechanistic calculation or ensure some postcondition is met. I’m not talking about oops, we just found out that the product name sounds like a rude word in Urdu, or the specification wasn’t quite right so we changed it or the code wasn’t adequately robust in the face of a buggy caller. I mean those bugs where you were asked to compute some value and you just plain get the result wrong for some valid inputs.

Let me give you an example.

The first bug I ever fixed at Microsoft as a full-time employee was one of those. To understand the context of the bug, start by reading this post from the early days of FAIC, and then come back.

Welcome back, I hope you enjoyed that little trip down memory lane as much as I did.

Now that you understand how a VT_DATE is stored, that explains this bizarre behaviour in VBScript:

print DateDiff("h", #12/31/1899 18:00#, #12/30/1899 6:00#) / 24
print DateDiff("h", #12/31/1899 18:00#, #12/29/1899 6:00#) / 24

This prints –1.5 and –2.5, as you’d expect. There’s a day and a half between 6 AM December 30th and 6 PM December 31st, and two and a half days between the other two dates. This is perfectly understandable. But if you just subtract the dates:

print #12/31/1899 18:00# - #12/30/1899 6:00#
print #12/31/1899 18:00# - #12/29/1899 6:00#

You get 1.5 and 3, not 1.5 and 2.5. Because of the bizarre date format that VT_DATE chooses, when you convert dates to numbers, you cannot safely subtract them if they straddle the magic zero date. That’s why you need the helpful “DateDiff”, “DateAdd” and so on, methods.

The bug I was assigned was that testing had discovered a particular pair of dates which DateDiff was not subtracting correctly. I took a look at the source code for one of the helper methods that DateDiff used to do one of the computations it needed along the way. To my fresh-out-of-college eyes it looked something like this:

if (frob(x) > 0 && blarg(y)) return x – y;
else if (frob(x) < blarg(y) && blah_blah(x) > 0 || blah_de_blah_blah_blah(x,y)) return frob(x) – x + y + 1;
else if…

There were seven such cases.

My urge was to dive right in and add an eighth special case that fixed the bug. But my ability to get it right in the face of all this complexity concerned me. It seemed like this was an awfully complicated function already for what it was trying to do.

I researched the history of the code a bit and discovered that in fact variations on this bug had been entered… seven times. Each special case in the code corresponded to a particular bug that had been “fixed”, a term I use guardedly in this case. A great many of those “fixes” had actually introduced new bugs, regressing existing correct behaviour, which then in turn were “fixed” by adding special cases on top of the broken special cases that had been added to “fix” previous bugs.

I decided that this coding horror would end here. I deleted all the code (all seven lines of it! I was bold!) and started over.

Deep breath.

Spec the code requirements first. Then design the code to meet the spec. Then write the code to the design.

Spec:

  • Input: two doubles representing dates in VT_DATE format.
  • VT_DATE format: signed integer portion of double is number of days since 12/30/1899, unsigned fractional part is portion of day gone by. For example: –1.75 = 12/29/1899, 6 PM.
  • Output: double containing number of days, possibly fractional, between two dates.  Differences due to daylight savings time, and so on, to be ignored.

Design strategy:

  • Problem: Some doubles cannot simply be subtracted because negative dates are not absolute offsets from epoch time
  • Therefore, convert all dates to a more sensible date format which can be simply subtracted.

Code:

double DateDiffHelper(double vtdate1, double vtdate2)
{
  return SensibleDate(vtdate2) – SensibleDate(vtdate1);
}
double SensibleDate(double vtdate)
{
  // negative dates like –2.75 mean “go back two days, then forward .75 days”:
  // Transform that into –1.25, meaning “go back 1.25 days”.
  return DatePart(vtdate) + TimePart(vtdate);
}

I already had helper methods DatePart and TimePart, so I was done. The new code was shorter, far more readable, generated smaller, faster machine code and most important, was clearly correct. No special cases; no bugs.

It’s not that my coworkers were dummies. Far from it. These were smart people. But computer geek psychology is such that it is very easy to narrow-focus on the immediately wrong thing, and try to tweak it until it does the right thing.

When faced with these sorts of “crisp” bugs, I try to restrain myself from diving right in. Rather, I try to psychoanalyze the person – who is, of course, usually my past self – who caused the bug. I ask myself “how was the person who wrote the buggy code fooled into thinking it was correct?” Did they not have a clear specification of what the method was supposed to do? Was it misleading? Did they have a clear plan for how to proceed? If so, where did it go wrong?

If there never was either a spec or a plan, then for all you know the whole thing might only be working by sheer accident. There could be any number of design flaws in the thing that just haven’t come to light yet. Editing such a beast means adding unknown to unknown. which seldom leads to good results. Sometimes coming up with a new spec, a new plan and scrapping an existing bug farm is the best way to proceed.

For many years after that, I would ask how to implement DateDiffHelper as my technical question for fresh-out-of-college candidates that I was interviewing for the scripting dev team. I reasoned that if that was the sort of problem I was given on my first day in the office, then maybe that would be a reasonable question to ask a candidate.

When you ask the same question over and over again, you really get to see the massive difference in aptitude between candidates. I had some candidates who just picked up a marker, wrote a solution straight out on the board, wrote down the test cases they’d use to verify it, mentally ran a few of the tests in their head, and then we’d have another half hour to chat about the weather. And I had some candidates who tried earnestly to write the version using special cases, despite my specifically telling them “you might consider transforming this bad format into something more pleasant to work with”, after they got stuck on the third special case. I’d point out a bug and immediately they’d write down code for another special case, rather than stopping to think about the fact that they’d just written buggy code three times already and told me it was correct three times.

Comments

  • Anonymous
    June 01, 2009
    Looks like this has been on your mind a lot recently: it's basically a longer version of something I saw you write on StackOverflow recently: http://stackoverflow.com/questions/921180/c-round-up Indeed, that was the inspiration. -- Eric

  • Anonymous
    June 01, 2009
    Of course, you are still using cases. It's just that by focusing on one VT_DATE at a time to translate it into your sensible format, you only have to deal with 2 cases, rather than the many cases that come from the combined properties of 2 VT_DATEs, apparently so many they have not been successfully enumerated. But why were you using the horrible VT_DATE format in the first place? VBScript uses that format to be compatible with VB. VB uses that format to be compatible with OLE Automation. OLE Automation uses that format to be compatible with Excel. Excel uses that format to be compatible with Lotus 1-2-3. With hindsight, the right thing to do would have been to say "Lotus chose a lousy way to represent dates, let's write a translation tool that turns their goofy date format into something sensible and industry-standard". But we did not do so when we had the chance and we're stuck with it now. -- Eric (PS: As a historical note, VB and OLEAUT were owned by the same team, so it is perhaps not quite fair to say that VB does it that way "because of OLEAUT". They both evolved that behaviour at the same time.)

  • Anonymous
    June 01, 2009
    The comment has been removed

  • Anonymous
    June 01, 2009
    Why even have the unnecessary (vtdate >= 0.0) check when everything can fall into 1 case: return DatePart(vtdate) + TimePart(vtdate); Good point; my way is a potentially premature optimization. I'll fix it. Though in fact, given that probably 99.99% of dates we see are positive, and given how tight the rest of the code was already, it was actually probably a reasonable optimization to take the early out. I don't recall the details; this was 13 years ago after all. - Eric

  • Anonymous
    June 01, 2009
    The Lotus bit reminds me of this story: http://www.astrodigital.org/space/stshorse.html Finding a way to replace useful but flawed code seems to be one of Microsoft's greatest challenges.

  • Anonymous
    June 02, 2009
    So, do these candidates that do exactly the same as your smart coworkers fail immediately or do they have a chance? When I specifically tell the candidate "this problem is difficult to solve by cases, consider transforming the data into a better format for arithmetic", and they keep on trying to solve it by cases, no hire. When I write a bunch of interesting test cases on the board and they write me code that does not pass two of the four test cases, and they tell me it is correct, no hire. Heck, I've had candidates who were unable to write the code that computes the absolute value of the difference between two doubles. Those people will not be successful on a compiler team. No hire. -- Eric 

  • Anonymous
    June 02, 2009
    "we just found out that the product name sounds like a rude word in Urdu" with Urdu as my native language, I would love to hear this story. Regrettably, there's no story there. I just made that up as characteristic of the sorts of geopolitical issues we sometimes run into. Usually we don't have too much difficulty with geopolitical issues on the programming language tools teams, but anyone who, say, displays a map of the world as part of their user interface is required to ensure that the map is carefully designed to not implicitly "take sides" in geopolitical disputes. I'm sure you can think of examples. -- Eric

  • Anonymous
    June 02, 2009
    "...immediately they’d write down code for another special case, rather than stopping to think..." ...they are just trying to SERVE!!! How many times have we all been told that SOFTWARE DEVELOPMENT IS A SERVICE INDUSTRY? :-) Some people just confuse "service" with "subservience" or -- worse -- "servility". So, I've writtent some buggy code?! Sorry, Master! Right away, Master! Whatever you say, Master! Here is another "if-else," Master... A sad state of affairs, I'd say... :-( @configurator: I'd probably ask them to 1)relax, 2)shape up a bit, and 3)try again, calmly. Then I'd decide whether they deserve a chance or not...

  • Anonymous
    June 02, 2009
    I'm sure I remember Code Complete containing a warning against the root cause of all these VT_DATE issues - combining two pieces of data into one data item. The VT_DATE format would be better formulated as a tuple or struct containing (days_offset, offset_in_day), but instead those two parts have been shoehorned into a single double, leading to the mistaken assumptions about how arithmetic works with a VT_DATE because its representation is a familiar type.

  • Anonymous
    June 02, 2009
    Off topic, but any news of Wes Dyer? Very quiet on his blog... Wes drops by the C# language design committee meeting every now and then but I haven't heard much from him either lately. He seemed to be very well when last I saw him though. -- Eric

  • Anonymous
    June 03, 2009
    Great post. I always thought that a large IF is 90% a result of a bad design.

  • Anonymous
    June 03, 2009
    The comment has been removed

  • Anonymous
    June 03, 2009
    Of course the Chevy Chase approach to Jane Curtain during Weekend Update on SNL (in the 1970's) is an alternative approach to dealing with peoples mistakes [You should hear some of the names I have called myself].

  • Anonymous
    June 03, 2009
    if (frob(x) > 0 && blarg(y)) return x – y; else if (frob(x) < blarg(y) && blah_blah(x) > 0 || blah_de_blah_blah_blah(x,y)) return frob(x) – x + y + 1; else if… Ha ha! So true.

  • Anonymous
    June 03, 2009
    // negative dates like –2.75 mean “go back two days, then forward .75 days” Sweet. Did you ever consider making VT_DATE turing-complete? Lol. That almost made me spit-take my Diet Dr. Pepper. -- Eric  

  • Anonymous
    June 03, 2009
    There are bugs for which there are no fixes.  Accept it.  Software is just a flawed thought process made real.  But that doesnt mean all bugs can be fixed.  Because underneath the hood of that bug, there are 700 other bugs. Just try and stop whining about other coders.  Fact is software can never ever be bug free. Ever.   Ever.

  • Anonymous
    June 03, 2009
    Исправлять баги трудно. В контексте этой статьи я говорю о достаточно «явных» багах – тех изъянах, которые

  • Anonymous
    June 03, 2009
    @bob Most implementations of Hello World are pretty bug free. It's also perfectly possible to structure your code so that it's provably correct. Read some of Edsger Dijkstra's works if you wish to be convinced of that. I might be able to agree that sufficiently complex software is astronomically unlikely to be bug free  (for a sufficiently vague definition of 'sufficiently complex').

  • Anonymous
    June 04, 2009
    The comment has been removed

  • Anonymous
    June 04, 2009
    Essentially, code modification usually can be done to refactor existing code to be  simpler, better or more robust.  Also, improve the inline comments with notes about the cases handled by the code. Continual application of this slow 0.01% improvement will turn a large system from a mess into much less of a mess within a year or two.  Six Sigma.

  • Anonymous
    June 04, 2009
    I remember a similar bug from the days of Visual C 4.1 (mid-90s): We were coding a healthcare system and old patients, born before 12/30/1899, got their happy birthday greetings etc. on the wrong day when we upgraded from VC 4.0 to VC 4.1. I think it was a datediff() flavored function that had the same bug as you mention VBScript had. It got fixed in VC 4.2. BTW, there are so many facets re time and dates and computers: I remember in the 80's when Microsoft acquired Lattice C and later released their own Microsoft C 5 and later version 6, great products. One of the time functions had a safeguard so that when run on a PC without a hardware clock, it would instead return a date from 1955. And I always wondered if that was Bill Gates birthday... Rgrds Henry

  • Anonymous
    June 04, 2009
    The comment has been removed

  • Anonymous
    June 05, 2009
    Grant, as interesting as the story is about how the standard railway gauge is derived from ancient Roman standards, it's purely made-up. http://en.wikipedia.org/wiki/Standard_gauge says why: there were many gauges used in the old days of railroads. Since old railways (often used to deliver coal from mines) were horse-driven, it's not surprising that the gauges were mostly around 5 feet, about the width of two horses. Therefore it shouldn't be surprising that ancient Roman wheel ruts were also about the width of two horses. The similarities are only a coincidence. The current standard gauge only came about as a convergence of many different standards, not because of a direct lineage to Roman chariots.

  • Anonymous
    June 05, 2009
    Interesting read. I was going to ask the same question as Faraz asked. :)

  • Anonymous
    June 09, 2009
    Welcome to the 49th edition of Community Convergence. The big excitment of late has been the recent release

  • Anonymous
    June 09, 2009
    The comment has been removed

  • Anonymous
    July 08, 2009
    Consider the product out there that depends on the original, buggy implementation. Your crisp, clean solution breaks backwards compatibility.

  • Anonymous
    September 03, 2009
    @Denis.   when fixing bugs, it is very helpful to know who did it so you know the style of thinking of whoever do it.     Trying to figure out why somebody was doing something on a certain way it's way more difficult if you don't have background info (like "this guy always follows the happy path, be way of uncheked previous errors") The "How did he/she get fooled on this" is easier to answer then.

  • Anonymous
    February 23, 2010
    Thanks for this post. I the add date and diffdate are really useful.

  • Anonymous
    March 12, 2010
    Gabe and Grant, just to ruin the old classic a bit more... Not only does the gauge only have a very vague relation to Roman chariots, the parts of the shuttle were never transported through a tunnel like that described in the myth. Who comes up with these things?