A Simple Puzzle

My original version of the histogram-generating code that I whipped up for the previous episode of FAIC contained a subtle bug. Can you spot it without going back and reading the corrected code?

private static int[] CreateHistogram(IEnumerable<double> data, int buckets, double min, double max)
{
int[] results = new int[buckets];
double multiplier = buckets / (max - min);
foreach (double datum in data)
{
int index = (int) ((datum - min) * multiplier);
if (0 <= index && index < buckets)
results[index] += 1;
}
return results;
}

Note that of course if this were production code, instead of demo code I whipped up in five minutes, it would be a lot more robust in its error detection; the bug that I am looking for is a bona fide error in the logic of the method, rather than things like "the method does not verify that min is smaller than max", and so on.

A hint: the first time I ran this code and displayed the results, the generated histogram looked fine. Then I made a small change to the arguments and the resulting histogram image was obviously wrong. Can you spot the defect?

Comments

  • Anonymous
    February 23, 2012
    It seems a bit odd that you are casting an int to an int inside the array index. It looks like you may have mistakenly included a hint. You are correct; that was an editing error. I've removed it. Thanks! -- Eric

  • Anonymous
    February 23, 2012
    unchecked
    {
       var value = (Int32)4294967296.0;
       Console.WriteLine(value == 0);
       // And so on...
    } I'm not following you. How does this relate to the bug in my code? -- Eric  

  • Anonymous
    February 23, 2012
    CreateHistogram discards all data outside the [min, max) range - specifically, any datum equal to max is discarded.  That may be intentional? That is intentional; the intention is to display the portion of the histogram between the ranges. However, you're on the right track. Does it in fact discard all data outside the desired range? -- Eric

  • Anonymous
    February 23, 2012
    Ah tricky. I admit that it took me a few more minutes to see the issue even after I looked up the correct code. (Spoiler: I knew it would somehow be about negative numbers but the code was correct even when min is negative, at that point I lost hope and cheated)

  • Anonymous
    February 23, 2012
    Hmph.  And I thought I knew C#. :) Eric, please explain why that doesn't behave the way I expected it to.

  • Anonymous
    February 23, 2012
    Is it related to the relationship of buckets to max-min? If there were only one bucket (which is permitted even if silly) it looks like you could get an index that matches no bucket. I need to read these kind of puzzles on something other than my iPhone. The issue I have in mind will come up if there is only one bucket, yes, but the issue I have in mind affects histograms with any number of buckets.  -- Eric

  • Anonymous
    February 23, 2012
    [Addition to my first comment] Double is a 64bit type and Int32 is 32bit. The cast will erase a part of the index before we ensure that the initial operation is in range. Indeed, this is a problem; good catch. However, in this particular case all the data are reasonably sized and the program still has a bug. -- Eric

  • Anonymous
    February 24, 2012
    Agreeing with Lars... any datum equal to the maximum incorrectly gets discarded. And what's the problem with that? It's a histogram of tens of thousands of random data; the odds of one of them being exactly the maximum are tiny, and if it is, who cares if one datum in a hundred thousand is accidentally not displayed? We're already not displaying any data outside of the given range. -- Eric

  • Anonymous
    February 24, 2012
    Isn't this skewed by 0.5 in the cast to int? You tell me. Is it, or isn't it? -- Eric

  • Anonymous
    February 24, 2012
    I'm stuck on the cast to int... it looks like you should be performing some actual rounding logic there to make sure index is the appropriate value when dealing with a min and max that are negative. You're getting close. -- Eric

  • Anonymous
    February 24, 2012
    Data that would be in the first bucket before the histogram will be included in the first bucket of the histogram. (Because "(int)x" will evaluate to 0 for x in ]-1,1[) You got it! -- Eric

  • Anonymous
    February 24, 2012
    Short and sweet: any time (datum-min)*multiplier gives you a result between -1 and 0, not inclusive, the cast to int will always result in a zero. Thus not all data is excluded, the bucket at index 0 gets one buckets worth of extra data, however much that may be.  If you do the cast after checking the range, you will exclude those values. You got it! Beaten by mere seconds by jhominal. -- Eric

  • Anonymous
    February 24, 2012
    Not sure about the C# semantics on the (int) cast from double. My first assumption was that it would truncate, so that eventually, with large enough (or small enough) values of datum, index would wrap around back into the allowed range. Also, I'm agreeing with the others that if datum == max, it's discarded. When I tested casting with gcc, I got this behavior (int)4294967297 -> -2147483648 In fact, casting any large double to int gave me -2147483648. If I cast it to a long, then cast the long to an int, I got 1, as I would expect with truncation.

  • Anonymous
    February 24, 2012
    Eric, If I were to write this code, I would automatically write it as: foreach (var datum in data) {    if(datum>=min && datum<max)    {        bla bla    } } and there can be no bug even if I'm not careful. Is there any particular reason you did not choose to write that way?

  • Anonymous
    February 24, 2012
    The only time it goes into the last bucket is when the value is equal to max.  Did you mean to use a rounding function when going from double to int instead? I'm not following your train of thought here. Suppose min is 0, max is 5 and buckets is 10. Multiplier is 2. If one of the data is, say, 4.7 then we multiply by 2 to get 9.4 and then round off to 9, which is the index of the last bucket. But 4.7 is not equal to 5. So I've provided a counterexample to your claim; would you care to clarify the claim? -- Eric

  • Anonymous
    February 24, 2012
    I just watched a presentation by Bret Victor last night called Inventing on Principle, and he talks about an interface that gives algorithm developers immediate insight into the results of their functions.  If you haven't seen it, you should watch this section of it: www.youtube.com/watch Watching the video and then reading this post made me think: Man, if Eric had this tool, he may not have had that bug. :)

  • Anonymous
    February 24, 2012
    @Ali I don't think that has anything to do with the bug, does it? Mathematicians would normally write something like (min<=value<max) to more clearly express how value falls inside a range with a lower and an upper bound. Indeed some languages such as Python actually allow that syntax. In languages like C# that don't allow that syntax, (min<=datum && datum<max) is the closest you can get.

  • Anonymous
    February 24, 2012
    This all sounds very familiar. This bears some relation to one of Eric's previous posts, "What's the difference? Remainder vs modulus": blogs.msdn.com/.../what-s-the-difference-remainder-vs-modulus.aspx In fact, I think in the comments I linked to an instance of a very similar bug.

  • Anonymous
    February 24, 2012
    @weeble0 I think you misunderstand me, my comment was "first check if the current point is in range, and if it is, only then calculate the index" seems a more natural way to code the algorithm. I wondered if there is some disadvantage I cannot see. (Except that, then there would be no interesting bug/puzzle to talk about)

  • Anonymous
    February 24, 2012
    @Ali Ah, sorry, I did misunderstand you.

  • Anonymous
    February 26, 2012
    private static int[] CreateHistogram(IEnumerable<double> data, int buckets, double min, double max) {  int[] results = new int[buckets];  double multiplier = buckets / (max - min);  foreach (double datum in data)  {    int index = (int) ((datum - min) * multiplier);    if (0 <= index && index < buckets)      results[index] += 1;    min = datum;  }  return results; }

  • Anonymous
    February 26, 2012
    The moral of all this: static typing does not replace testing ;-)

  • Anonymous
    February 28, 2012
    The comment has been removed

  • Anonymous
    April 21, 2012
    (max - min) / buckets ?