Spot the defect: Bad comparisons, part three

Artykuł
01/27/2011

Did you notice how last time my length comparison on strings was unnecessarily verbose? I could have written it like this:

static int ByLength(string x, string y)
{
if (x == null && y == null) return 0:
if (x == null) return -1;
if (y == null) return 1;
return CompareInts(x.Length, y.Length);
}

static int CompareInts(int x, int y)
{
// Positive if x is larger, negative if y is larger, zero if equal
return x - y;
}

static Comparison<T> ThenBy<T>(this Comparison<T> firstBy, Comparison<T> thenBy)
{
return (x,y)=>
{
int result = firstBy(x, y);
return result != 0 ? result : thenBy(x, y);
}
}

Much nicer! My string length comparison method is greatly simplified, I can compose comparisons easily with the ThenBy extension method, and I can reuse CompareInts as a helper function in other comparisons I might want to write.

What's the defect now?

This one should be easy after last time. The lengths of strings are always positive integers and in practice they never go above a few hundred million; the CLR does not allow you to allocate truly enormous multi-gigabyte strings. But though CompareInts is safe for inputs which are string lengths, it is not safe in general. In particular, for the inputs Int32.MinValue and Int32.MaxValue, the difference is 1. Clearly the smallest possible integer is smaller than the largest possible integer, but this method gives the opposite result! CompareInts should read:

static int CompareInts(int x, int y)
{
if (x > y) return 1;
if (x < y) return -1;
return 0;
}

The moral of the story here is that a comparison function that doesn't compare something is probably wrong. Subtraction is not comparison.

Comments

Anonymous
January 26, 2011
Sometimes I think that instead of returning an int, Compare functions should return an enumeration: enum CompareType //Built in. With less sucky naming. { LeftLarger = 1, Equal = 0, LeftSmaller = -1 } static CompareType CompareInts(int x, int y) { if (x > y) return CompareType.LeftLarger; if (x < y) return CompareType.LeftSmaller; return CompareType.Equal; } The meaning of returning an int to a comparison function strikes me as non-obvious, despite its popularity.
Anonymous
January 26, 2011
I would have used x.CompareTo(y)... It also shows the intent much better than x - y
Anonymous
January 26, 2011
But subtraction is comparison, if you take the overflow and carry flags into account!
Anonymous
January 27, 2011
Brian, the value returned by the function could be used by the sorting function as an optimization hint. (From memory I read some documentation about this, I think it was for Delphi, too long ago) It goes something like this: The further away from zero the returned value is, the more 'unequal' both items are. An optimized sorting function could use this knowledge in an attempt to skip a few items. Diff : A - B = 1 Diff : E - F = 1 Diff : A - E = 5 Therefore comparing A to F could be skipped since: A and B are real close, E and F are real close A and E are really far appart therefore A and F could never be close.
Anonymous
January 27, 2011
I had the same thought as Thomas. Why would you not use x.CompareTo(y)?
Anonymous
January 27, 2011
@Bas: But then your spec for a comparison function would have to say that the value is not only of a certain sign, but also proportional to the difference. The comparison functions most of us use are only spec'ed as returning positive, negative, or zero, with no importance given to the magnitude of the positive or negative return values. If your sorting function assumes something about the magnitude of those values without the spec requiring it, you are wandering into very dangerous territory.
Anonymous
January 27, 2011
@Austin, to steal something Eric has said before, if CompareTo did not exist, how would you write it? Because all of the implementations of CompareTo, somebody had to write those. And people are still writing comparison functions, and they need to write good ones. No doubt many if not all of the examples presented in this blog series are from bug reports, either internally at Microsoft or from customers.
Anonymous
January 27, 2011
Substraction is a comparison for unsigned integers. For example, this is wide spread technique in embedded development.
Anonymous
January 27, 2011
The comment has been removed
Anonymous
January 27, 2011
@Bas @Mwf: The optimization hint scenario is feasible, though compare does not promise any meaning on the int, so it is only practical under two situations:

The algorithm (e.g., sort) will still run successfully (and within any guarantees of speed it makes) even if the distance from 0 is arbitrary.
Only specialized libraries should use this type of optimization, since people will probably not be expecting this kind of optimization, and it may slow things down. Anyhow, this kind of optimization strikes me as the kind of thing that should NOT be built into the framework. The people who need specialized comparisons can do some combination of using their own comparison libraries and just casting the enumeration to an int.

Anonymous
January 27, 2011
@Sergey: No, it isn't. The range of the result of a subtraction of values constrained only by type is always twice the range of the type, even if the type is unsigned, so wraparound is still problematic. Of course, as Eric alluded to with the size of strings argument, if the values are constrained to a subset of the values allowed by the type, it may work.
Anonymous
January 27, 2011
I've seen the x - y and -cmp idioms so many times -- even in prominent examples of "well-written" code -- that I've never thought to question them. Needless to say, I've grepped all my source code to check my comparisons now. Found one bug (-cmp to reverse), and one x - y which was okay because of restricted range, but I've added a comment to it now. So thanks, Erik, for these posts! :-) @Bas: That "optimization" sounds pretty ridiculous. :-) I don't think you can safely glean information from any general comparison function. (Maybe you can make some assumptions if you're only sorting integers.) From your example, it does seem clear that A need never be compared to F, but for a different reason: E - F = 1 (implies E > F) A - E = 1 (implies A > E) A > E > F... so A > F.
Anonymous
January 27, 2011
The origin is probably strcmp - strcmp did this in the original implementation in unix, and this is what the "may return any appropriate-sign value" contract was invented for. It was safe then, since what was being subtracted was chars which are smaller than ints, but it got it into people's heads that this is for some unknown reason a terribly clever thing to do (it avoids one or two branches, which was relevant on the PDP-11 but not so much today, and nevermind that strcmp requires three branches per character until the point they are different) @Bas Mommenhof: What kind of sorting algorithm could make use of this and be faster [including the extra work for the comparison function to decide if the inputs are "close" or not] than a standard sorting algorithm? Also, this is an extra contract on top of it, and doesn't fit with current usage [with traditional strcmp implementations, "aa" "ab" results in 1, "aaaa" "aaaz" results in 25.
Anonymous
January 27, 2011
The comment has been removed
Anonymous
January 27, 2011
@Everyone: I am not actually advocating this. Im only stating what I think I remember i have read a long time ago. Brain asked where the int was comming from, this was a possibility. I am not too happy about everyone disagreeing with it, because that could be an indication that my memory is failing. I must be getting too old. (I still remember floppy disks.)
Anonymous
January 27, 2011
"I still remember floppy disks." Oh for crying out loud. I just used a floppy disk a couple of days ago. If simply remembering floppies makes you old, what does that make me? :) (Granted, it was the "modern" 3.5" style...I admit it's been a few years since I've used a "floppy" where the outside was as floppy as the inside :) ). "…but I say let's have a new all-encompassing Number type" Good idea! Count me in. :) (Actually, don't some of the other "less-rigorous" languages have this sort of unbounded, arbitrarily precise numerical data types? You're always going to run into problems with precision for non-integral types, but I agree that otherwise we should not have to worry about overflow any more).
Anonymous
January 27, 2011
To those suggesting the use of arbitrary precision floating point as the default do let me know what should be supplied when asking for pi. I can see arguments for using bigint by default in a language, python does this very nicely I think, but the Real vs Rational issue will always be a problem
Anonymous
January 28, 2011
"To those suggesting the use of arbitrary precision floating point as the default" Who is doing that? In fact, I specifically called out floating point formats (i.e. "non-integral types") as a case where it's not possible to have the data type arbitrarily unbounded. Methinks you are failing the "give the benefit of the doubt" test. :p
Anonymous
January 28, 2011
The comment has been removed
Anonymous
January 29, 2011
I think it's definitely a good point to point out the integer overflow potential in comparison functions. However, the mathematically-minded purist in me twitches when you say "Subtraction is not comparison". In arbitrary precision universes it is, especially if you use the System.Numerics.BigInteger types from the .NET framework, then use the .Sign property to translate it back into the 32-bit domain :) In my mind, a good software engineer has a leg in two domains - the theoretical computer science domain, as well as in the pragmatic engineering domain. This is a good reminder article for the engineering aspect :)
Anonymous
January 29, 2011
Moral of the Story: Don't use an int as the return value of a function that is supposed to have 3 distinct different outcomes. This spec of the comapre function may have made sense in the 70's, when the MicroProcessor was the latest new thing, but one wonders how a thing like that could ever have ended up in .NET
Anonymous
January 30, 2011
I agree with Ferdinand. The return value of the comparison always just smelled wrong to me, as a remnant of the C++ days. Agreed, it enables you to write "return x-y" code, but again, is a method that does not compare, a comparer? What baffles me, is that C# got all those thing right, why was this left shaky at best?

Udostępnij za pośrednictwem

Spot the defect: Bad comparisons, part three

Comments

Dodatkowe zasoby