The DLinq Dilemma: SQL shall be SQL, and C# shall be C#, and never the twain shall meet

With a tip of a hat to Rudyard and his ballad, I would say

Oh SQL is SQL and C# is C#, and never the twain shall meet
Till rows and objects manually a dev has both to beat
But there is neither SQL nor C#, nor join nor dot
When LINQ stands to link the twin what magic have I got

Ballads aside, this is a real problem that we are grappling with. There is really nothing new about this problem - it is the problem of Object Relational Mapping (ORM). The two worlds are rules by languages and run-times that have different semantics. The problem is less acute when you are providing a window to the other through classic API and string query language based ORM components. It is more important when you are truly bridging the two with language integration - that is what LINQ does - with its LINQ to SQL (aka DLinq) component. Here is the first in the series and the most controversial one - null comparison.

In SQL, a = b is not true when a and b are both null (special settings aside).
In C#, a == b evaluates to true when a and b are both null.

(Asides:
1. I am talking about the behavior of shipped products here - not LINQ design choice so far. And I am not going to reopen the debate about null semantics in C# here. You can see posts in my blog and my colleagues' blog for background.
2. VB has a different semantics so this post is C#-specific)

So in the LINQ project, we have to pick our translation from C# to SQL. Consider a simple query:

from c in db.Customers
from s in db.Suppliers
where c.City == s.City
select new {c.CustomerID, s.SupplierID};

If we go with the C# behavior, SQL users moving to LINQ will be surprised.
If we pick SQL behavior then C# users may be surprised. Worse still, those who run the same query with data moved to in-memory collections will get different results if c.City and s.City happen to have null values.

In previous CTPs, we chose the SQL way. So the generated query looks like.

SELECT [t0].[CustomerID], [t1].[SupplierID]
FROM [Customers] AS [t0], [Suppliers] AS [t1]
WHERE [t0].[City] = [t1].[City]

Of course, we could have done a different translation for higher semantic fidelity with C# by changing the comparison to
WHERE t0.City = t1.City
OR (t0.City is null and t1.City is null)

Under this proposal, a user can still get SQL semantics by changing predicate in LINQ query to
c.City != null && c.City == s.City

So here are my questions:
1. Did you even notice the difference between C# and SQL semantics?
2. Which one do you prefer and why? (FWIW, we got no questions/comments about this as far as I remember - zip, zilch, nada)

There are more manifestations of null and other semantics but more about them and our reasons for the current choices in subsequent posts.

Comments

  • Anonymous
    July 16, 2006
    I prefer the SQL semantics.  Why?  Because when I'm using Linq for Sql, its Sql that is my target, and the "conversion" between Linq expressed in C# and Sql should be as close to what I express as possible.  This becomes even more true if you consider the differences between C# and VB that you've said to ignore for now -- don't ignore the difference, instead realize that its pointing you to the fact that the object language is not the real point here and never should be -- its all about Sql instead.

  • Anonymous
    July 16, 2006
    I would want the semantics to be the same for in-memory objects as they would be for SQL-based objects. That said, the most important thing is for the behavior to be transparent and well-documented.

  • Anonymous
    July 16, 2006
    oh~~~~~~~

  • Anonymous
    July 16, 2006
    I'd go with the C# way, if a is null and b is null, then they're both null, if == stands for equality, then all you need to do is look into the dictionary:

    "of the same measure, quantity, amount, or number as another"

    "like in quality, nature, or status"

    The problem here is what do we do if we also want to get the rows where c.City and s.City are null?, well I'd use "?", just like with nullable types in c#

    from c in db.Customers
    from s in db.Suppliers
    where (c.City == s.City)?
    select new {c.CustomerID, s.SupplierID};

    Just like we use ? to indicate that value types can be null....

    Cool ugh? :)

  • Anonymous
    July 16, 2006
    When using Linq targeting C#, I would expect it to honor the C# language. That includes how it handles == for nulls. How Linq does its magic is not important to me as long as it gets it done while hiding the details. When I write C# code, I expect C# behavior. If I expected SQL behavior, then I would write it directly in SQL, not Linq over C#.

  • Anonymous
    July 17, 2006
    I'd expect it to follow the SQL example. Just to take the example supplied, if you're doing a join between a customers table and a suppliers table do you really want to join all the customers with an unknown city to all the suppliers with an unknown city? Of course not- this is just going to produce a load of meaningless pieces of data...

  • Anonymous
    July 17, 2006
    Add two operation relational equal
    where c.City = s.City
    and C# equal
    where c.City == s.City
    so all people are happy

  • Anonymous
    July 17, 2006
    The comment has been removed

  • Anonymous
    July 17, 2006
    Having started development in database environments, my answer should be no surprise: I prefer the Sql way.
    As I have always understood, in Sql null stands for "unknown", so actually the outcome of the comparison null = null should be null too, because it is unknown if unknown equals unknown. As this is very impractical in use (imagine having to write "WHERE ISNULL/NVL(a = b, TRUE/FALSE)"), it was decided to let null = null be false.
    In practice however, null values are often used to indicate "not (yet) applicable" for a value, often used in modelling subtypes ("never the twain shall meet...") or state. I have to credit Dr. Rene Veldwijk, a dutch "Sql guru" (had to use the quotes as one of his rules of thumb is "never trust a so-called guru") for arguing that sql should introduce a second null("-ish") value to indicate just that.
    In that semantic context it would make sense to have null = null return true.
    In short: I second Edgardo in his wish to let the programmer decide upon the null-comparison behavior and the ?== operator looks very usefull to me, sorry to hear that it didn't make it. How about statement attributes (I personally prefer the ?== operator!):
    [StatementAttributes(NullBehavior=NullBehavior.CSharp/AnsiSql)]
    {
    <statement>
    }

  • Anonymous
    July 17, 2006
    I would vote for C# semantics. I thought that Linq was aiming to be a consistent abstraction over multiple data stores. To pick the semantics of one data store in one case and the semantics of another in another case breaks that abstraction and is likely to create confusion. If a Linq query (in C#) did one thing against a Linq to Sql provider and a different thing against a (say) Linq to Xml provider, this would be a mess. If the demand for an operator to support alternative semantics is overwhelming, then I can't see ?== doing too much damage to C#.

  • Anonymous
    July 18, 2006
    I'd prefer consistency with all the other queries; since this is C#, use C# syntax. The whole point of LINQ is to get away from datasource-specific conventions, and to stick with programming language conventions.

  • Anonymous
    July 19, 2006
    C# semantics.... abstract what is below....

  • Anonymous
    July 19, 2006
    C# semantics. I like things feeling consistent in a language.

  • Anonymous
    July 19, 2006
    Why not ask Chris Date for his opinion ... if 'someone' has been confronted in practice with this dilemna it must be him (with all due respect for the Microsoft gurus expertise ;-o, of course)

  • Anonymous
    July 20, 2006
    Linq is all about writing queries in C#. What is there to add ? I am writing in C#. So why should I expect a non-C# behavior ? And Linq is supposed to unify data access. If each Linq flavor starts to behave differently the beautiful promise is broken, and I can't switch my data source from a SQL DB to an XML file.

    This is what I want conceptually. In practice I'm a bit worried about things like joins, as a previous poster already mentionned. Unfortunately, no good solution comes to mind. Maybe the Linq compiler should be optimized to transform this join:
    C#: Person1.BossId == Person2.PersonId && Person1.BossId != null
    to simply
    SQL: Person1.BossId = Person2.PersonId

    if this optimization can be made, I think it's quite ok. It is natural for someone using C# to exclude the null explicitely.

    The fact that the SQL guys are going to be surprised doesn't bother me: if they want to code in C#, they should learn it first, right ?
    The C# guys being surprised by their own language is much more disturbing.

  • Anonymous
    July 20, 2006
    I don't think this is a matter of SQL versus C# semantics. Null means different things in C# and SQL. In SQL, null means unknown. You can't compare two unknown values against each other, and say they are equal.

  • Anonymous
    July 21, 2006
    The comment has been removed

  • Anonymous
    July 22, 2006
    My vote is for the behaviour to stay inline with SQL.
    As posted in multiple posts two unkown value cannot return true when checked for equality.

  • Anonymous
    July 22, 2006
    C# behavior for the win!

    Programmer has to care about null values himself (like we all already did in past before Linq).

    If you make Sql-favorable breach in C# today, tomorrow people will ask about default null dereferencing. Like Person.Father.Mother.BirthDay has to return null if Father or Mother references are not set.

    BTW, in my experience I have never ever thought about NULL is DB as unknown value. Unknown values mean you have a bad DB schema design. NULL for me is known ABSENCE of the value.

  • Anonymous
    July 22, 2006
    I understand the dilemma. I read through the comments and agreed at first with those desiring SQL ("behavior should follow what I target"). But I tend to agree with the school of thought that the whole idea of Linq was supposed to be working with engine-neutral data objects in the C# language, not SQL, despite SQL-like grammar being added to C# when accessing the data via Linq.

    Incidentally, SQL behaviors existed before C#; in my opinion, that makes them old and obsolete.

    However, the very specific example you bring up about comparing a against b while both are null is one that should be addressed differently than "C# behavior vs. SQL behavior"; there should be a syntactical differentiator (i.e. alternate operator) between the two so that both options are available, because I can see both comparisons being useful at different times. What syntactical differentiator? I don't know, you guys are designing the thing. You went through something fairly similar to this when comparing values vs. instances with Object.Equals().

  • Anonymous
    July 24, 2006
    If I write LINQ queries in C#, then I expect C# semantics.  

  • Anonymous
    July 25, 2006
    If ?== looks wrong in C# for the SQL unknown (null = null == true) comparison, then perhaps use ~= (the tilde looks like an "n").  So leave == alone, and let it do the default C# behavior.

    (null == null) > true
    (null ~= null) > false

  • Anonymous
    July 25, 2006
    The comment has been removed

  • Anonymous
    July 25, 2006
    Alexey wrote
    > If you make Sql-favorable breach in C# today, tomorrow people will ask about default null dereferencing.

    Interesting that you mentioned this. I should write about this in the context of LINQ join. Then it might not seem that far-fetched a thing. (No, we are not doing such a thing - just that there is a place where people look for such things.)

  • Anonymous
    July 29, 2006
    NULL value can potentially have two absolutely different meanings: "the value is unapplicable" and "the value is unknown". The "unknown" meaning was (unfortunately) selected when Codd introduce concept of NULL values in extended relational algebra, and this is the reason of three-valued logic in SQL.

    But in reality NULL values uses primarily in foreign keys and with meaning "the value is unapplicable". Because of this usefullnes of three-valued logic in SQL is somewhat questionable. For this reason it is better in LINQ to stay with the classic two-valued logic and treat null values as equal.

  • Anonymous
    July 29, 2006
    I don't care what semantics are used, as long as it's consistant with it's context (which here is C#).  I don't see how you could use SQL semantics therefore; as that would most existing apps.

    Therefore my stylistic preferece is pretty much moot.

    It would be extremely annoying to anyone implementing or extending their own LINQ-able collections if the semantics were to differ from the norm.

  • Anonymous
    August 01, 2006
    The expresion x ==y in C# is not actually checking to see if x and y are equal, if they are refrence types.

    If x and y are reference types the expression checks to see if x and y point to the same object. It does NOT check to see if the values in the object are equal.  Which is why if x and y are null the expression returns true, they both point to nothing.

    This discussion is silly however, because a Value type can never be null.

  • Anonymous
    August 01, 2006
    The comment has been removed

  • Anonymous
    August 01, 2006
    http://www.udaff.com/image/272/27286.jpg

  • Anonymous
    August 10, 2006
    i vote for the c# way, for the reason of abstraction.
    but not to have a limitation from ignoring the sql way, and not to change the c# syntax for one library, the solution i c is to add an extention method (helper method) to the library that we use for linq,so we will have

    if(a.DbEquals(b)) will be the sql way, and the good news is that it aplies to both c# and vb

    and of course the implementation of that method will check for null :)

    i beleive that the solution of a problem of a library should be in the library itself :) thanks to c#3 :)

    sadek.drobi@gmail.com

  • Anonymous
    August 10, 2006
    My vote goes for C# as well with the understanding the DLinq would recognize the pattern value != null && ... and translate it into the SQL way.  Otherwise it would be an awfully leaky abstraction.  If SQL syntax is required DLinq doesn't preclude writing T-SQL.

  • Anonymous
    August 14, 2006
    SQL Syntax should absolutely be followed:  as others have said: NULL means 'unknown' in SQL, and (unknown == unknown) yields unknown.

    C# is an inherently scalar language - we deal with single objects when we do comparisons (even when we're iterating over collections, we're doing repeated single-object comparisons) and we don't have operators that correlate one collection with another (which is, after all, what LINQ is adding to the language).

    SQL, on the other hand, deals with vectors (sets) - and comparisons are done on the basis of attribute values. Comparisons are used to determine inclusion in, or exclusion from, a set - or correlation between elements of related sets.

    In C# a null value means "I haven't allocated, or didn't find, that object" (or variations of that theme, like '... an object with that value of that attribute'); in SQL, a null value is an indication of 'unknown' for an attribute - NOT - an indication of 'not found' for an object (where 'object' correlates with 'element in the set', or 'row in the table' in the SQL world).

    So null in C# is in reality a different creature from null in SQL.

    If I truly want the set of rows from table T1 where attribute T1.X is unknown, joined with the set of rows from T2 where T2.X is also unknown, then I should deliberately request it ... and I'll get a cartesian product of all pairings of the relevant T1 rows (i.e. T1 rows where X is null) with the relevant T2 rows.

    Notice that that result is a far cry from a join of T1 to T2 where T1.X == T2.X and neither is null - I'll get an inner join (i.e. the join you'd expect - T1 rows are paired only with T2 rows for which the X values match, yielding a result set with distinct elements for each distinct value of X).

    These two results - the inner join and the cartesian join - are significantly different.

    If the "null == null" collection of results is claimed to be 'just another variant of "thisValue == thisValue"' (i.e. null is as valid and as concrete a value as is 'London') then null is not really being used as null (unknown), but is being (incorrectly and dangerously) used as just another discrete value in the set of values for the attribute - and should be implemented as such (perhaps with an "Unknown City" value).

  • Anonymous
    August 14, 2006
    SQL Syntax should absolutely be followed:  as others have said: NULL means 'unknown' in SQL, and (unknown == unknown) yields unknown.

    C# is an inherently scalar language - we deal with single objects when we do comparisons (even when we're iterating over collections, we're doing repeated single-object comparisons) and we don't have operators that correlate one collection with another (which is, after all, what LINQ is adding to the language).

    SQL, on the other hand, deals with vectors (sets) - and comparisons are done on the basis of attribute values. Comparisons are used to determine inclusion in, or exclusion from, a set - or correlation between elements of related sets.

    In C# a null value means "I haven't allocated, or didn't find, that object" (or variations of that theme, like '... an object with that value of that attribute'); in SQL, a null value is an indication of 'unknown' for an attribute - NOT - an indication of 'not found' for an object (where 'object' correlates with 'element in the set', or 'row in the table' in the SQL world).

    So null in C# is in reality a different creature from null in SQL.

    If I truly want the set of rows from table T1 where attribute T1.X is unknown, joined with the set of rows from T2 where T2.X is also unknown, then I should deliberately request it ... and I'll get a cartesian product of all pairings of the relevant T1 rows (i.e. T1 rows where X is null) with the relevant T2 rows.

    Notice that that result is a far cry from a join of T1 to T2 where T1.X == T2.X and neither is null - I'll get an inner join (i.e. the join you'd expect - T1 rows are paired only with T2 rows for which the X values match, yielding a result set with distinct elements for each distinct value of X).

    These two results - the inner join and the cartesian join - are significantly different.

    If the "null == null" collection of results is claimed to be 'just another variant of "thisValue == thisValue"' (i.e. null is as valid and as concrete a value as is 'London') then null is not really being used as null (unknown), but is being (incorrectly and dangerously) used as just another discrete value in the set of values for the attribute - and should be implemented as such (perhaps with an "Unknown City" value).

  • Anonymous
    August 15, 2006
    I think when I write C#, as I do in DLinq, I expect C#.
    I admittedly had problems with the SQL rules, but found spots where  they shine as well as where they are just in the way.
    Whether unknown or not allocated, it is all about 'no thing here'.
    IMHO problems arise from implicitly assuming "not equal is unequal", which is not applicable in face of the unknown.
    I would like to vote for a more diverse set of operators reflecting different expectations for null-behaviour.

    BTW, I don't expect a rush of half-baked SQL-to-DLinq translations just for having it.

  • Anonymous
    September 14, 2006
    This week you can watch the first in a series of videos featuring members of the Microsoft C# team. A video of Raj Pai, the Group Program Manager for the C# team, leads off the series.

  • Anonymous
    November 07, 2006
    At the conceptual level, where LINQ expression trees are at, we shouldn't strive to match the precise behavior of programming language. Keep in mind, expression trees are shared among many different programming languages, which have different equality behavior from C#--VB for example. VB treats nulls differently from C#--more like SQL, yet emits the same expression tree for =. As for language consistency, I think that we have a precedent where = can behave differently depending on the operands, due to operator overloading. Value types also behave differently from reference types, and all of SQLs types are value types anyway. We should use data-dependent semantics. The behavior of = really depends on the meaning of nulls in the underlying data. Nulls are essentially overloaded. If we go against the underlying assumptions of the data, then most queries to the database involving nulls will require a complicated syntax. The default behavior would not be the desirable one, which could be a major problem if IQueryable was being used as pluggable processor. One can say that to use IQueryable as a pluggable processor, we need precise behavior, but I think we need behavior appropriate for the data, since nulls in C# and SQL really refers to 2 different concepts. With memory objects, null signifies non-existent values, whereas in SQL they refer to unknown values, so it would be more appropriate to use SQL. If precise behavior is necessary, having a separate function for DBEqual and EqualIncludingNulls for both usages is better.

  • Anonymous
    November 10, 2006
    I think you are asking the wrong questions.  The real question is do you want to support the expected PL/SQL behavior when using Oracle as the datastore?  What if I use db2 as the datastore? Do I get db2 style behavior in LINQ?  Don't forget to give me c# style expressions when running queries against collections and arraylists in memory.

  • Anonymous
    December 13, 2006
    Well, SQLs way is usefull for handling large sets (try joining on a column for which half of the values are null and you will receive a VERY large result). So why not give the user best of both worlds? Let the == stay the C# way and implement (let's say) =? be the new "is mathmatically equal" operator.

  • Anonymous
    December 13, 2006
    Well, should have read more of the comments. If there is no way to add the syntactic sugar (which it is not with respect to lambda expressions and translation to SQL!), you are stuck with: C#s way and doing things like where(a.x == b.x && a.x != null) which would translate to where ((a.x = b.x OR (a.x is null and b.x is null)) AND a.x is not null for which you may hope that the SQL processor is capable of reducing this back to a.x = b.x (to use some indexes!). or using SQLs way and setting up a complete breach of logic inside the language. So you just go ahead and choose, both choices are lousy.

  • Anonymous
    February 09, 2007
    LINQ by its definition is language query, so it should have a behaviour consistent with the language... So I, as most comments, agree with C# behaviour. If it is "embedded SQL", even in behaviour, then LINQ loses its purpose. The case for SQL syntax, handling large sets, it is precisely some that LINQ hasn't solved yet; so maybe the special sugar or different semantics should apply, once the large set/batch solution is on LINQ. Have you looked the possibilities of adapting language constructions like the ones available xBase (Clipper and dBase), to C# and VB syntax? That language was procedural, still very clear for database operations; it was not SQL, neither similar, but easily mapped the equivalent operations. Something like that would be a better match for keeping semantics without making LINQ more complex than needed.

  • Anonymous
    March 20, 2007
    The comment has been removed

  • Anonymous
    March 30, 2007
    I prefer the Criteria used in NHibernate with the NHibernate.Expressions This is not all new by the way, perhaps you should learn NHibernate and then you can be informed.  NHibernate provides options, which is good, rather than force everyone into the same query language, ie. Criteria, HQL, SQL, LINQ, etc... Just some advice

  • Anonymous
    May 21, 2007
    The comment has been removed

  • Anonymous
    June 10, 2007
    PingBack from http://blog.lab49.com/?p=1115

  • Anonymous
    June 10, 2007
    I think you need to introduce 'NULL' to the c# language, and have it be completely different from 'null'. Of course, then youd have to decide whether NULL == null.

  • Anonymous
    September 25, 2007
    Cool!

  • Anonymous
    September 25, 2007
    Cool.

  • Anonymous
    September 26, 2007
    interesting

  • Anonymous
    June 17, 2009
    PingBack from http://patioumbrellasource.info/story.php?id=2234