Conor vs. Anti-Semi-Join Reordering
I was asked to comment on a post about the order of WHERE NOT EXISTS (<subquery>) in a query and its impact on a query plan.
Specifically this thread:
https://social.msdn.microsoft.com/Forums/en/transactsql/thread/278c272b-5ebb-4fda-8985-49927bbd3799
Broadly, SQL Server does attempt to make it so that the order in which you write a query does not matter in terms of plan choice. For many of the common cases, we have done this work. I was reading through the source code last night and I don’t believe we do relative reordering of WHERE NOT EXISTS subqueries (which in the academic literature is called an anti-semi-join) in the same scope.
In the specific customer case, the customer actually did WHERE NOT EXISTS (SELECT TOP 1 … <rest of subquery>).
I will specifically recommend _against_ doing this as a general practice. The Optimizer can do all sorts of magic to transform semi-joins to joins and joins to semi-joins, and there are reorderings that are blocked if you do the TOP 1 (since it is not a relational operator). The SQL Optimizer is smart enough to know not to retrieve more than one row for any WHERE EXISTS or WHERE NOT EXISTS subquery, so you should not consider it necessary or wise to add TOP 1 for this case. (Perhaps there are cases when it is appropriate, but I don’t know of any general cases that could not be solved by using other hints).
As I have posted before, there are some cases where SQL does not do every possible rule transformation because some of them are just not common enough.
Also, remember that we do add rules each release, so don’t assume that we never do these rewrites. We like adding transformation rules in the engine :).
Happy Querying!
Conor
Comments
Anonymous
May 14, 2010
Hi Conor, Thanks for the post! I have a few followup questions though. In this post, your wording makes it seem like the TOP operator is the culprit behind the lack or join optimization; however, at the bottom of the MSDN thread Brad Schulz provides an example that does not use the top operator. So is it fair to say that the optimizer does not optimize ANTI-SEMI predicates at all, regardless of the top operator? Additionally this behavior is not limited to ANTI-SEMI joins, as this also occurs with correlated subqueries, in the predicate. Are these not being optimized too? Here is a sample query to illustrate (with the sample data from the MSDN thread) select * from #Customers c WHERE (select MAX([CustomerID]) from #CustHugeRowSize where CustomerID=c.CustomerID) IS NULL AND (select MAX([CustomerID]) from #CustTeenyRowSize where CustomerID=c.CustomerID) IS NULL select * from #Customers c WHERE (select MAX([CustomerID]) from #CustTeenyRowSize where CustomerID=c.CustomerID) IS NULL AND (select MAX([CustomerID]) from #CustHugeRowSize where CustomerID=c.CustomerID) IS NULLAnonymous
May 14, 2010
The comment has been removedAnonymous
May 15, 2010
IMHO this is a an example where cost is not something you can compare. It appears this issue is that the MERGE JOIN operator does not pass on order after an OUTER join operation, and so the optimiser fixes one merge operator and then hashes the others. So it probably finds a good one and doesn't look at others. Remember the optimiser finds a good enough plan and not the best plan. Also this example is a bit contrived as you are going a cross product of four tables. Each join is going to introduce more rows, and so doesn't make any sense. Cross product queries are often difficult to optimise.Anonymous
July 25, 2010
The optimizer does actually have the ability to compare different alternatives for subtrees with different physical properties (ex: sortedness). It can find the local minimum plan and it can also find the global minimum plan for all plans it considers. Note, however, that the optimizer has logic to not generate all alternatives to improve performance. These are somewhat separate concepts.