Conor vs. Join Algorithms
I got a question from a customer today related to join algorithms. This is a good, general topic, so I will post a few notes on this in the hopes that it is interesting to anyone out there who doesn’t understand how the costing formulas are put together in SQL Server.
(His actual topic is more complex, but he asked about join algorithms :)
SQL Server has three major join algorithms (loop join, hash join, and merge join). Each of these has certain advantages and disadvantages. Here’s the cheat-sheet:
LOOP JOIN: Loop join is very cheap when only producing a small number of rows. So, if you are getting 1 row, this is often the cheapest option. Also, the algorithm can be run in parallel queries fairly easily by distributing the outer rows across different cores and running each one separately. Note that if you do a loop join with an index lookup on the inner side, that index lookup is essentially a random IO, and random IOs are more expensive than sequential IOs.
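For example, here is a minimal sketch (made-up temp tables and data, not anything from the customer's scenario) that forces a nested loops plan with the LOOP JOIN hint so you can see the index seek on the inner side:

-- Placeholder tables and data, just for illustration.
CREATE TABLE #Customers (CustomerID INT PRIMARY KEY, Name NVARCHAR(50));
CREATE TABLE #Orders    (OrderID INT PRIMARY KEY, CustomerID INT);

INSERT INTO #Customers (CustomerID, Name)
SELECT n, N'Customer ' + CAST(n AS NVARCHAR(10))
FROM (SELECT TOP (10000)
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b) AS nums;

INSERT INTO #Orders (OrderID, CustomerID)
SELECT CustomerID, CustomerID FROM #Customers;   -- one order per customer keeps it simple

-- Force a nested loops join.  With the clustered index on #Customers.CustomerID,
-- each outer row from #Orders drives an index seek on the inner side (a random IO
-- the first time that page is touched).
SELECT o.OrderID, c.Name
FROM #Orders AS o
INNER JOIN #Customers AS c ON o.CustomerID = c.CustomerID
WHERE o.OrderID <= 5                 -- only a few rows, so loop join is a natural fit
OPTION (LOOP JOIN);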
MERGE JOIN: (In the academic literature this is sometimes called sort-merge.) Merge join works well on sorted input from both sides and does a merge, just as you would expect. In practical terms, these days it is probably used more when the input is already sorted on both sides (with indexes to give you the sort order), since performing explicit sorts is not obviously better than building a hash table (see below). Merge join requires a total order on the input, and it is not as easy to parallelize this algorithm in a manner that scales well.
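Continuing with the same placeholder tables, this sketch adds an index on the join column so both inputs arrive already ordered, then forces a merge join; the plan should not need an explicit Sort operator:

-- Same placeholder tables as the loop join sketch above.
-- #Customers is already ordered on CustomerID by its clustered primary key;
-- this index gives the #Orders side the same order on the join column.
CREATE INDEX IX_Orders_CustomerID ON #Orders (CustomerID);

SELECT o.OrderID, c.Name
FROM #Orders AS o
INNER JOIN #Customers AS c ON o.CustomerID = c.CustomerID
OPTION (MERGE JOIN);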
HASH JOIN: Hash join has the lowest per-row cost of the three algorithms but a non-zero setup cost. It builds a hash table from the left-hand side (build) input and then uses each row from the right-hand side (probe) input to probe into the hash table to do the join. When you have memory to spare, this algorithm is your friend.
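And the same query one more time, forced to a hash join with a hint (same placeholder tables as above):

-- The optimizer builds a hash table from one input (the build side) and then
-- probes it with each row of the other input (the probe side).
SELECT o.OrderID, c.Name
FROM #Orders AS o
INNER JOIN #Customers AS c ON o.CustomerID = c.CustomerID
OPTION (HASH JOIN);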
The question from the customer was why there is a tipping point, when selecting 67 vs. 68 rows (in their example), where the plan selection went from loop join to hash join. The reason has to do with the higher startup cost of the hash join (from building the hash table) and the higher per-row cost of the nested loop join. The inner side of a simple nested loop join will be either a scan or a seek. In the case of a scan, you are doing sequential IO over some number of pages (at least for the first iteration, after which they will perhaps be in the buffer pool). In the case of an index seek, you are doing random IO: on traditional rotating media, the disk head has to move and the disk has to spin around until it finds the right sector. That random IO is relatively expensive compared to a sequential IO (no moving the disk head or waiting for the disk to spin around to where you want it to be). The hash join only has to scan that inner-side input sequentially once. So, past some row count that quickly becomes cheaper than doing many random IOs, and it also beats the CPU cost of reading the same pages from the buffer pool over and over.
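If you want to see a tipping point like the one the customer hit, drop the hints and let the optimizer choose while you vary how many rows the predicate qualifies (still using the placeholder tables from the sketches above). The exact row count where the plan flips depends on your schema, statistics, and row sizes, so 67 vs. 68 is specific to their case:

-- No join hints this time: compare the plans (e.g. "Include Actual Execution Plan"
-- in Management Studio, or SET STATISTICS IO ON) as the predicate qualifies more rows.
SET STATISTICS IO ON;

SELECT o.OrderID, c.Name
FROM #Orders AS o
INNER JOIN #Customers AS c ON o.CustomerID = c.CustomerID
WHERE o.OrderID <= 50;       -- few rows: expect nested loops with seeks on #Customers

SELECT o.OrderID, c.Name
FROM #Orders AS o
INNER JOIN #Customers AS c ON o.CustomerID = c.CustomerID
WHERE o.OrderID <= 5000;     -- many rows: expect a hash (or merge) join instead;
                             -- somewhere in between, the estimated costs cross over

SET STATISTICS IO OFF;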
So, the costing models for hash join will cause it to be selected in cases when the expected row cardinality is high enough to make building that hash table a “good deal”. This is often the case in queries that process a lot of rows.
Happy Querying!
Conor
Comments
Anonymous
July 07, 2010
Never mind my previous question. It was explained to me that the random IO for index seeks comes in when an index seek is required per row that you're joining to... I am still interested to know whether the performance characteristics of the IO subsystem are considered (or will be considered in future releases) when choosing a join type.
Anonymous
July 24, 2010
(back from vacation) Basically no, we purposefully do not model the specific characteristics of your IO subsystem in the cost model today. This makes the plan selection more consistent across different kinds of machines. So, if you were to deploy an application from one machine (where you tested the plans) to another, you would generally want the plans to be similar as long as the basic characteristics of the machine were similar. Obviously, there are things we have considered for future releases to allow for greater control over this class of detail. In general, however, we rarely see it to be significant in terms of plan choice in the context of overall performance of an application workload.