SQL Server: Work Around Inability to get Batch Processing with IN and EXISTS

Using IN and EXISTS in queries can some timetimes prevent the use of batch processing in SQL Server 2012, and result in less speedup than expected when using columnstore indexes. For example, dbo.Media is a dimension table with around 6,000 rows, and dbo.Purchase is a fact table with around 101 million rows. The dbo.Purchase table has a columnstore index on it. Suppose there is a table dbo.MediaStudyGroup that contains a subset of the Media table that a business analyst has created to support a particular study. We'll simulate creation of the study group with the following statement:

select top  1000 * 
into MediaStudyGroup 
from dbo.Media;

Now, consider the following two equivalent queries:

select p.Date, count(*)  
from dbo.Purchase p 
where p.MediaId in (select MediaId from dbo.MediaStudyGroup) 
group by  p.Date
order by  p.Date; 
 
select p.Date, count(*)  
from dbo.Purchase p 
where exists (select m.MediaId from dbo.MediaStudyGroup m 
                     where m.MediaId = p.MediaId) 
group by  p.Date
order by  p.Date;

 

Both of these queries take about five seconds to run on a 4 core processor. The query plans for them scan the columnstore index on dbo.Purchase but they don't use batch processing. It turns out that you can rewrite the above two queries using a regular (inner) join as follows:

select p.Date, count(*)  
from dbo.Purchase p, dbo.MediaStudyGroup m 
where p.MediaId = m.MediaId 
group by  p.Date
order by  p.Date;

 

 

Most of the work of this query runs in batch mode, and it takes only a tenth of a second to run.

The plan for the first two queries looks like the following (click to expand):

Notice that the hash join operator is not a batch mode operator. In fact, none of the operators run in batch mode in this plan.

The plan for the third query (with the inner join replacing IN and EXISTS), is as follows:

Notice the Batch Hash Table Build operator. This is an indication that hash join and aggregation is running fully in batch mode. You can also hover the mouse over the operators in Management Studio's showplan view, and the tooltip will show that the estimated execution mode is Batch.

The key idea here is to use regular (inner) join instead of IN and EXISTS in your queries when possible to get the best performance with columnstores.

Return to main columnstore performance tuning page.