Clustered Columnstore Index: Data Load Optimizations - Minimal Logging
When bulk importing a large number of rows, you want to make sure it can be done efficiently with minimal resource usage so that you can finish the data load quickly and start running your analytics. SQL Server provides the following three key optimizations during bulk import:
- Minimal logging
- Reduced overhead of locks
- Concurrent Inserts
Just like regular tables, where data is physically stored as rows, these optimizations are available in tables with columnstore indexes as well, but there are some differences. This series of blog posts explains each of these optimizations in the context of tables with columnstore indexes.
Minimal Logging
Contrary to common misconception, minimal logging does not mean ‘no logging’. Minimal logging logs the allocation of pages and the changes to the system tables, but not the data itself. For example, if you are bulk importing 10 million rows, only the page allocations are logged, not the actual data rows. You may wonder why SQL Server could not eliminate logging altogether, since you could simply ‘reload’ the data. The issue is that if the data load fails, SQL Server needs to roll back the transaction and would have no way of knowing which pages had been allocated. At this point, hopefully I have convinced you that minimal logging is good for minimizing any logging-related bottleneck. Let us now look at minimal logging for tables with a clustered columnstore index (CCI).
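One prerequisite, the same as for rowstore tables, is the recovery model: the database needs to be in the simple or bulk-logged recovery model. The examples below assume a database named ‘columnstore’ (the name used in the bcp command further down) in simple recovery mode; setting that up is just something like:
-- put the demo database in simple recovery mode (required for minimal logging)
ALTER DATABASE columnstore SET RECOVERY SIMPLE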
When you import data into a CCI, the data is loaded either into a delta rowgroup or a compressed rowgroup based on the rule: if the batchsize is < 102400, the rows are loaded into a delta rowgroup; otherwise they are loaded directly into a compressed rowgroup. Minimal logging is only supported when data is loaded directly into a compressed rowgroup. This means that to get minimal logging with a CCI, you must use a batchsize >= 102400.
Here are a couple of examples:
-- create the table
create table t_bulkload (
accountkey int not null,
accountdescription char (500),
accounttype char(500),
AccountCodeAlternatekey int)
-- Let us prepare the data
-- insert 110K rows into a regular table
begin tran
declare @i int = 0
while (@i < 110000)
begin
insert into t_bulkload values (@i, 'description', 'dummy-accounttype', @i*2)
set @i = @i + 1
end
commit
-- bcp out the data... run the following in command window
bcp columnstore..t_bulkload out c:\temp\t_bulkoad-1.dat -c -T
--truncate the table
truncate table t_bulkload
-- create clustered columnstore index
CREATE CLUSTERED COLUMNSTORE index t_bulkload_cci on t_bulkload
-- clean up the transaction log (note, the database is
-- simple recovery mode)
Checkpoint
-- now bulkload the data
begin tran
-- this loads the data in 1 batch
bulk insert t_bulkload
FROM 'c:\temp\t_bulkoad-1.dat'
-- show the log records generated
-- note, I am using ‘log record length 1104’ because I know the length of the
-- insert log record.
select count(*)
from sys.fn_dblog(NULL, NULL) as t_log
where allocunitname like '%t_bulkload%' and operation like '%LOP_INSERT_ROWS%' and [Log Record Length] = 1104
You will see that it returns 0, as no data rows were logged. Abort the transaction and run the checkpoint command to clean up the log records. Now let us run the same command but with a batchsize of 10000. Recall that this will load the rows into a delta rowgroup.
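For reference, the cleanup step mentioned above boils down to something like this, run before starting the second load:
-- abort the first load and clear out the log records (simple recovery mode)
rollback tran
checkpoint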
begin tran
-- Now use a smaller batchsize of 10000
-- I am using even TABLOCK
bulk insert t_bulkload
FROM 'c:\temp\t_bulkoad-1.dat'
WITH
(
BATCHSIZE = 10000
,TABLOCK
)
-- show the log records generated
select count(*)
from sys.fn_dblog(NULL, NULL) as t_log
where allocunitname like '%t_bulkload%' and operation like '%LOP_INSERT_ROWS%' and [Log Record Length] = 1104
You will see that the count is 110000, exactly the number of rows we inserted. In other words, the data rows are fully logged (i.e. no minimal logging). So why did we not get minimal logging with the delta rowgroup? There are two reasons. First, the delta rowgroup is organized as a btree, not a heap. Second, we expect a large number of rows to be loaded into a CCI, so we chose to optimize the data load directly into compressed rowgroups. One additional advantage of loading data directly into a compressed rowgroup is that it eliminates the extra step of loading data into a delta rowgroup and then migrating that data into compressed rowgroups.
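If you want to confirm where the rows actually landed, one option (a sketch, assuming a version where the sys.column_store_row_groups catalog view is available, i.e. SQL Server 2014 or later) is to inspect the rowgroup states while the load transaction is still open: after the first test you should see COMPRESSED rowgroups, while after the second test the rows sit in OPEN delta rowgroups.
-- sketch: inspect the rowgroups of t_bulkload
select row_group_id, state_description, total_rows
from sys.column_store_row_groups
where object_id = object_id('t_bulkload')
order by row_group_id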
There you have it; our recommendation is to choose a batchsize >= 102400 to get the benefits of minimal logging with a clustered columnstore index (a recap of such a load is sketched below). In the next blog, I will discuss parallel bulk import and locking optimizations.
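As a recap, a minimally logged load of the sample file, with an explicit batch size at the cutoff, would look something like this (a sketch reusing the file produced by the bcp step above):
-- sketch: bulk load with a batchsize at or above the 102400-row cutoff
bulk insert t_bulkload
FROM 'c:\temp\t_bulkoad-1.dat'
WITH
(
BATCHSIZE = 102400
,TABLOCK
)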
For data load rates into a columnstore index, please refer to https://www.microsoft.com/en-us/research/publication/enhancements-to-sql-server-column-stores/: "Bulk load rates for clustered column store have been measured at about 600GB/hour on a 16 core machine, using 16 concurrent bulk load jobs (one per core) targeting the same table. We did a trickle load test on a single thread whereby we inserted 3.93 million rows, one at a time in independent transactions, into an empty column store index. This was done on a machine with 4 cores and 8 hardware threads. The test was implemented with a cursor reading from a source table and inserting one row at a time into the target. The data was drawn from the Purchase table mentioned earlier. The test took 22 minutes and 16 seconds. The insertion rate was 2,944 rows/second."
Thanks
Sunil Agarwal
Comments
- Anonymous
July 19, 2016
Hi, does the database need to be in bulk-logged or simple recovery mode in order for minimal logging to materialize with columnstore indexes? And also, will this work with an ordinary INSERT, assuming the number of rows affected is > 102400? Thanks!!
- Anonymous
July 20, 2016
Yes, you will need the database in bulk-logged or simple recovery mode; those rules are the same as for rowstore. No minimal logging on an ordinary insert. Do you mean you are inserting one row or a small set of rows at a time, but for the whole transaction you are inserting > 100K rows?
- Anonymous
July 21, 2016
Thank you very much for your reply! I meant when you insert > 100K rows into a columnstore via a single INSERT statement (as in, for example, INSERT...SELECT).
- Anonymous
July 26, 2016
Pete, yes, if the 'insert..select' is inserting > 102400 rows, you will get minimal logging with bulk-logged/simple recovery.