Columnstore Index- Should I partition my columnstore Index?

Table partitioning is a perfect way to manage large tables especially in the context DataWarehouse (DW) which can be very large (think TBs). The table partitioning can help both in managing large amount of dara as well as improving the query performance by eliminating partitions that are not required.  For example, if a FACT table stores SALES information for an organization, a common query pattern will be to look at SALES data from the last week or last month or last quarter and so on. In this case, partitioning on say weekly boundary could be helpful as SQL Server will eliminate other partitions during query execution.

The interesting thing with columnstore index is that it is implicitly partitioned by rowgroups, an internal physical organization. Assume that incoming data is in date/time order, then each compressed rowgroups are indirectly ordered by date/time. The columnstore index maintains Min/Max for most columns within a rowgroup. For the case here, the columnstore exactly knows the date/time range for all the rows in each rowgroup. Now, if you run a query to look at the sales data from the last quarter, SQL Server can find those rows efficiently by looking at the metadata for each rowgroup and eliminating the ones that are out of range. This begs the question if you should even consider partitioning a table with CCI as it is already implicitly partitioned. The short answer is 'Yes' as illustrated below

  • If there are significant number of updates to compressed data, the implicit date/time ordering will begin to dilute as updated rows will be inserted into the delta store. For example, if you update a row from previous month, it will get deleted there and then inserted along with the latest inserts. So by partitioning the data, you can guarantee that all the data for a given week or month or quarter is in its respective partition.
  • If you query the date/time data along with some other attribute. For example, if you have three product lines (P1, P2 and P3) and you are capturing SALES data based on date/time order for all three product lines. Now if you want to query the SALES of last quarter for product line P1, SQL Server will need to process/eliminate the data of P2 and P3. Normally, this is not an issue because columnstore can process the data really fast. However, if you very large number of rows for each product line, and your normal query pattern is to run analytics for a specific product line, then partitioning your columnstore index on Product Line can be beneficial
  • Index maintenance can be targeted at partition level. Similarly, other benefits like marking FileGroups containing older partitions as Read Only allows you to control the size of your incremental backups.
  • You can compress older partitions with COLUMNSTORE_ARCHIVE option that gives you an additional 30% in storage savings at the cost of slower query performance which may be ok as older partitions may not be queries as often.

Thanks

Sunil Agarwal

Comments

  • Anonymous
    October 07, 2016
    Great points!! If you add some real time example with data would be good point for discussion.
  • Anonymous
    October 07, 2016
    Prashanth, How is this 'You can imagine partitioning by 'counties' in USA and you are storing, for example, data for each house. There are total of 3144 counties in USA. You keep the data such as number of bedrooms, square feet, taxes paid and so on. If you assume that most queries will be searching within a county. Now, yearly, as taxes on the property changes, you will need to update each row. In this case, partitioning with CCI makes sense'
  • Anonymous
    October 08, 2016
    We recently tried to use partitioning to speed things up, but in QA testing it actually slowed things down. So I disagree that you should do this for performance reasons. There is almost no scenario where a good index isn't working just as well, and partitioning has a real drawback for any other index. If you have index fragmentation problems you can recreate the index, unless the data is so massive this is no longer possible in a normal timeframe. In that case you have no alternatives. The main issue with partitioning is that your other indexes are now also partitioned. If you have a partition key on say, order date, and another index on customer code, then that index will be split up in as many indexes as you have partitions. If you have to search that particular index the database now has to search ALL of them. If you have 3000 partitions, this means searching 3000 indexes. My take on this: partitioning is great for making administration easier through reducing backup size and compression. It's also good if you have data deletion requirements where you want (or have) to delete data after a certain period of time. But if you want to use this to improve speed you'd better test it real good because in a lot of scenarios it will actually slow things down.
    • Anonymous
      November 01, 2016
      Ronald, your point is correct. you can only improve query performance with partitioning if most partitions can be eliminated,