Limited Statistics Granularity

To set up this scenario, run the script below:

USE tempdb
GO
IF OBJECT_ID ('test1') IS NOT NULL DROP TABLE test1
GO
CREATE TABLE test1 (c1 tinyint, c2 smallint)
DECLARE @x int
DECLARE @msg varchar(1000)
SET @x = 1
SET NOCOUNT ON
BEGIN TRAN
WHILE (@x <= 1000000)
BEGIN
INSERT INTO test1 (c1, c2) VALUES (@x % 255, CASE WHEN @x % 1000 = 500 THEN 1000 ELSE @x % 1000 END)
IF @x % 5000 = 0
BEGIN
COMMIT TRAN
BEGIN TRAN
SET @msg = 'Inserted ' + CONVERT(varchar(20), @x) + ' rows ...'
RAISERROR (@msg, 0, 1) WITH NOWAIT
END
SET @x = @x + 1
END
WHILE (@@TRANCOUNT > 0) COMMIT TRAN
SET NOCOUNT OFF
GO
CREATE INDEX idx1 ON test1 (c2)
UPDATE STATISTICS test1 WITH FULLSCAN
DBCC SHRINKDATABASE (db2)
-- DBCC SHOW_STATISTICS ('test1', 'idx1')

GO

The hypothetical scenario here is that your users complain that the first of the two queries below runs very slowly. You find that if you force SQL to use index [idx1] with an index hint, the query executes much more quickly. But why should you change your app to compensate for what seems to be a SQL Server bug? :)

USE db2
GO
SET STATISTICS PROFILE ON
SET STATISTICS TIME ON
-- Query #1 (slow)
SELECT c1, c2 FROM test1 WHERE c2 = 500

-- Query #2 (fast)
SELECT c1, c2 FROM test1 WITH (INDEX = idx1) WHERE c2 = 500

SET STATISTICS TIME OFF
SET STATISTICS PROFILE OFF

GO

 

To me, this suggests two questions: 1) Why isn't the fast plan chosen by default? 2) Are there any possible solutions that don't require modifying code to supply an index hint?

 

Despite appearances, this actually isn't a bug. Here's the explanation:

 

This table demonstrates uneven data distribution. The [c2] column holds values ranging from 0 to 999. There are exactly 1000 rows with the same [c2] value for each of the integers between 0 and 999. However, the value 500 is an anomaly; there are 0 rows with this value. Run the following query to show the number of rows with each [c2] value and you'll see that there are none with [c2]=500:

SELECT c2, COUNT(*) AS cnt FROM test1 GROUP BY c2

 c2 cnt
------ -----------
...
498 1000
499 1000
501 1000
502 1000
...

When the server builds a histogram for the statistics on a column, it attempts to intelligently select range endpoints so that a given step of the histogram represents a range of values with similar density. If you run DBCC SHOW_STATISTICS on the index you can see the histogram: 

 

 -- DBCC SHOW_STATISTICS ('test1', 'idx1')

 RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------ ---------- ------- ------------------- --------
0 0.0 1000.0 0 0.0
499 498000.0 1000.0 498 1000.0
503 2000.0 1000.0 2 1000.0
1000 496000.0 1000.0 496 1000.0

 

RANGE_HI_KEY is the value that sets the upper bound for the range of values represented in each histogram step. The lower bound is RANGE_HI_KEY+1 for the preceding step. For example, consider this row:

 499 498000.0 1000.0 498 1000.0

 

This indicates that there are 498000 rows (RANGE_ROWS) with a value between 1 and 498, inclusive. There are 1000 rows with the exact value of 499 (EQ_ROWS). AVG_RANGE_ROWS tells us that the typical value that falls within the range 1-498 shows up in 1000 rows. Now consider the next step in the histogram:

 

 RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------ ---------- ------- ------------------- --------
503 2000.0 1000.0 2 1000.0

 

This indicates that there are 2000 rows (RANGE_ROWS) with a value between 500 and 502, inclusive; and there are 1000 rows (EQ_ROWS) with the exact value of 503. The typical c2 value in the range 500-502 shows up average of 1000 times in the table. Recall what we know about the actual number of rows in this range:

 c2 cnt
------ -----------
...
499 1000
501 1000
502 1000
503 1000
...

This histogram step describes its range of c2 values exactly. What the histogram doesn't tell us, however, is exactly how many times a particular value in the 500-502 range appears. Remember that each histogram step only summarizes the values within a given range; it doesn't tell you exactly how many of each particular value there are (that is, unless the table's domain is small enough for every distinct value to have its own step in the histogram). SQL knows it needs to search for rows where [c2]=500, so it locates this step in the statistics' histogram, finds that the typical value in this range shows up in 1000 rows, and uses this as its rowcount estimate.

 

This is a key point: if a value being searched for falls in the middle of a histogram step, the AVG_RANGE_ROWS for that step is used to estimate the number of matches that will be found. SQL is always optimistic and assumes that the value it is searching for is actually one of the range rows. In this case that assumption is incorrect. The QP ends up overestimating the number of rows that will be returned and choosing a table scan when an index seek would have been more efficient.

 

So if this isn't a bug, what could you do to fix the problem? One solution would be to make the nonclustered index a covering index. Alternately, you could make [idx1] a clustered index. And, if you're using SQL 2005, you could use a plan guide with a USE PLAN hint to force the fast plan without changing the query text. All of these solutions should cause SQL to select the fast index seek plan. The bad cardinality estimate will still be there, but the perf problem won't.

 

(Update: A blog post from Jack Li describing a similar problem can be found here.)

Comments

  • Anonymous
    April 10, 2007
    Bart,Thanks for the article however there is a question:Even if the QP ends up overestimating the # of rows in this case, why is there a reason it should go for a table sacn anyways. I see this is a large table and if there exists a column index (which it does in this case) shouldn't it use it anyways? I could understand it choosing a table scan if the resultset was going to be considerably huge as compared the the table.Rgds.
  • Anonymous
    April 12, 2007
    The comment has been removed
  • Anonymous
    May 31, 2007
    Bart,I struggle with queries on tables similar what you described here. Therefore I found this blog quite interesting. I also read your chapter in Ken Hendersons "Practical Troubleshooting" book.From that article I learned that for stored procedures you can avoid parameter sniffing by using local variables. You get an query plan which only takes into account the column density but not the parameter values.This kind of 'stable' plan would be the solution to the kind of query issues I have. But my application doesn't use stored procedures. Is there any other way getting this kind of plan with other programming technique like sp_executesql, prepared queries, or ad hoc queries?A typical example where I would like to use a 'stable' plan are "ascending columns", where you often find values in the table which are not yet in the statistics.For these values the optimizer will choose the wrong access path. This means using a RECOMPILE hint will not help in this case. A OPTIMIZE FOR hint would be fine, but it requires that I know a 'good' value at the time of development (which I don't). I would need something like a OPTIMIZE FOR ANY hint telling the optimizer that I need a 'stable' plan.Rgds.
  • Anonymous
    June 01, 2007
    The comment has been removed
  • Anonymous
    June 11, 2007
    The comment has been removed
  • Anonymous
    June 12, 2007
    The comment has been removed
  • Anonymous
    June 12, 2007
    The comment has been removed
  • Anonymous
    June 12, 2007
    Yep, you're right about the relative costs -- the first query is a bit cheaper than the second.   There is a little bit of CPU overhead for each row that has to be shuttled between two operators.  (That cost is the very reason that the Seek operators were enhanced to be able to perform a residual WHERE as a secondary operation following the seek.)  My point was just that the plans are more alike than they are different; both, for example, do a seek that returns about 4500 rows, then both apply a filter to those 4500 rows to filter the final resultset down to 2 rows.  The SET STATISTICS PROFILE ON output for the first query could be a bit misleading because it says that the seek returned 2 rows.  In a sense, that isn't really true -- it masks the fact that the seek only filtered the resultset down to 4500 rows, just like the second query, and required a filter as a subsequent operation to trim the resultset down to 2 rows.  Thanks,Bart
  • Anonymous
    October 23, 2010
    Bart ,What does Distinct_range_rows means ...therefore what would this meanHistogram Steps        RANGE_HI_KEY      RANGE_ROWS     EQ_ROWS   DISTINCT_RANGE_ROWS   AVG_RANGE_ROWS1                                            0                                 1                           0                                                1              78084                          78081                                1                  78081                                                1              78085                                   0                                1                           0                                                 1              
  • Anonymous
    October 25, 2010
    A statistics histogram summarizes the data within a single column.  DISTINCT_RANGE_ROWS is the number of distinct values (note: values, not rows) within a particular histogram step.  (The count excludes the upper-most value of the range b/c the presence or absence of a row with the range's upper-most value is given explicitly by EQ_ROWS).  Given your example:  RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS --------------------------------------------------------------------------------------------- 1             0           1        0                    1 78084         78081       1        78081                1 78085         0           1        0                    1Taking each row in turn, starting with the first row, this means: 1. The range of column values from -infinity to 1 has no rows with a column value less than 1, and exactly one row with a column value equal to 1.  2. The range of column values from 2 to 78084 has 78081 rows (from RANGE_ROWS) with column values less than the value 78084.  This set of 78081 rows contains 78081 distinct column values (from DISTINCT_RANGE_ROWS).  Note that RANGE_ROWS == DISTINCT_RANGE_ROWS, which indicates that the row values are unique.  There is exactly one row equal to the range's HI_KEY column value of 78084.  3. The range from 78085 to 78085 has exactly one row equal to 78085. You may note that DISTINCT_RANGE_ROWS tells you something about the uniqueness of the rows represented by each histogram step.  If you calculate RANGE_ROWS / DISTINCT_RANGE_ROWS, you'll get the average number of "duplicate" rows for each distinct column value in the range.  Put another way, DISTINCT_RANGE_ROWS allows you to calculate the range's selectivity or density.