Conor vs. window functions, sequence/ranking functions order-of-operations
We had a customer issue today that ended up being by-design but was confusing enough that I offered to write up a blog post so that people could reference it in the future. It boils down to the question “when should ROW_NUMBER() be computed in a query?”
I’ll define some terms:
Sequence/Ranking functions: ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(N). These are all functions that use the OVER clause. They support PARTITION BY and ORDER BY clauses within OVER() and are useful for numbering and computing values over a sequence (instead of a set). These return one value over a group of rows. Conceptually, they are computed over the set and then added as a “constant” column over that group of rows.
Window functions: These are new in SQL Server 2012, but the functions overlap aggregate functions such as SUM(), AVG(), … (There are some that are not aggregates). These support operations that return different values for each row in the sequence. Interesting examples include running sums (sum of all values for this column from the first row in the group to the current row) or moving averages (average of last 10 rows). These computations are also done over an ordering of a set of rows. There are additional phrases in the OVER() clause that control the behavior of these windows over which the computations are done.
When the ANSI committee added these operations into SQL, they added some rules about the order of operations which may not be entirely obvious. So I’ll write a little about it here.
Let’s use the following example data set:
1: use tempdb
2: go
3: create table t1(col1 int, col2 int)
4: insert into t1 (col1, col2) values (1, 1)
5: insert into t1 (col1, col2) values (1, 2)
6: insert into t1 (col1, col2) values (1, 3)
7: insert into t1 (col1, col2) values (1, 4)
8: insert into t1 (col1, col2) values (2, 5)
9: insert into t1 (col1, col2) values (2, 6)
10: insert into t1 (col1, col2) values (2, 7)
11: insert into t1 (col1, col2) values (2, 8)
12: go
Now let’s do a basic query and define the binding order for each part of the query.
1: select * from t1 order by col1, col2
This query returns all of the rows in ascending order by col1, col2. The conceptual series of steps taken are:
1. Get rows from t1
2. Get the columns that you wish to return (the project list)
3. perform the order by
Being a Query Optimizer guy, it is important that I remind you that SQL Server does not have to perform things in this exact order – it just needs to return the rows as if it had performed these steps. Additionally, this order of operations determines the binding behavior of the query. So, the project list can refer to the columns in the FROM clause but it can’t refer to other columns in the project list (which is why SELECT col1 as col2, col2 as col1 means “read t1.col1 and call it col2, read t1.col2 and call it col1” instead of “read t1.col1 and call it col2, then return t1.col1 which we just called col2 again”. This binding order is pretty important and it also impacts how the query’s ORDER BY is interpreted – you can refer to the results of the prior step when you do the ORDER BY. This means that ORDER BY binds to the aliases, not the source data in the FROM clause.
Now let’s get to the customer question. The issue at hand is what happens when you have Ranking functions (such as ROW_NUMBER()) and views – in which order are operations performed?
I’ve created a synthetic example. Effectively, we need to determine what happens with Ranking functions, views, project lists, ORDER BY on the query, and filters.
1: create view v1 as
2: select row_number() over (order by col1, col2) as rownum, * from t1
On top of the view there was a query that (in simplified terms was):
1: select * from v1 where col1 = 2 order by rownum
The results of this query is:
The confusion was about what should be returned from the query – does it return ROW_NUMBER() values starting at 1? (I should point out that the customer query was more complicated and contained parallel queries, which was a major contributor to the confusion – it just happened to be unrelated to the core issue in the end so I thought a blog post would help everyone go through the binding order to help in future such cases).
When the customer ran the query without the view, they got the following results
1: select row_number() over (order by col1, col2) as rownum, * from t1
2: where col1 = 2
The reason why these are different is that the order of operations is different. In the first query (where we get back 5, 6, 7, 8), the conceptual order of operations in this query is:
1. FROM clause
2. Project List for the view
3. ROW_NUMBER()
4. Filter
5. Query ORDER BY (which in ANSI is called the “presentation order”)
For the second query (which returns 1,2,3,4), the conceptual order of operations is:
1. FROM clause
2. WHERE clause
3. Project List
4. Ranking Functions (ROW_NUMBER())
(5. ORDER BY for the query, which I did not include in this case)
The Ranking functions are funny because they are conceptually performed after the Project list yet can not bind to aliases in the project list. If you try the following in SQL Server, it will error:
1: select row_number() over (order by foo, col2) as rownum, col1 as foo, * from t1
Msg 207, Level 16, State 1, Line 1
Invalid column name 'foo'.
So Ranking functions are a conceptual extension of the project list but have special rules about PARTITION BY and ORDER BY in the OVER clause. When you look at the query plans for these in SQL Server, you will see separate operators for the ranking and windowing functions compared to a normal “compute scalar”.So if you need to understand why a query returns different values for these ranking functions from what you expect, I suggest you start by walking through the conceptual order of operations (perhaps with an isolated example) until you have a good grasp of the way the query binds.
Happy Querying!
Conor
Comments
- Anonymous
May 27, 2012
Great post. The ROW_NUMBER() function works the way I thought it would but I can clearly see why someone could expect it to work as the last step in a query (and thus do row 1, 2, 3, 4 for the select from the view). It has to work the way it does so that the ROW_NUMBER() function could be used from within a subquery to do things like select the 3rd, 10th and 15th rows (for example). That is select * from ( select col1, col2, r = row_number() over (order by col1, col2) from t1 ) X where X.r in (3, 10, 15) It's a bit more clear in this case because the nested query can really be considered separate from the outer query I suppose (unless it was correlated but it still doesn't really reduce the clarity that much).