Conor vs. UDFs in Joins
In my previous post, I reviewed the basic logic behind why the Optimizer chooses one join algorithm over another. In this post, I’ll go one step further and discuss what happens when the basic logic for joins is combined with the logic for how UDFs work.
(Apologies – I need to find the code plugin for my blog writer)
Here is the code snippet from the customer:
Drop function Func
go
create function Func(@id integer)
returns varchar(510)
as
begin
return cast(@id as varchar(510))
end
go
drop table #x1
go
Create table #x1
(
Id integer
)
go
insert into #x1(Id)
select top(67) ROW_NUMBER() over (order by (select null))
from sys.columns a cross join sys.columns b
go
select * from #x1 a join #x1 b on dbo.Func(a.id) = dbo.Func(b.id)
option(recompile)
go
insert into #x1(Id)
select 68
go
select * from #x1 a join #x1 b on dbo.Func(a.id) = dbo.Func(b.id)
option(recompile)
The question is why the first query does a loops join and the second does a hash join for these UDFs. As you might guess, the answer is actually mostly independent of the UDFs and really has more to do with the number of rows. However, there is a bit more to the story, so I will post a bit more for you about UDFs.
This particular UDF wraps a single piece of scalar logic. However, UDFs are opaque to the Optimizer in SQL Server. So, it doesn’t expand these functions and in-line them. (it could, but the code does not do this today). Generally, most general scalars are not costed in the current SQL Server code since they are relatively cheap compared to the cost of IO. Relational operations, such as a hash join, do have a cost function that has a CPU component, so the cost of processing the scalars are typically wrapped up there. There is no real scalar costing support for your own scalar functions, however. So, these are not really impacting the plan choice in this example.
Interestingly, if you were to rewrite the query without the UDFs and just inline the scalar logic manually, there does appear to be a cost to that and the plan choice tips to hash join earlier.
I generally recommend that customers NOT use UDFs in cases where plan choice is important to their system (ie big queries) and to specifically not put subqueries inside of UDFs. Using a procedural programming paradigm when writing SQL will usually lead to horrible performance because it prevents the Optimizer from making reasoned choices about the join order/algorithm (they too are not costed, even though they do queries in them).
So, I hope that gives you a bit of insight as to how the costing framework works.
Happy Querying!
Conor
Comments
Anonymous
July 01, 2010
Can you explain the join logic on table-valued UDFs? We utilize TVFs for large chunks of data in joins and it seems that performance is hit or miss.Anonymous
July 04, 2010
The comment has been removedAnonymous
July 24, 2010
The comment has been removed