SSIS Lookups modes or leaky abstractions

I've got a question about SSIS Lookup,

how-to change Partial or No_Cache mode in Lookup task ?

I could give a quick answer, but wanted to explain what I think we did wrong in SQL 2005, and how are we planning to fix this. Before continuing, I recommend everyone to read this article:

Joel Spolsky, The Law of Leaky Abstractions
https://www.joelonsoftware.com/articles/LeakyAbstractions.html

What that article has to do with SSIS Lookup and the question in particular?

If you look at the Lookup UI, you wound not find any reference to Partial or NoCache mode (it does show up in Properties panel, but most users miss it). The idea was that you don't have to know about all these details. The UI abstracts away these modes by providing advanced options, where you can enable Memory Restrictions. If have enough memory, you would keep defaults. But if you have less memory, you would enable Memory Restrictions, and specify how much memory Lookup should use. It is all clean and simple, and ideally you don't need to know about all these internal modes and how Lookup operates. I wish :).

And of course, like most abstractions, this one is very leaky. It is even mentioned in some Books Online articles, even though BOL does not mention the Lookup modes when describing Lookup Editor.

Internally, SSIS Lookup has three caching modes: Full, Partial and None. They dramatically change how the Lookup operates. Full cache is used if you don't check Enable memory restrictions, None if you check Enable memory restrictions but don't select Enable caching, and Partial cache is used when you check both Enable memory restrictions and Enable caching.

In Fully Cached mode, Lookup fetches all the data into memory first, then processes its input, matching (using internal binary comparison) input data with reference data. Obviously, SQL is not used anymore to process all this data. So you don't need SQL indexes, but you better have enough RAM to hold all the reference data.

In Partial or No Cache mode, Lookup does not fetch any data in advance. It starts with empty internal cache. For each input row, it looks into the cache, and then if it does not find data, it executes SQL SELECT statement and saves the result in the cache. You set the size of the cache for Partial cache mode in the Lookup properties. For No Cache mode, only the very last row is cached (which makes No Cache a bit of confusing name, but it's been decided it does not make much sense to throw away this last row, the memory for it has already been allocated anyway). Obviously, you need an index on the reference table for this to perform well. Also, SQL uses its own collation rules that are usually different from internal binary comparison.

Now you probably see why this abstraction was very leaky. The results could be different (SQL collation vs. binary collation), you need to create indexes for partial/no-cache mode, etc. And hopefully, now you know enough about what is hidden by this abstraction to use Lookup efficiently and correctly.

P.S. In SQL 2008, we decided to stop hiding this abstraction, and provide direct access to the Lookup cache modes. Here is the first page of new Lookup editor:
https://blogs.msdn.com/photos/michen/images/5267231/original.aspx
I'll probably blog about other new items you see here - Cache connection manager and handling of no-match entries later.

Comments

  • Anonymous
    October 03, 2007
    PingBack from http://www.artofbam.com/wordpress/?p=4972

  • Anonymous
    October 03, 2007
    Hi Mike, I know my post has nothing to do with your article, but I have a quick question for you? Do you intend to open SSIS DOM a little a bit and allow access to the package object at runtime? It is really painful and impossible to implement newer functionality. In my case I need access the package object from custom connection manager I'm implementing. I don't see a reason, why this wasn't allowed in the first place? I don't see how the performance can be impacted in any way. Regards, Ivan

  • Anonymous
    November 03, 2007
    Hi Mike, Regarding the "BIN" comparison with no relation to the data source's collation - i believe it is a leak. The simplest example is using conditional split. What are the plans for SSIS 2008 and what are the work arounds for SSIS 2005? Thanks in advance. Asaf.

  • Anonymous
    October 18, 2008
    Over the past couple of months I've been putting together a presentation on the Lookup Transform. I presented