How does [BlobInput] work?

The Azure WebJobs SDK supports running functions when a new blob is added.  IE, you can write code like this:

         public static void CopyWithStream(
            [BlobInput("container/in/{name}")] Stream input,
            [BlobOutput("container/out1/{name}")] Stream output
            )
        {
            Debug.Assert(input.CanRead && !input.CanWrite);
            Debug.Assert(!output.CanRead && output.CanWrite);

            input.CopyTo(output);
        }

See modelbinding to blobs for how we bind the blob to types like Stream.  In this entry, I wanted to explain how we handle the blob listening.  The executive summary is:

  1. The existing blobs in your container are processed immediately on startup.
  2. But once you’re in steady state, [BlobInput] detection (from external sources) can take up to 10 minutes. If you need fast responses, use [QueueInput].
  3. [BlobInput] can be triggered multiple times on the same blob. But the function will only run if the input is newer than the outputs.

More details…

Blob listening is tricky since the Azure Storage APIs don’t provide this directly. WebJobs SDK builds this on top of the existing storage APIs by:

1. Determining the set of containers to listen on by scanning  the [BlobInput] attributes in your program via reflection in the JobHost ctor. This is a  fixed list because while the blob names can have { } expressions, the container names must be constants.  IE, in the above case, the container is named “container”, and then we scan for any blobs in that container that match the name “in/{name}”.

2. When JobHost.RunAndBlock is first called, it will kick off a background scan of the containers. This is naively using CloudBlobContainer.ListBlobs.

    a. For small containers, this is quick and gives a nice instant feel.  
    b. For large containers, the scan can take a long time.

3. For steady state, it will scan the azure storage logs. This provides a highly efficient way of getting notifications for blobs across all containers without pulling. Unfortunately, the storage logs are buffered and only updated every 10 minutes, and so that means that the steady state detection for new blobs can have a 5-10 minute lag. For fast response times at scale, our recommendation is to use Queues.

The scanning from #2 and #3 are done in parallel.

4. There is an optimization where any blob written via a [BlobOutput] (as opposed to being written by some external source) will optimistically check for any matching [BlobInputs], without relying on #2 or #3. This lets them chain very quickly. This means that a [QueueInput] can start a chain of blob outputs / inputs, and it can still be very efficient.

See Also

Comments

  • Anonymous
    April 19, 2014
    thanks!

  • Anonymous
    May 27, 2014
    The comment has been removed

  • Anonymous
    May 29, 2014
    Having [BlobInput] run multiple times is similar to how an azure queue message can be executed multiple times. Basically, it needs some evidence that the function has run. Azure doesn't tell us the precise moment when a blob is available, and so we basically need to make a best guess (per rules above). We err on the side of double-guessing as to not miss a blob. Add a bloboutput parameter to your function. The SDK can see the function ran, and not double-executed it. public static Work([BlobInput("container/{name}"] input, ... [BlobOutput("containerout/{name}"] TextWriter tw) {   tw.WriteLine("executed"); } We call this a "receipt" pattern. The bloboutput has no value other than being a "receipt" that the function ran.