Azure AI Search - Prevent rerunning AI Skills on Indexer or Index loss

Carter Musick 10 Reputation points
2024-02-11T19:31:24.5466667+00:00

I'm working on a project that will index a large number of blobs from Azure Storage (PDFs and images), using an OCR skill to extract text for use in Azure AI Search.

Using OCR to index all of this data is likely to cost multiple thousands of dollars, and I'm trying to figure out the best way to avoid having to redo all of the OCRing if some Azure resource (index or indexer) goes down, gets deleted, hits an unrecoverable error, etc. I've got a few initial thoughts:

  1. Manually back up the search indices and restore them if necessary - I want to do a proof of concept here, but am unsure whether this would be sufficient if the indexer goes down or gets deleted. Presumably it would help with an error in the index itself.
  2. Use incremental enrichment with cached content - This seems like something we'd likely do anyway, since later updates to skills or other parts of the indexing pipeline could reuse results from the cached enrichment. I'm not sure it would adequately handle a problem with the indexer, though. I had considered manually backing up the cache and restoring it to a new indexer; however, the documentation says:

> Each indexer is assigned a unique and immutable cache identifier that corresponds to the container it is using.
> [...]
> The lifecycle of the cache is managed by the indexer. If an indexer is deleted, its cache is also deleted.

  3. Remove the OCRing from the indexing pipeline, and store the intermediate results manually. These intermediate results would then be used for the search index. I'm guessing this would be the safest option, but I'd be losing some of the convenience of having the OCR enrichment be part of the search indexing. I'm also not sure what the best storage method would be for this data (JSON blobs in another storage container?), nor how best to handle incrementally OCRing new documents as they come in. I could update the code that uploads documents, but having the indexer process new documents on a schedule is another convenience I'd like to keep if possible.
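
To make option 3 concrete, here is a minimal sketch of incremental OCR with manually stored results. It is only an illustration under assumptions: `run_ocr` is a hypothetical stand-in for whatever OCR call is actually used, and `store` is a plain dictionary standing in for a separate storage container of JSON blobs. Results are keyed by blob name plus a hash of the content, so a document is only OCRed again if its bytes change.

```python
import hashlib
import json
from typing import Callable, Dict


def content_key(blob_name: str, data: bytes) -> str:
    """Key a result by blob name and content hash, so edited blobs get reprocessed."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{blob_name}:{digest}"


def ocr_incrementally(
    blobs: Dict[str, bytes],
    store: Dict[str, str],
    run_ocr: Callable[[bytes], str],
) -> int:
    """OCR only blobs with no stored result; return the number of OCR calls made.

    `store` is a hypothetical stand-in for durable storage (for example JSON
    blobs in another container). `run_ocr` is a hypothetical OCR function.
    """
    calls = 0
    for name, data in blobs.items():
        key = content_key(name, data)
        if key in store:
            continue  # already OCRed with identical content, skip the paid call
        store[key] = json.dumps({"source": name, "text": run_ocr(data)})
        calls += 1
    return calls
```

Because the stored results live outside the search service, deleting or rebuilding an index or indexer would not touch them; a scheduled job could call `ocr_incrementally` over newly listed blobs and the index could then be populated from the stored JSON.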

1 answer

  1. Grmacjon-MSFT 17,286 Reputation points
    2024-03-13T04:01:53.06+00:00

    Hi @Carter Musick, you might be overcomplicating things with your proposed solutions. Since you're focused on backing up the output of your skills, you should consider implementing a knowledge store: Knowledge store concepts - Azure AI Search | Microsoft Learn. Any data that you send to an index can also be sent to a knowledge store, and you could then recreate the index from the knowledge store by pointing to it as a data source. This would help keep the AI enrichments. The indexer cache can be invalidated under specific conditions, so it doesn't work for an index loss scenario.
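
As a rough illustration of the knowledge store suggestion, the sketch below builds a skillset definition that runs OCR and also projects the result into Azure Storage. The field names follow the skillset REST payload as I understand it, so please verify them against the current API reference. The function name `build_skillset`, the `ocrProjection` node, and the container name are hypothetical choices made for this example, and it assumes the indexer is configured to generate normalized images.

```python
import json
from typing import Any, Dict


def build_skillset(name: str, storage_connection_string: str, container: str) -> Dict[str, Any]:
    """Return a skillset body with an OCR skill and a knowledge store projection."""
    return {
        "name": name,
        "skills": [
            {
                # Built-in OCR skill; needs normalized images from the indexer.
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": "/document/normalized_images/*",
                "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
                "outputs": [{"name": "text", "targetName": "text"}],
            },
            {
                # Shaper skill gathers the OCR text into one node to project.
                # "ocrProjection" and "pages" are hypothetical names.
                "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
                "context": "/document",
                "inputs": [{"name": "pages", "source": "/document/normalized_images/*/text"}],
                "outputs": [{"name": "output", "targetName": "ocrProjection"}],
            },
        ],
        "knowledgeStore": {
            "storageConnectionString": storage_connection_string,
            "projections": [
                {
                    "tables": [],
                    # Object projections are written as JSON blobs in the container.
                    "objects": [{"storageContainer": container, "source": "/document/ocrProjection"}],
                    "files": [],
                }
            ],
        },
    }


if __name__ == "__main__":
    body = build_skillset("ocr-skillset", "<storage-connection-string>", "ocr-results")
    print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a create-skillset request. The projected blobs live in the storage account rather than in the search service, which is why they can later serve as a data source for rebuilding the index.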

    Regarding intermediate results, can you please provide more context about your scenario so we understand what you're trying to do? Why do you need intermediate results saved, and what do you mean by intermediate results?

    Best,

    Grace
