Get cached responses of Azure OpenAI API requests
APPLIES TO: All API Management tiers
Use the azure-openai-semantic-cache-lookup policy to perform a cache lookup of responses to Azure OpenAI Chat Completion API and Completion API requests from a configured external cache, based on the vector proximity of the prompt to previous requests and a specified similarity score threshold. Response caching reduces bandwidth and processing requirements imposed on the backend Azure OpenAI API and lowers the latency perceived by API consumers.
Note
- This policy must have a corresponding Cache responses to Azure OpenAI API requests policy.
- For prerequisites and steps to enable semantic caching, see Enable semantic caching for Azure OpenAI APIs in Azure API Management.
- Currently, this policy is in preview.
Note
Set the policy's elements and child elements in the order provided in the policy statement. Learn more about how to set or edit API Management policies.
Supported Azure OpenAI Service models
The policy is used with APIs added to API Management from the Azure OpenAI Service of the following types:
| API type | Supported models |
|---|---|
| Chat completion | gpt-3.5, gpt-4 |
| Completion | gpt-3.5-turbo-instruct |
| Embeddings | text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002 |
For more information, see Azure OpenAI Service models.
Policy statement
<azure-openai-semantic-cache-lookup
    score-threshold="similarity score threshold"
    embeddings-backend-id="backend entity ID for embeddings API"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true | false"
    max-message-count="count">
    <vary-by>"expression to partition caching"</vary-by>
</azure-openai-semantic-cache-lookup>
Attributes
| Attribute | Description | Required | Default |
|---|---|---|---|
| score-threshold | Similarity score threshold used to determine whether to return a cached response to a prompt. Value is a decimal between 0.0 and 1.0. Learn more. | Yes | N/A |
| embeddings-backend-id | Backend ID for OpenAI embeddings API call. | Yes | N/A |
| embeddings-backend-auth | Authentication used for Azure OpenAI embeddings API backend. | Yes. Must be set to system-assigned. | N/A |
| ignore-system-messages | Boolean. If set to true, removes system messages from a GPT chat completion prompt before assessing cache similarity. | No | false |
| max-message-count | If specified, number of remaining dialog messages after which caching is skipped. | No | N/A |
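To illustrate the optional attributes, the following sketch ignores system messages when assessing cache similarity and skips caching once a chat carries more than a set number of messages. The backend ID, threshold, and count values are placeholders, not recommendations.

```xml
<azure-openai-semantic-cache-lookup
    score-threshold="0.1"
    embeddings-backend-id="azure-openai-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
```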
Elements
| Name | Description | Required |
|---|---|---|
| vary-by | A custom expression determined at runtime whose value partitions caching. If multiple vary-by elements are added, values are concatenated to create a unique combination. | No |
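As a sketch of how multiple vary-by elements combine, the following fragment partitions the cache by both subscription and a custom request header; the header name here is purely illustrative.

```xml
<azure-openai-semantic-cache-lookup
    score-threshold="0.1"
    embeddings-backend-id="azure-openai-backend"
    embeddings-backend-auth="system-assigned">
    <!-- Cached responses are shared only within the same subscription... -->
    <vary-by>@(context.Subscription.Id)</vary-by>
    <!-- ...and the same value of an illustrative custom header -->
    <vary-by>@(context.Request.Headers.GetValueOrDefault("X-Tenant-Id", ""))</vary-by>
</azure-openai-semantic-cache-lookup>
```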
Usage
- Policy sections: inbound
- Policy scopes: global, product, API, operation
- Gateways: v2
Usage notes
- This policy can only be used once in a policy section.
Examples
Example with corresponding azure-openai-semantic-cache-store policy
<policies>
<inbound>
<base />
<azure-openai-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="azure-openai-backend"
    embeddings-backend-auth="system-assigned">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
</inbound>
<outbound>
<azure-openai-semantic-cache-store duration="60" />
<base />
</outbound>
</policies>
Related policies
- Caching
Related content
For more information about working with policies, see:
- Tutorial: Transform and protect your API
- Policy reference for a full list of policy statements and their settings
- Policy expressions
- Set or edit policies
- Reuse policy configurations
- Policy snippets repo
- Author policies using Microsoft Copilot in Azure