Azure OpenAI and APIM - load balancing between instances with different deployments

Jerzy Czopek 30 Reputation points
2024-09-05T13:59:54.8366667+00:00

I'm exploring the possibilities and options to configure a shared GenAI platform using Azure OpenAI and APIM. I know APIM has a set of capabilities that work directly with the Azure OpenAI service, especially the load balancers.

However, I'm unable to find an answer to the following question: is APIM able to load balance/route requests to Azure OpenAI backends if there are differences in model/deployment availability across those instances?

Azure OpenAI has different models available in different regions. Let's say I have 3 Azure OpenAI instances with the following models:

  • AOI_1: gpt-4o, embedding-3-small
  • AOI_2: gpt-4o, embedding-3-small, gpt-4o-mini
  • AOI_3: gpt-4o, gpt-4o-mini

Does APIM know what deployments exist in each backend AOI service? If I send a request to APIM for the gpt-4o-mini deployment, it should only consider AOI_2 and AOI_3. And when I request embedding-3-small, should the request be routed to AOI_1 and AOI_2?

What are the options/possibilities to implement this in case it's not an out-of-the-box feature?

Thanks in advance!

Azure API Management
Azure OpenAI Service

2 answers

  1. LeelaRajeshSayana-MSFT 14,831 Reputation points Microsoft Employee
    2024-09-06T01:59:08.72+00:00

    Hi @Jerzy Czopek, thanks for posting the question here. This is an interesting scenario.

    Looking at the implementation of load balancing using Azure APIM across multiple instances of the Azure OpenAI service, we just provide the backend IDs as pool sources for load balancing. Please refer to the implementation sample of load balancing outlined in the article Using Azure API Management Circuit Breaker and Load balancing with Azure OpenAI Service.

    [Image: load-balancing implementation sample from the referenced article]

    This approach does not take the underlying OpenAI models deployed on each endpoint into account; APIM is not aware of which deployments each backend hosts.
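    For reference, here is a minimal sketch of what such a policy can look like, assuming a load-balanced pool backend has already been created from the individual Azure OpenAI backends (the pool name aoai-lb-pool is purely illustrative):

        <policies>
            <inbound>
                <base />
                <!-- Send every request to the load-balanced pool of Azure OpenAI backends.
                     APIM picks a pool member according to the pool's load-balancing
                     configuration; it does not check which deployments exist on that member. -->
                <set-backend-service backend-id="aoai-lb-pool" />
            </inbound>
            <backend>
                <base />
            </backend>
            <outbound>
                <base />
            </outbound>
            <on-error>
                <base />
            </on-error>
        </policies>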

    Please refer to the document Intelligent Load Balancing with APIM for OpenAI for the different options available and their approaches to load balancing.

    It is also advised to have the same model deployments across the OpenAI endpoints when setting up load balancing, as the currently available load-balancing options (round-robin, random, priority, weighted) do not offer a way to configure what you are seeking.

    However, if you decide to use a load balancer with multiple versions of OpenAI models, you may consider using the set-backend-service policy to direct an incoming API request to an alternate backend. This approach uses conditional logic and redirects requests based on various parameters, such as location, the gateway the call came from, or other expressions such as versions. Once you have this in place, you can add an additional check on the instance endpoint and route the incoming calls accordingly. Please refer to the documentation Reference backend using set-backend-service policy to know more about this.
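    As an illustrative sketch only (the pool backend IDs below are hypothetical, and it assumes the API was imported from the Azure OpenAI specification so that the deployment name is available as the deployment-id template parameter), the conditional routing for your three instances could look like this:

        <inbound>
            <base />
            <choose>
                <!-- gpt-4o-mini is only deployed on AOI_2 and AOI_3 -->
                <when condition="@(context.Request.MatchedParameters[&quot;deployment-id&quot;] == &quot;gpt-4o-mini&quot;)">
                    <set-backend-service backend-id="pool-aoi-2-3" />
                </when>
                <!-- embedding-3-small is only deployed on AOI_1 and AOI_2 -->
                <when condition="@(context.Request.MatchedParameters[&quot;deployment-id&quot;] == &quot;embedding-3-small&quot;)">
                    <set-backend-service backend-id="pool-aoi-1-2" />
                </when>
                <!-- gpt-4o is deployed on all three instances -->
                <otherwise>
                    <set-backend-service backend-id="pool-aoi-all" />
                </otherwise>
            </choose>
        </inbound>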

    Hope this answers your question. Please let us know if you need any additional information.

  2. Laziz 185 Reputation points Microsoft Employee
    2024-09-06T11:24:15.12+00:00

    Hi Jerzy Czopek, yes, you can mix and use various Azure OpenAI models/deployments behind API-M.

    A more typical use case is when you want to use a mix of GPT-3.5, GPT-4 and GPT-4o in various regions, e.g. using EU-only Azure regions to stay data compliant. You can check how to implement something like this in my GitHub repo here: https://github.com/LazaUK/AOAI-APIM-AIGateway.

    If you want to introduce a more granular split, you would need to adjust the policy in API-M with a new conditional check, e.g. to check the name of the requested deployment. If you adopt a specific naming convention (like GPT- or Emb- prefixes for deployments), the policy can engage only a specific backend sub-pool, as sketched below. The challenge with such a scenario is its management, as your models/backends change over time.
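    A minimal sketch of such a prefix check (the Emb- convention and the sub-pool backend IDs are illustrative, again assuming the requested deployment name arrives as the deployment-id template parameter):

        <choose>
            <!-- Deployments following the "Emb-" naming convention go to the embeddings sub-pool -->
            <when condition="@(context.Request.MatchedParameters.GetValueOrDefault(&quot;deployment-id&quot;, &quot;&quot;).StartsWith(&quot;Emb-&quot;))">
                <set-backend-service backend-id="pool-embeddings" />
            </when>
            <!-- Everything else (e.g. "GPT-" deployments) goes to the chat sub-pool -->
            <otherwise>
                <set-backend-service backend-id="pool-gpt" />
            </otherwise>
        </choose>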
