Are Azure OpenAI Service Rate Limits Shared Between Deployments?

Arik Levy 0

I have two Azure OpenAI services, one called dev and the other prod, both using the GPT-4 model deployed on UKS with different versions.

The prod service has a TPM of 35k and the dev service has a TPM of 10k.

While testing, I encountered rate limit issues on the dev service, prompting me to create a simple for loop that calls the model, asks a basic question, and waits one second between requests. The token count for these requests is negligible. Despite this, I am hitting a rate limit error after the 10th request, even though the RPM limit is set to 60.

Is it possible that the two deployments are interfering with each other in terms of rate limits?

kothapally Snigdha 260 Reputation points Microsoft Vendor

2024-11-05T22:34:54.4333333+00:00

Hi Arik Levy,

Thanks for Reaching the Microsoft Q&A Forum.

Each deployment has its own limits, which can affect how requests are processed. In your case, the dev service has a TPM of 10k and a corresponding RPM limit that allows for approximately 60 requests per minute

Even though you have implemented a one-second delay between requests in your loop, the service may still be evaluating the request rate over shorter intervals (like 1 or 10 seconds). If you send multiple requests within these intervals, you could exceed the RPM limit.

Each deployment (dev and prod) has its own rate limits. However, if both services share the same subscription or resource group, they may be subject to a global quota that could potentially lead to interference. This means that high usage on one deployment could impact the other if they share limits at the subscription level.

Verify whether both deployments share any resources or quotas that could be leading to conflicts in rate limiting. If they do, consider restructuring your deployments to isolate their usage.

Thank you!

Arik Levy 0

Both deployments are in different subscriptions. I'm just running the code below:

My gpt-4 deployment (I set the TPM low to test this) is at 10k TPM and 60 RPM User's image Why am I hitting a 429 Too Many Requests, error by the 15th request? My initial thought was that my other gpt-4 deployment in prod was the culprit but they don't share a subscription.

for i in range(60):
    try:
        # Send the request to the client
        response = client.chat.completions.create(
            model=deployment_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": 'What is the capital of Spain?'}
            ],
            temperature=0
        )
        # Log the count of requests
        logging.info("Request %d: Successful", i + 1)
        time.sleep(2)

        total_tokens += response.usage.total_tokens
        print(total_tokens)

kothapally Snigdha 260 Reputation points Microsoft Vendor

2024-11-06T22:52:41.7466667+00:00

Hi Arik Levy,

The error 429 means that you have submitted too many tokens or requests in a short period of time and have exceeded the number of requests allowed.

Azure services often enforce quotas on transactions per minute (TPM). If your application or service exceeds this quota, it will start receiving 429 errors.

Sometimes, 429 errors can occur due to transient issues while the Azure OpenAI service scales up to meet demand. In such cases, implementing retry logic with exponential backoff can help mitigate the issue.

If you have multiple deployments, ensure that the workload is evenly distributed among them. Sharp changes in workload can trigger rate limits, so gradually increasing the workload can help avoid hitting these limits.

Thank you!
kothapally Snigdha 260 Reputation points Microsoft Vendor

2024-11-07T23:31:00.5833333+00:00

Hi Arik Levy,

Following up to see if the above response was helpful.

1 answer

Arik Levy 0 Reputation points

2024-11-06T10:43:23.0166667+00:00
Please sign in to rate this answer.

0 comments No comments
Sign in to comment

Use comments to ask for clarification, additional information, or improvements to the question.

Share via

Are Azure OpenAI Service Rate Limits Shared Between Deployments?

1 answer

Your answer