API Rate Abuse in Azure OpenAI
How API Rate Abuse Manifests in Azure OpenAI
Azure OpenAI exposes each model deployment through a REST endpoint that enforces per‑deployment quotas such as tokens per minute (TPM) and requests per minute (RPM). When an attacker can send a high volume of calls without hitting effective throttling, they can exhaust the quota, trigger frequent 429 Too Many Requests responses, or drive up the customer’s bill through token‑farming attacks. A typical pattern targets the /deployments/{deployment-id}/completions or /embeddings endpoints with a tight loop that ignores the Retry‑After header, forcing the service to reject requests and potentially causing cost overruns if the deployment is configured for over‑quota billing.
Because each deployment is an independent bucket, discovering a deployment name (often leaked in logs, client‑side JavaScript, or misconfigured environment variables) lets an attacker focus all traffic on that specific resource, bypassing any global network‑level controls. Real‑world abuse observed in Microsoft’s security telemetry shows scripts sending thousands of completion requests per minute to harvest model output for data exfiltration or to inflate consumption, which aligns with OWASP API4:2023 – Lack of Resources & Rate Limiting.
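The tight-loop pattern described above can be illustrated with a small self-contained simulation. Nothing here calls a real endpoint; `make_fake_endpoint` is a hypothetical stand-in that models a per-deployment quota bucket of 60 requests per window:

```python
import itertools

def make_fake_endpoint(rpm_limit=60):
    """Stand-in for a per-deployment quota bucket: the first `rpm_limit`
    calls in a window succeed, the rest receive 429 plus Retry-After."""
    calls = itertools.count(1)
    def endpoint():
        n = next(calls)
        if n <= rpm_limit:
            return 200, {}
        return 429, {"Retry-After": "60"}
    return endpoint

# Abusive client: a tight loop that never reads the Retry-After header.
endpoint = make_fake_endpoint(rpm_limit=60)
statuses = [endpoint()[0] for _ in range(1000)]
print(statuses.count(429))  # 940: only the first 60 calls in the window succeed
```

A well-behaved client would stop and sleep for the advertised Retry-After interval; the abusive loop instead keeps hammering the bucket, which is exactly the signature that shows up as a surge of 429s in telemetry.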
Azure OpenAI‑Specific Detection
Indicators of rate‑abuse include a surge of HTTP 429 responses, the presence of a Retry‑After header, and spikes in Azure Monitor metrics such as Total Calls and Blocked Calls for a deployment. Developers can also look at diagnostic logs for repeated QuotaExceeded error codes.
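As a toy illustration of these indicators, a log-analysis pass might flag a deployment when 429s dominate a time window. The log shape, the 50% threshold, and the `flags_rate_abuse` helper are assumptions for the sketch, not middleBrick's actual detection logic:

```python
from collections import Counter

def flags_rate_abuse(responses, threshold=0.5):
    """responses: list of (status_code, headers) tuples for one deployment
    over a time window. Flag when the share of 429s exceeds `threshold`
    and the service is actively sending Retry-After headers."""
    if not responses:
        return False
    counts = Counter(status for status, _ in responses)
    ratio_429 = counts[429] / len(responses)
    saw_retry_after = any("Retry-After" in headers for _, headers in responses)
    return ratio_429 > threshold and saw_retry_after

window = [(200, {})] * 3 + [(429, {"Retry-After": "60"})] * 7
print(flags_rate_abuse(window))  # True: 70% of calls in the window were throttled
```

In practice the same heuristic maps onto Azure Monitor's Blocked Calls metric or a diagnostic-log query for QuotaExceeded, rather than raw response tuples.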
middleBrick automates detection by probing the unauthenticated attack surface. When you submit an Azure OpenAI endpoint, the scanner sends a burst of requests and measures how the service responds, checking for missing or misconfigured rate‑limit headers and for the absence of protective mechanisms like exponential backoff.
Example CLI usage:
middlebrick scan https://myresource.openai.azure.com/openai/deployments/my-model/completions?api-version=2024-02-01
The output includes a dedicated Rate Limiting finding with severity, a short description, and remediation guidance such as “enable client‑side throttling or enforce limits via Azure API Management”. The same check can be added to a CI pipeline with the GitHub Action:
- name: Run middleBrick scan
  uses: middlebrick/action@v1
  with:
    api-url: https://myresource.openai.azure.com/openai/deployments/my-model/completions?api-version=2024-02-01
    fail-below: B
Azure OpenAI‑Specific Remediation
Effective mitigation combines client‑side throttling, service‑side policy enforcement, and monitoring.
- Client-side rate limiting: Use the Azure OpenAI SDK's built-in retry with exponential backoff and cap concurrent calls. In Python, the openai package (v1.x) respects the max_retries parameter and honors Retry-After. Adding a semaphore or token bucket further limits the request rate.
import os
import asyncio
from openai import AsyncAzureOpenAI

# Async client, so the completion calls below can be awaited
client = AsyncAzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    max_retries=5,  # built-in exponential backoff; honors Retry-After
)

# Allow at most 10 concurrent requests (~10 RPM if each call takes ~6 s)
semaphore = asyncio.Semaphore(10)

async def safe_completion(prompt: str, sem: asyncio.Semaphore = semaphore):
    async with sem:
        resp = await client.completions.create(
            model=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
            prompt=prompt,
            max_tokens=100,
        )
        return resp
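The token-bucket alternative mentioned above can be sketched as a small standalone class. The rate and capacity values are arbitrary examples, and a production limiter would block (or sleep) rather than return False:

```python
import time

class TokenBucket:
    """Allows roughly `rate` requests/second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=5)  # ~2 requests/second, burst of 5
allowed = sum(bucket.acquire() for _ in range(20))
print(allowed)  # typically 5: the initial burst, plus any tokens refilled mid-loop
```

A request would only be sent when `acquire()` returns True; pairing this with the semaphore above bounds both concurrency and sustained request rate.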
- Service-side enforcement: Front the deployment with Azure API Management and apply a limit-call-rate policy (e.g., 120 calls/minute) and a quota policy to reset monthly token consumption. APIM will return 429 with a proper Retry-After header before the request reaches the OpenAI backend.
- Monitoring and alerting: Create an Azure Monitor alert on the Blocked Calls metric or on the QuotaExceeded diagnostic log. When the threshold is breached, the alert can trigger a webhook to notify the team or to scale down the deployment temporarily.
By combining these controls, you reduce the risk of cost exploitation and denial of service while still allowing legitimate traffic. middleBrick's findings will reflect the improved posture, showing a higher security score and fewer rate-limit related findings.
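As a sketch, an APIM inbound policy implementing the limits above might look like the following. The specific values are examples, and attribute details should be checked against the APIM policy reference:

```xml
<policies>
  <inbound>
    <base />
    <!-- Reject more than 120 calls per 60-second window -->
    <rate-limit calls="120" renewal-period="60" />
    <!-- Cap total calls per renewal period (in seconds; ~30 days here) -->
    <quota calls="100000" renewal-period="2592000" />
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>
```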