Windows Server
Posted March 7

This article is part of a series on API Management and Generative AI. We believe that adding Azure API Management to your AI projects can help you scale your AI models and make them more secure and easier to manage. In this article, we shed some light on the capabilities in API Management that are designed to help you govern and manage Generative AI APIs, so that you can build resilient and secure intelligent applications.

But why exactly do I need API Management for my AI APIs?

Common challenges when implementing Gen AI-powered solutions include:

- Allocating quota, calculated in tokens per minute (TPM), across multiple client apps
- Controlling and tracking token consumption for all users
- Attributing costs to specific client apps, activities, or users
- Keeping your system resilient to backend failures when one or more limits are hit

And the list goes on. Well, let's find some answers, shall we?

Quota allocation

Take a scenario where you have more than one client application, each talking to one or more models from Azure OpenAI Service or Azure AI Foundry. With this complexity, you want control over how the quota is distributed to each application.

Tracking token usage & security

I bet you agree that it would be unfortunate if one of your applications (most likely the one that gets the highest traffic) hogged all the TPM quota, leaving zero tokens for your other applications, right? If this occurs, it could also be a DDoS attack, with bad actors bombarding your system with purposeless traffic to cause service downtime. That is yet another reason why you need more control and tracking mechanisms.
Token metrics

As a data-driven company, you will find it extremely valuable to have additional insights, with the flexibility to dissect and examine usage data down to dimensions such as subscription ID or API ID. These metrics go a long way in informing capacity and budget planning decisions.

Automatic failovers

This is a common one. You want your users to experience zero service downtime, so if one of your backends is down, does your system architecture allow automatic rerouting to healthy services?

So, how will API Management help address these challenges?

API Management has a set of policies and metrics called Generative AI (Gen AI) gateway capabilities, which give you control over all these moving pieces of your intelligent systems.

Minimize cost with token-based limits and semantic caching

How can you minimize operational costs for AI applications as much as possible?

By leveraging the `llm-token-limit` policy in Azure API Management, you can enforce token-based limits per caller, keyed on identifiers such as subscription keys or requesting IP addresses. When a caller exceeds their allocated tokens-per-minute quota, they receive an HTTP 429 "Too Many Requests" error along with ‘retry-after’ instructions. This mechanism ensures fair usage and prevents any single caller from monopolizing resources.

To optimize the cost of Large Language Models (LLMs), it is crucial to minimize the number of API calls made to the model. The `llm-semantic-cache-store` and `llm-semantic-cache-lookup` policies allow you to store completions and retrieve them for similar prompts. A cache lookup is performed first, and when a reusable completion is found, the request never reaches the LLM backend. Fewer backend calls, in turn, lower operational costs.
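To picture how a token-based limit behaves from the caller's side, here is a minimal Python sketch. It is not how API Management implements the policy: the `TokenRateLimiter` class, the fixed one-minute window, and the way the retry delay is computed are all assumptions made for illustration. It only mirrors the behavior described above, where each key (for example a subscription key) has its own tokens-per-minute budget and gets a 429 with a retry delay once that budget is spent.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class TokenRateLimiter:
    """Toy fixed-window limiter (hypothetical, for illustration only).

    Counts tokens per key inside a window and rejects requests that
    would push a key over its tokens-per-minute budget.
    """

    tokens_per_minute: int
    window_seconds: int = 60
    # key -> (window_start_time, tokens_used_in_that_window)
    _usage: dict = field(default_factory=dict)

    def check(self, key: str, tokens: int, now: float) -> tuple[int, int]:
        """Return (status_code, retry_after_seconds) for one request."""
        window_start, used = self._usage.get(key, (now, 0))
        if now - window_start >= self.window_seconds:
            # The previous window has expired, so start a fresh one.
            window_start, used = now, 0
        if used + tokens > self.tokens_per_minute:
            # Over budget: reject and report when the window resets.
            retry_after = math.ceil(window_start + self.window_seconds - now)
            self._usage[key] = (window_start, used)
            return 429, retry_after
        self._usage[key] = (window_start, used + tokens)
        return 200, 0


if __name__ == "__main__":
    limiter = TokenRateLimiter(tokens_per_minute=1000)
    print(limiter.check("app-a", 600, now=0.0))   # (200, 0)
    print(limiter.check("app-a", 600, now=10.0))  # (429, 50)
    print(limiter.check("app-b", 600, now=10.0))  # (200, 0)
    print(limiter.check("app-a", 600, now=61.0))  # (200, 0)
```

Note that `app-b` is unaffected when `app-a` runs out of tokens, which is the fair-usage property the policy is meant to provide. The time is passed in explicitly so the sketch is easy to test; a real gateway would use its own clock and its own counting strategy.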
Ensure reliability with load balancing and circuit breakers

Azure API Management allows you to use load balancers to distribute the workload across multiple prioritized LLM backends. Additionally, you can set up circuit breaker rules that redirect requests to a responsive backend if the prioritized one fails, minimizing recovery time and improving system reliability. Semantic caching helps here too: besides saving costs, it reduces system latency by cutting the number of calls the backend has to process.

Okay. What next?

This article covers these capabilities at a high level. In the coming weeks, we will publish articles that go deeper into each of these generative AI capabilities in API Management, with examples of how to set up each policy. Stay tuned!

Do you have any resources I can look at in the meantime to learn more?

Absolutely! Check out:

- Manage your Azure OpenAI APIs with Azure API Management: http://aka.ms/apimlove