Guest Patrice Vignola Posted May 27

DirectML support for Phi-3 mini launched last month, and we've since made several improvements, unlocking more models and even better performance! Developers can grab already-quantized versions of Phi-3 mini (with variants for the 4k and 128k versions). They can now also get Phi-3 medium (4k and 128k) and Mistral v0.2. Stay tuned for additional pre-quantized models!

We've also shipped a Gradio interface to make it easier to test these models with the new ONNX Runtime Generate() API. Be sure to check out our Build sessions to learn more; see below for details.

See here to learn what our hardware vendor partners have to say:

AMD: https://community.amd.com/t5/ai/reduce-memory-footprint-and-improve-performance-running-llms-on/ba-p/686157
Intel: https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Accelerating-Language-Models-Intel-and-Microsoft-Collaborate-to/post/1598013
NVIDIA: https://blogs.nvidia.com/blog/microsoft-build-optimized-ai-developers

[HEADING=1]What is quantization?[/HEADING]

Memory bandwidth is often a bottleneck for getting models to run on entry-level and older hardware, especially when it comes to language models. This means that making language models smaller directly translates to increasing the breadth of devices developers can target.

There has been a lot of research into reducing model size through quantization, a process that reduces the precision, and therefore the size, of model weights. Our goal is to ensure scalability while also maintaining model accuracy, so we integrated support for models that have had Activation-aware Weight Quantization (AWQ) applied to them. AWQ is a technique that lets us reap the memory savings from quantization with only a minimal impact on accuracy. AWQ achieves this by identifying the top 1% of salient weights that are needed to maintain model accuracy and quantizing the remaining 99% of weights. This leads to much less accuracy loss than other quantization techniques. A simplified sketch of the idea appears just before the perplexity table below.

The average person reads up to 5 words per second. Thanks to the significant memory wins from AWQ, Phi-3 mini runs at this speed or faster on older discrete GPUs and even laptop integrated GPUs. This translates into being able to run Phi-3 mini on hundreds of millions of devices! Check out our Build talk below to see this in action!

[HEADING=1]Perplexity measurements[/HEADING]

Perplexity is a measure used to quantify how well a model predicts a sample. Without getting into the math of it all, a lower perplexity score means the model is more certain about its predictions and suggests that the model's probability distribution is closer to the true distribution of the data. Perplexity can be thought of as the average number of branches in front of a model at each decision point. At each step, a lower perplexity means the model has fewer, more confident choices to make, which reflects a more refined understanding of the topic. A higher perplexity means more, less confident choices, and therefore output that is less predictable, less relevant, and more variable in quality.

As you can see below, our data shows that AWQ leads to only a small loss in model accuracy, measured as a small increase in perplexity. In return, using AWQ means 4x smaller model weights, leading to a dramatic increase in the number of devices that can run Phi-3 mini!
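To make the quantization idea above concrete, here is a minimal Python sketch of 4-bit weight quantization that keeps a small fraction of "salient" columns at full precision. This is only an illustration under simplified assumptions: saliency is faked here as column magnitude, and the salient weights are stored separately, whereas real AWQ selects salient channels from activation statistics and rescales them. The shapes and the 1% fraction are just example values.

[CODE]
import numpy as np

# Simplified illustration of the AWQ idea described above: keep the ~1% most
# "salient" weight columns at higher precision and quantize the rest to 4 bits.
# Real AWQ picks salient channels from activation statistics and rescales them
# rather than storing them separately; this is only a sketch.

rng = np.random.default_rng(0)
weights = rng.normal(size=(4096, 4096)).astype(np.float16)  # one fp16 layer, ~32 MB

# Pretend saliency is just the average column magnitude.
importance = np.abs(weights).mean(axis=0)
salient_cols = np.argsort(importance)[-int(0.01 * weights.shape[1]):]  # top 1%

# Symmetric 4-bit quantization of the remaining 99%, one scale per column.
rest = np.delete(weights, salient_cols, axis=1)
scales = np.abs(rest).max(axis=0) / 7.0          # int4 values span [-8, 7]
quantized = np.clip(np.round(rest / scales), -8, 7).astype(np.int8)

# Rough storage: 4 bits per quantized weight, plus fp16 scales and salient
# columns, versus 16 bits per weight originally -- close to a 4x reduction.
int4_bytes = quantized.size * 0.5 + scales.size * 2 + salient_cols.size * weights.shape[0] * 2
fp16_bytes = weights.size * 2
print(f"~{fp16_bytes / int4_bytes:.1f}x smaller")
[/CODE]

For readers who want the formula behind the perplexity scores in the table below: perplexity is the exponential of the average negative log-likelihood the model assigns to each token of a test set. A minimal sketch, with made-up per-token probabilities purely for illustration:

[CODE]
import math

# Hypothetical probabilities a model assigned to the actual next token at each
# step of a test sample (made-up values, for illustration only).
token_probs = [0.25, 0.10, 0.60, 0.05, 0.33]

# Perplexity = exp(average negative log-likelihood per token).
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)

# Roughly: "on average the model hesitated between about this many equally likely branches."
print(f"perplexity = {perplexity:.2f}")
[/CODE]

The table below reports this kind of score for the base models and their AWQ-quantized counterparts on the wikitext2 and ptb datasets.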
Model variant     Dataset     Base model perplexity   AWQ perplexity   Difference
Phi-3 mini 128k   wikitext2   14.42                   14.81            0.39
Phi-3 mini 128k   ptb         31.39                   33.63            2.24
Phi-3 mini 4k     wikitext2   15.83                   16.52            0.69
Phi-3 mini 4k     ptb         31.98                   34.30            2.32

[HEADING=1]Learn more[/HEADING]

Be sure to check out these sessions at Build to learn more:

BRK240: Bring AI experiences to all your Windows Devices
BRK247: Create Generative AI experiences using Phi
LAB371: Test Drive AI on Windows with DirectML, ONNX Runtime, and Olive

[HEADING=1]Get Started[/HEADING]

Check out the ONNX Runtime Generate() API repo to get started today: GitHub - microsoft/onnxruntime-genai: Generative AI extensions for onnxruntime

See here for our chat app with a handy Gradio interface: onnxruntime-genai/examples/chat_app at main · microsoft/onnxruntime-genai

This lets developers choose from different types of language models that work best for their specific use case. A minimal Python sketch of the Generate() API appears after the driver notes below. Stay tuned for more!

[HEADING=1]Drivers[/HEADING]

We recommend upgrading to the latest drivers for the best performance.

AMD: improved driver acceleration for generative AI, including large language models (AMD Software: Adrenalin Edition 23.40.27.06 for DirectML)
Intel: Intel is excited to partner with Microsoft and provide a driver optimized for these AWQ scenarios across a wide range of hardware – please download our publicly available WHQL-certified driver with full support today, available here
NVIDIA: R555 Game Ready, Studio, or NVIDIA RTX Enterprise
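To give a feel for the Generate() API mentioned in the Get Started section, here is a minimal Python sketch based on the examples in the onnxruntime-genai repo. The model path and prompt are placeholders, and the exact calls can differ between releases, so treat this as a sketch rather than a definitive reference.

[CODE]
import onnxruntime_genai as og

# Path to a downloaded DirectML ONNX model folder (placeholder path).
model = og.Model("path/to/phi-3-mini-directml-int4-awq")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Phi-3-style chat prompt (adjust the template for other models).
prompt = "<|user|>\nWhat does AWQ quantization do?<|end|>\n<|assistant|>\n"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = input_tokens

# Stream tokens to the console as they are generated.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()
[/CODE]

The chat_app example linked above wraps the same loop in a Gradio UI so you can switch between the pre-quantized models interactively.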