Quantization with DirectML helps you scale further on Windows

Patrice Vignola · May 27, 2024

DirectML support for Phi-3 mini launched last month, and we’ve since made several improvements that unlock more models and even better performance! Developers can grab pre-quantized versions of Phi-3 mini (with variants for the 4k and 128k versions). They can now also get Phi-3 medium (4k and 128k) and Mistral v0.2. Stay tuned for additional pre-quantized models! We’ve also shipped a gradio interface to make it easier to test these models with the new ONNX Runtime Generate() API. Be sure to check out our Build sessions to learn more; see below for details. See here to learn what our hardware vendor partners have to say.

What is quantization?

Memory bandwidth is often the bottleneck when running models on entry-level and older hardware, especially language models. Making language models smaller therefore directly increases the breadth of devices developers can target. There’s been a lot of research into reducing model size through quantization, a process that reduces the precision, and therefore the size, of model weights.

Our goal is to ensure scalability while also maintaining model accuracy, so we integrated support for models that have had Activation-aware Weight Quantization (AWQ) applied to them. AWQ is a technique that lets us reap the memory savings from quantization with only a minimal impact on accuracy. It achieves this by identifying the top 1% of salient weights that are needed for maintaining model accuracy and protecting them while the remaining 99% of weights are quantized. This leads to much less accuracy loss with AWQ compared to other techniques.

The average person reads up to 5 words/second. Thanks to the significant memory wins from AWQ, Phi-3-mini runs at this speed or faster on older discrete GPUs and even laptop integrated GPUs. This translates into being able to run Phi-3-mini on hundreds of millions of devices! Check out our Build talk below to see this in action!
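To make the memory argument above concrete, here is a minimal sketch in plain Python. It is not AWQ and not what DirectML or ONNX Runtime does internally: it uses simple round-to-nearest 4-bit quantization on one group of weights, then a back-of-the-envelope estimate that assumes generation is purely bandwidth-bound (every weight read once per token). The function names, the group size of 128, the 3.8 billion parameter count for Phi-3-mini, and the 50 GB/s bandwidth figure are all assumptions made for illustration.

```python
import random

def quantize_group(weights, bits=4):
    """Round-to-nearest asymmetric quantization of one group of floats."""
    levels = (1 << bits) - 1                 # 15 steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 if the group is constant
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]

def weight_bytes(num_params, bits_per_weight):
    """Storage for the weights alone; per-group scales and offsets are ignored."""
    return num_params * bits_per_weight / 8

def tokens_per_second(num_params, bits_per_weight, bandwidth_bytes_per_s):
    """Rough upper bound if every weight is read once per generated token."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bits_per_weight)

random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(128)]  # one hypothetical group of weights
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"codes fit in 4 bits: {0 <= min(codes) and max(codes) <= 15}")
print(f"max round-trip error {max_err:.5f} vs half a step {scale / 2:.5f}")

PARAMS = 3.8e9      # assumed parameter count for Phi-3-mini
BANDWIDTH = 50e9    # hypothetical 50 GB/s of memory bandwidth
for name, bits in (("fp16", 16), ("4-bit", 4)):
    gb = weight_bytes(PARAMS, bits) / 1e9
    tps = tokens_per_second(PARAMS, bits, BANDWIDTH)
    print(f"{name}: {gb:.1f} GB of weights, at most ~{tps:.1f} tokens/s")
```

Under these assumptions the weights shrink from about 7.6 GB to about 1.9 GB, and the bandwidth-bound ceiling rises from roughly 6.6 to roughly 26.3 tokens per second. Tokens and words are not the same unit, and real throughput depends on much more than weight reads, so treat the comparison with reading speed as a rough guide only.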

Perplexity measurements

Perplexity is a measure used to quantify how well a model predicts a sample. Without getting into the math of it all, a lower perplexity score means the model is more certain about its predictions and suggests that the model's probability distribution is closer to the true distribution of the data. Perplexity can be thought of as the average number of branches in front of a model at each decision point. A lower perplexity means the model has fewer, more confident choices to make at each step, which reflects a more refined understanding of the topic. A higher perplexity means more, less confident choices, and therefore output that is less predictable, less relevant, or more uneven in quality. As you can see below, our data shows that AWQ costs only a small increase in perplexity, which corresponds to a small loss in model accuracy. In return, using AWQ means 4x smaller model weights, leading to a dramatic increase in the number of devices that can run Phi-3-mini!

Model variant	Dataset	Base model perplexity	AWQ perplexity	Difference
Phi-3 mini 128k	wikitext2	14.42	14.81	0.39
Phi-3 mini 128k	ptb	31.39	33.63	2.24
Phi-3 mini 4k	wikitext2	15.83	16.52	0.69
Phi-3 mini 4k	ptb	31.98	34.30	2.32
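As a rough illustration of the definition and the table above, the sketch below computes perplexity as the exponential of the average negative log-probability a model assigns to the correct tokens, then expresses each measured difference as a relative increase. The probability lists are made-up toy values, and the helper names are hypothetical; only the four perplexity pairs come from the table.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability given to the correct tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def relative_increase(base, quantized):
    """Fractional change in perplexity after quantization."""
    return (quantized - base) / base

# A model that spreads its belief evenly over k options has perplexity k,
# which is where the "number of branches" intuition comes from.
print(f"uniform over 4 choices:  {perplexity([1 / 4] * 10):.2f}")
print(f"uniform over 16 choices: {perplexity([1 / 16] * 10):.2f}")

# Toy probabilities (not measured data): confident vs hesitant predictions.
confident = perplexity([0.9, 0.8, 0.95, 0.7])
hesitant = perplexity([0.3, 0.2, 0.25, 0.1])
print(f"confident model: {confident:.2f}, hesitant model: {hesitant:.2f}")

# Base and AWQ perplexities copied from the table.
rows = [
    ("Phi-3 mini 128k", "wikitext2", 14.42, 14.81),
    ("Phi-3 mini 128k", "ptb", 31.39, 33.63),
    ("Phi-3 mini 4k", "wikitext2", 15.83, 16.52),
    ("Phi-3 mini 4k", "ptb", 31.98, 34.30),
]
for variant, dataset, base, awq in rows:
    pct = 100 * relative_increase(base, awq)
    print(f"{variant:16s} {dataset:10s} +{pct:.1f}%")
```

Read this way, the measured increases fall between roughly 3% and 7% of the base model's perplexity, in exchange for 4x smaller weights.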

Learn more

Be sure to check out these sessions at Build to learn more:

BRK240: Bring AI experiences to all your Windows Devices
BRK247: Create Generative AI experiences using Phi
LAB371: Test Drive AI on Windows with DirectML, ONNX Runtime, and Olive

Get Started

Check out the ONNX Runtime Generate() API repo to get started today: microsoft/onnxruntime-genai on GitHub (Generative AI extensions for onnxruntime).

See here for our chat app with a handy gradio interface: onnxruntime-genai/examples/chat_app in the microsoft/onnxruntime-genai repo. This lets developers choose the type of language model that works best for their specific use case. Stay tuned for more!

Drivers

We recommend upgrading to the latest drivers for the best performance.

AMD: improved driver acceleration for generative AI including large language models (AMD Software: Adrenalin Edition 23.40.27.06 for DirectML)
Intel: Intel is excited to partner with Microsoft and provide a driver optimized for these AWQ scenarios across a wide range of hardware. Please download our publicly available, WHQL-certified driver with full support today.
NVIDIA: R555 Game Ready, Studio or NVIDIA RTX Enterprise
