Recent advances in low-bit quantization for LLMs, such as AQLM and AutoRound, now show acceptable levels of degradation on downstream tasks, especially for large models. That said, 2-bit quantization still generally introduces noticeable accuracy loss.
One promising algorithm for low-bit quantization is VPTQ (MIT license), proposed by Microsoft. It was released in October 2024 and has since shown excellent accuracy and efficiency in quantizing large models.
In this article, we will:
- Review the VPTQ quantization algorithm.
- Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B.
- Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
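Before diving in, it helps to see the core idea behind the "VQ" in VPTQ: weights are grouped into short vectors, and each vector is replaced by the index of its nearest centroid in a shared codebook. The sketch below is a toy illustration with random data and a random codebook; real VPTQ learns its centroids with a second-order (Hessian-aware) objective, which this sketch omits, and the sizes used here are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

vector_dim = 8        # group 8 consecutive weights into one vector
codebook_size = 256   # 2**8 centroids -> an 8-bit index per vector

# 8 bits per 8 weights = 1 bit per weight on average (ignoring the small
# cost of storing the codebook itself). A 2**16-entry codebook over
# 8-dim vectors would give 2 bits per weight instead.
bits_per_weight = np.log2(codebook_size) / vector_dim
print(bits_per_weight)  # 1.0

# Fake "weight matrix" already reshaped into vectors.
weights = rng.standard_normal((1024, vector_dim))

# A random codebook stands in for the learned centroids.
codebook = rng.standard_normal((codebook_size, vector_dim))

# Quantize: assign each weight vector to its nearest centroid
# (Euclidean distance). Only the indices need to be stored.
distances = np.linalg.norm(weights[:, None, :] - codebook[None, :, :], axis=-1)
indices = distances.argmin(axis=1)

# Dequantize at inference time: a simple codebook lookup.
dequantized = codebook[indices]
print(indices.shape, dequantized.shape)
```

This also explains naming patterns like "v8-k65536" seen on published VPTQ checkpoints: vectors of dimension 8 with a 65,536-entry codebook, i.e., 16 bits per 8 weights, or 2 bits per weight.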
Remarkably, 2-bit quantization with VPTQ nearly matches the performance of the original 16-bit model on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, while using less memory than a 70B model!
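A back-of-the-envelope calculation supports that last claim. The sketch below counts only the weights themselves; a real deployment also needs memory for the codebooks, the KV cache, and activations, so treat these numbers as rough lower bounds.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3.1 405B at ~2 bits per weight vs. a 70B model at 16 bits:
llama_405b_2bit = weight_memory_gb(405, 2)    # ~101 GB
llama_70b_16bit = weight_memory_gb(70, 16)    # 140 GB
print(llama_405b_2bit, llama_70b_16bit)
```

So even before any further optimization, the 2-bit 405B weights fit in roughly 101 GB, well under the 140 GB that a 16-bit 70B model requires.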