LLM Quantization: Tools & Techniques

Netra Prasad Neupane
5 min read · Nov 23, 2024


Quantization is the process of reducing the size and memory footprint of deep learning models, making them more suitable for deployment in resource-constrained environments (limited CPU/GPU and memory) without significantly affecting performance. With the help of quantization, it is possible to run a model with billions of parameters on your laptop. If you are new to quantization, please go through the following blog post first; it will help you understand quantization and its types.

In this blog, I will cover different quantization techniques and the tools that can be used to quantize models. Let's talk about the techniques first.

Quantization Techniques

There are several quantization techniques, but in this blog we will focus on QLoRA, GPTQ, GGML/GGUF, and AWQ:

Fig 1: Tools and techniques for LLM quantization

QLoRA

QLoRA is a quantization-based fine-tuning technique used to efficiently adapt pre-trained LLMs using low-rank adaptation (LoRA). LoRA is a parameter-efficient fine-tuning (PEFT) technique that reduces memory requirements by freezing the base LLM's weights and fine-tuning a small set of additional weights, called adapters. QLoRA carries out quantization through two mechanisms: the 4-bit NormalFloat (NF4) data type and double quantization (DQ); a minimal configuration sketch follows after the list below.

  • NF4 is a specialized 4-bit numerical format optimized for storing neural network weights. It normalizes each weight to a value between -1 and 1, giving a more accurate representation of low-precision weight values than a conventional 4-bit float.
  • Double quantization (DQ) applies a second round of quantization to the quantization constants produced by the first round. It is used to achieve an extra level of compression and efficiency in large language models.
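
Here is a minimal sketch of a QLoRA-style setup using the Hugging Face transformers, bitsandbytes, and peft libraries; the model id and LoRA hyperparameters are illustrative choices, not values from this post:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with double quantization enabled
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,     # second quantization pass over the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters: only these small matrices are trained; the 4-bit base weights stay frozen
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of target layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```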

GPTQ

GPTQ stands for post-training quantization for generative pre-trained transformers. This quantization technique is designed to reduce the size of models so that they can run on a single GPU. GPTQ works through a form of layer-wise quantization: an approach that quantizes the model one layer at a time, with the aim of discovering the quantized weights that minimize the output error, i.e., the mean squared error (MSE) between the outputs of the original full-precision layer and the quantized layer.
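
To make that objective concrete, here is a small NumPy sketch (not the actual GPTQ algorithm, which compensates the quantization error as it goes) showing the layer-wise MSE that GPTQ tries to minimize; the shapes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 512))   # calibration activations feeding one layer
W = rng.normal(size=(512, 512))  # full-precision layer weights

# Simple symmetric 4-bit round-to-nearest quantization (GPTQ improves on this
# by adjusting the not-yet-quantized weights to compensate the error)
scale = np.abs(W).max() / 7.0
W_q = np.clip(np.round(W / scale), -8, 7) * scale

# Layer-wise objective: MSE between the original and quantized layer outputs
mse = np.mean((X @ W - X @ W_q) ** 2)
print(f"layer output MSE after 4-bit quantization: {mse:.6f}")
```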

In the first step, the model's weights are transformed into a matrix, which is then processed in batches of 128 columns using a technique known as lazy batch updating. This process involves quantizing the weights in each batch, calculating the mean squared error (MSE), and adjusting the weights to minimize this error. Once the calibration batch has been processed, the remaining weights in the matrix are updated based on the MSE from the initial batch. Finally, the individual layers are recombined to produce the quantized model.
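
In practice, GPTQ quantization can be run through the Hugging Face transformers integration (which relies on the optimum/AutoGPTQ backend); a minimal sketch, assuming those packages are installed and using an illustrative small model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ; the "c4" dataset is used as calibration data for the layer-wise updates
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,  # quantization runs during loading
    device_map="auto",
)
model.save_pretrained("opt-125m-gptq-4bit")
```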

GGML/GGUF

GGML stands for Georgi Gerganov Machine Learning. It is a C-based machine learning library designed to support weight quantization of Llama models, quantizing them into the GGML binary format so that they can run on a CPU.

GGML quantizes models through a process called the k-quant system, which uses value representations of different bit widths depending on the chosen quant method. Depending on the selected quant method, the most important weights are quantized to a higher-precision data type, while the rest are assigned a lower-precision type. For example, the q2_k quant method converts the largest weights to 4-bit integers and the remaining weights to 2-bit integers.

GGUF (GPT-Generated Unified Format) is the successor to GGML and is designed to address GGML's main limitation: it supports only Llama models. GGUF enables the quantization of non-Llama models. The method begins by grouping similar weights from the model into smaller, more manageable clusters; each weight group is then quantized uniformly.

GGUF quantization is particularly useful in scenarios where the trade-off between model size and accuracy needs to be carefully managed.
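
As a quick illustration, here is a minimal sketch of loading and running a GGUF-quantized model on CPU with the llama-cpp-python bindings; the model path, quant variant, and parameter values are illustrative:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # a k-quant 4-bit GGUF file
    n_ctx=2048,   # context window
    n_threads=8,  # CPU threads to use
)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```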

AWQ

AWQ (Activation-aware Weight Quantization) is an advanced quantization technique in which weights are quantized while taking into account the distribution of the model's activations. Unlike traditional weight-only quantization, which focuses only on the weights themselves, activation-aware weight quantization aims to optimize both the weights and their interactions with the activations.

In this method, the quantization of weights is adjusted based on the distribution of activations in the model. Weights interacting with higher-variance activations are quantized at higher precision, while weights associated with low-variance activations are quantized at lower precision. AWQ also minimizes the mean squared error (MSE) between the activations of the original and quantized models, ensuring that quantization does not degrade model performance.
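
For a concrete picture, here is a minimal quantization sketch using the AutoAWQ package (covered in the tools section below); the model id and quantization config values are illustrative:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative model id
quant_path = "mistral-7b-awq-4bit"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Activation-aware quantization: calibration data is used to decide how weights are scaled
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```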

If you want to run LLMs on your own laptop, read the following blog.

Quantization Tools

Different quantization tools implement their own quantization techniques. Let's go through the tools one by one:

  • llama.cpp: llama.cpp is both a quantization and an inference tool for large language models. It allows you to quantize models into the GGML/GGUF format (weight quantization in particular) and enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud.
  • GPTQ-for-LLaMa: This library does 4-bit quantization of LLaMA models using the GPTQ weight quantization method. It supports only Linux machines.
  • AutoGPTQ: AutoGPTQ is based on the GPTQ (weight-only quantization) algorithm. It supports various open-source LLM architectures such as Falcon, LLaMA, and BLOOM.
  • bitsandbytes: bitsandbytes is a lightweight Python wrapper around custom CUDA functions that supports 8-bit and 4-bit quantization (a short loading sketch follows after this list).
  • Quanto: Quanto is a PyTorch quantization toolkit available as an Optimum backend. The quantization method used is linear quantization. It supports quantization of both activations and weights to 8-bit float, 8-bit integer, 4-bit integer, and 2-bit integer, and it also supports quantization-aware training (QAT).
  • llm-awq: This library supports activation-aware LLM quantization. It is fast and accurate for low-bit weight quantization (INT3/4) of LLMs, and it supports instruction-tuned models and multi-modal LLMs.
  • AutoAWQ: AutoAWQ is an easy-to-use package for 4-bit quantized models. Compared to FP16, it speeds up models by about 3x and reduces memory requirements by about 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs.
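
As a quick example of the bitsandbytes route mentioned above, here is a minimal sketch of loading a model with 8-bit weights through transformers; the model id is illustrative:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weight quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",              # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
print(model.get_memory_footprint())   # compare against the FP16 footprint
```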

Choosing the right quantization technique and tool can be a daunting task, but with a bit of guidance, you can make an informed decision based on your specific needs. For weight quantization, I personally prefer llama.cpp due to its simplicity and strong community support, which makes it a go-to option for many. For activation-aware quantization, I rely on llm-awq. As for other quantization methods, I plan to experiment with various tools and share my findings with you, so stay tuned for updates on best practices in the field.

I hope you found this blog insightful and learned about LLM quantization tools and techniques. In the next blog, I will give step-by-step instructions on weight and activation quantization and on evaluating quantized models. See you in the next blog soon!

Additionally, I’d love to hear your feedback and any tips or experiences you have to share. Don’t forget to clap if you found this useful, and please share it with others who might benefit. Your support and insights help me improve!
