Introduction to Large Language Models (LLMs) Quantization

Netra Prasad Neupane
6 min read · Nov 14, 2024


Large language models not only have a huge number of parameters and are trained on massive text datasets, but they also consume substantial resources because of their parameter count and model size. This poses significant challenges for deploying the models in low-resource environments/devices and can hurt inference speed and real-time performance. Quantization comes into play to address this problem. In this blog, I will walk you through LLM quantization and the different types of quantization.

Introduction

Quantization, in the context of LLMs, is the process of reducing the precision of model parameters, such as weights and biases, while maintaining performance. For example, reducing the parameters of Llama-2-7B from 32-bit floating-point to 8-bit integers.

fig 1: Quantization ( source: https://inside-machinelearning.com/)

Let me explain quantization using the following three examples:

  • The LLaMA3.1-70B model with FP32 precision requires a massive 336 GB of VRAM for inference, necessitating multiple high-end GPUs. However, through 4-bit quantization, its memory footprint can be reduced to a much more manageable 42 GB (an 8× reduction, roughly 87% less), allowing it to run efficiently on a single A100 GPU. This demonstrates the transformative power of quantization in making large models accessible to a wider range of users and hardware (a rough memory estimate is sketched after these examples).
  • Kathmandu is one of the most densely populated cities in the world. Imagine each house in the city representing a parameter in the LLM. Quantization transforms this city into a more manageable one by keeping only the most important houses (parameters) and replacing the others with smaller ones (lower precision), or alternatively by removing the least important houses and creating open space (zeros) in between. It then becomes easier and faster to reach any house in the resulting city.
fig 2: Quantization Example
  • Mindset is the best-known book by Dr. Carol Dweck. Imagine each paragraph in the book representing a parameter in the LLM. Quantization transforms this book into a short summary consisting of only a few paragraphs (parameters). The summary takes up less space and is easier to read and understand, but it might lose some of the information present in the original book.
fig 3: Quantization example

In the above examples, we convert the dense model into a sparser model, i.e., the resulting model has a large number of zero-valued parameters, which makes it more computationally efficient and faster to process. Additionally, the open space (zeros) between the retained parameters reduces the overall model size and complexity, further enhancing efficiency.
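To make the first example concrete, here is a rough back-of-the-envelope sketch in plain Python that estimates the weight-only memory footprint of a 70B-parameter model at different precisions. It counts only the weights; the 336 GB and 42 GB figures quoted above are higher because real inference also needs memory for activations, the KV cache, and framework overhead.

```python
# Rough estimate of weight memory for an LLM at different precisions.
# Assumption: memory ≈ num_parameters * bytes_per_parameter; activations,
# KV cache, and framework overhead are ignored in this sketch.

NUM_PARAMS = 70e9  # e.g. a 70B-parameter model such as LLaMA3.1-70B

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return approximate weight storage in GB for the given precision."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(NUM_PARAMS, bits):,.0f} GB")

# Expected output (weights only):
# 32-bit: ~280 GB
# 16-bit: ~140 GB
#  8-bit: ~70 GB
#  4-bit: ~35 GB
```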

Quantization Strategies:

There are several strategies for applying quantization to different components of a deep learning model.

  1. Weights: Reduces model size and memory footprint
  2. Activations: Saves memory, especially when combined with weight quantization
  3. KV Cache: Speeds up sequence generation for models handling long sequences
  4. Gradients: Reduces communication overhead in large-scale training and potentially accelerates training

Types of Quantization

Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT)

Quantization can be performed at different points in time. There are two LLM quantization techniques, distinguished by when the quantization happens:

fig 4: QAT (Quantization-Aware Training) vs PTQ (Post-Training Quantization)
  1. Post-Training Quantization (PTQ): This method converts the weights of an already trained model to a lower precision without any retraining. It is comparatively easy to implement, but it might degrade the model’s performance slightly due to the loss of precision (a minimal PTQ sketch follows this list).
  2. Quantization-Aware Training (QAT): Unlike PTQ, QAT integrates the weight-conversion process into the training stage. This often results in better model performance, but it is more computationally demanding.
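As a concrete illustration of PTQ, below is a minimal sketch of loading an already-trained model directly in 4-bit precision with Hugging Face transformers and bitsandbytes. It assumes transformers, accelerate, bitsandbytes, and torch are installed and a CUDA GPU is available; the model ID is only a placeholder, and this is a sketch rather than a production setup.

```python
# Minimal PTQ sketch: load an already-trained model directly in 4-bit.
# Assumes: pip install transformers accelerate bitsandbytes, plus a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16  # dtype used for matmuls at runtime
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available devices
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```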

Naive Quantization vs K-means Quantization

Quantization can also be divided into two types based on the technique used to reduce the precision of the parameters, i.e., the weights and biases:

  1. Naive Quantization: Naive quantization uniformly reduces the precision of all parameters. It is also called linear quantization. It is similar to dividing a densely populated city into equal regions without considering the locations of the houses: some regions may end up with many houses and others with none.

Mathematically, the following steps are required to perform linear quantization (a NumPy sketch follows the list):

  • Calculate the minimum and maximum values: For each tensor, we get the minimum and maximum values to define the range of floating-point values to quantize. The target data type gives the minimum and maximum of the quantized range; for example, for an unsigned 8-bit integer the range is 0 to 255.
  • Compute the scale (s) and the zero-point (z) values: The scale maps the range of floating-point values onto the integer range. The zero-point ensures that zero in the floating-point range is exactly represented by an integer, maintaining numerical accuracy and stability, especially for values close to zero.
  • Quantize the values (q): This step maps floating-point values to a lower-precision integer range using the scale s and zero-point z computed in the previous step. The rounding operation ensures that the final result is a discrete integer, suitable for storage and computation in lower-precision formats.
  • Dequantize the values: During inference, dequantized values are used for calculations to recover higher precision, although only the quantized weights are stored. This step also lets you measure the quantization error.
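Putting the four steps together, here is a minimal NumPy sketch of asymmetric linear quantization to unsigned 8-bit integers. It follows the standard affine scheme s = (x_max − x_min) / (q_max − q_min), z = round(q_min − x_min / s), q = round(x / s + z), and is for illustration only.

```python
import numpy as np

def linear_quantize(x: np.ndarray, num_bits: int = 8):
    """Asymmetric linear quantization of a float tensor to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1            # e.g. 0..255 for 8-bit
    xmin, xmax = float(x.min()), float(x.max())  # step 1: floating-point range

    s = (xmax - xmin) / (qmax - qmin)            # step 2: scale
    z = round(qmin - xmin / s)                   # step 2: zero-point

    q = np.clip(np.round(x / s + z), qmin, qmax).astype(np.uint8)  # step 3
    return q, s, z

def linear_dequantize(q: np.ndarray, s: float, z: int) -> np.ndarray:
    """Step 4: approximate reconstruction of the original floats."""
    return s * (q.astype(np.float32) - z)

weights = np.random.randn(4, 4).astype(np.float32)
q, s, z = linear_quantize(weights)
recovered = linear_dequantize(q, s, z)
print("max quantization error:", np.abs(weights - recovered).max())
```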

2. K-means Quantization: K-means quantization creates clusters based on the actual locations of the data points (houses), resulting in a more accurate and efficient representation. In this method, representative points (centroids) are chosen and every data point is assigned to its nearest centroid. Some k-means quantization schemes include an additional pruning step, where parameters whose values fall below a certain threshold are set to zero (removing the house), which increases the model’s sparsity (a clustering sketch follows the figure below).

fig 5: Naive vs K-means quantization
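For comparison, here is a minimal sketch of k-means weight quantization using scikit-learn’s KMeans: the weights are replaced by a small codebook of centroids plus a per-weight integer index. It assumes NumPy and scikit-learn are installed, and the choice of 16 clusters (a 4-bit codebook) is just an example.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights: np.ndarray, n_clusters: int = 16):
    """Cluster weights into a small codebook (16 centroids ~ 4-bit codes)."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    codebook = km.cluster_centers_.flatten()  # shared centroid values
    codes = km.labels_.astype(np.uint8)       # one small index per weight
    return codebook, codes

def kmeans_dequantize(codebook, codes, shape):
    """Rebuild an approximate weight tensor by looking up each code's centroid."""
    return codebook[codes].reshape(shape)

w = np.random.randn(64, 64).astype(np.float32)
codebook, codes = kmeans_quantize(w)
w_hat = kmeans_dequantize(codebook, codes, w.shape)
print("mean absolute error:", np.abs(w - w_hat).mean())
```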

Weight-only vs Weight-Activation vs KV Cache Quantization

  1. Weight-only Quantization: In this technique, only the weights of a neural network (the parameters learned during training) are quantized, while other components, such as activations, remain in their original precision (usually 16-bit or 32-bit floating point).
  2. Weight-Activation Quantization: This technique reduces the precision of both the weights (parameters of the network) and the activations (the outputs of neurons during forward propagation). This approach provides significant computational and memory efficiency gains compared to weight-only quantization (a sketch contrasting the two follows this list).
  3. KV Cache Quantization: This technique is used in transformer-based models, particularly during inference. It involves compressing the key and value tensors stored in the KV cache, the part of the attention mechanism that lets the model reuse previously computed keys and values during generation instead of recomputing them for every new token.
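To contrast the first two options, the following NumPy sketch compares a weight-only int8 matrix multiply (the weights are dequantized back to float before the matmul, activations stay in float32) with weight-activation (W8A8) quantization, where both operands are int8 and the product is accumulated in int32. It uses a simple symmetric scheme with no zero-point and is purely illustrative.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric int8 quantization: scale by the maximum absolute value."""
    s = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / s), -127, 127).astype(np.int8)
    return q, s

W = np.random.randn(8, 8).astype(np.float32)   # weights
A = np.random.randn(8, 8).astype(np.float32)   # activations

# Weight-only quantization: store W in int8, dequantize before the matmul;
# activations stay in float32.
qW, sW = quantize_int8(W)
y_weight_only = A @ (qW.astype(np.float32) * sW)

# Weight-activation (W8A8) quantization: both W and A are int8; the integer
# matmul is accumulated in int32 and rescaled by the product of the two scales.
qA, sA = quantize_int8(A)
y_w8a8 = (qA.astype(np.int32) @ qW.astype(np.int32)) * (sA * sW)

y_ref = A @ W
print("weight-only error:", np.abs(y_ref - y_weight_only).max())
print("w8a8 error:       ", np.abs(y_ref - y_w8a8).max())
```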

Pros and Cons of Quantization

Pros

  • Faster inference: With lower-precision parameters, computation becomes more efficient, so inference becomes faster.
  • Inference at the edge: Running inference at the edge can be helpful because it can make things faster, keep data more private, and use less bandwidth.
  • Lower memory requirements: The model size is decreased, so less memory is needed to load the model and run inference.
  • Smaller model size: Thanks to the smaller model size, the model can be deployed in resource-constrained environments.

Cons

  • Potential loss in performance: The model’s performance can degrade because high-precision parameters are squeezed into lower precision.

I hope you now understand LLM quantization more clearly. In the next blog, I will explain the tools and techniques for LLM quantization. Maybe it will already be published by the time you are reading this one :)

Additionally, I’d love to hear your feedback and any tips or experiences you have to share. Don’t forget to clap if you found this useful, and please share it with others who might benefit. Your support and insights help me improve!
