Hands-on LLMs Quantization

Netra Prasad Neupane
Dec 12, 2024


Suppose you want to build a personal assistant on top of a large language model (LLM) that helps you manage daily tasks, meetings, expenses, medications, and more. However, you face a challenge: limited resources and compute power. Your PC has only 8 GB of RAM and an Intel Core i5 processor, which is not enough to run any open-source state-of-the-art (SOTA) LLM as-is. So what can you do about that?

fig: Quantization (source: rinf.tech)

This is where quantization helps. Quantization is the process of reducing the size and memory footprint of deep learning models, making them more suitable for deployment in resource-constrained environments (CPU/GPU and memory) without significantly affecting performance. With quantization, it becomes possible to run a model with billions of parameters on a modest laptop. If you are new to quantization, please go through the following article first; it will help you understand the intuition behind quantization and its types.
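
To build intuition, here is a toy sketch of symmetric 8-bit linear quantization in plain NumPy. It only illustrates the idea of mapping float weights to a small integer grid plus a scale factor; it is not how any particular library implements it.

import numpy as np

# Toy example of symmetric 8-bit linear quantization (illustration only).
weights = np.array([0.42, -1.30, 0.07, 2.15, -0.88], dtype=np.float32)

scale = np.abs(weights).max() / 127.0                   # map the largest magnitude to 127
q_weights = np.round(weights / scale).astype(np.int8)   # store as 1-byte integers
dequantized = q_weights.astype(np.float32) * scale      # recover approximate fp32 values

print("int8 weights :", q_weights)       # [ 25 -77   4 127 -52]
print("reconstructed:", dequantized)     # close to the original fp32 values
print("max abs error:", np.abs(weights - dequantized).max())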

Thanks to the massive growth of open-source development in AI/ML, a large number of tools and techniques have been built for quantizing deep learning models (e.g., LLMs). If you want a deeper understanding of the existing tools and techniques, please go through the following article first.

In this article we will use the following three popular libraries for quantization: Llama.cpp, bitsandbytes, and Quanto (optimum-quanto).

Using these tools, we will quantize popular open-source models such as Mistral-7B and Llama-3-8B. You can access the model files on Hugging Face: click the corresponding link and you will be redirected to the model repository. Let’s start our quantization journey :)

Llama.cpp

Llama.cpp is both a quantization and an inference tool for large language models. It lets you quantize models into the GGML/GGUF format, focusing on weight quantization, and it enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud.

First, let’s install Llama.cpp by cloning the open-source GitHub repository and building it from source.

# clone the llama.cpp library
!git clone https://github.com/ggerganov/llama.cpp

# build llama.cpp
!mkdir llama.cpp/build && cd llama.cpp/build && cmake .. && cmake --build . --config Release

Llama.cpp supports several quantization schemes, such as naive/linear quantization (e.g., Q4_0), k-quant mixtures (e.g., Q4_K_M), low bits-per-weight (bpw) quantization (e.g., IQ2_XS), and more.

# List the supported quantization types
!./llama.cpp/build/bin/llama-quantize --help

The command above lists the different types of quantization supported by Llama.cpp.

usage: ./llama.cpp/build/bin/llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
--imatrix file_name: use data in file_name as importance matrix for quant optimizations
--include-weights tensor_name: use importance matrix for this/these tensor(s)
--exclude-weights tensor_name: use importance matrix for this/these tensor(s)
--output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
--token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
--keep-split: will generate quantized model in the same shards as input
--override-kv KEY=TYPE:VALUE
Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
33 or Q4_0_4_4 : 4.34G, +0.4685 ppl @ Llama-3-8B
34 or Q4_0_4_8 : 4.34G, +0.4685 ppl @ Llama-3-8B
35 or Q4_0_8_8 : 4.34G, +0.4685 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing

As shown above, llama.cpp offers many quantization types. In this blog we will use 4-bit quantization (Q4_K_M) and 2-bit quantization (Q2_K).

What do Q#_K_M and Q#_K mean in quantized models?

In the context of llama.cpp, Q4_K_M refers to a specific type of k-means quantization method. The naming convention is as follows:

  • Q stands for Quantization.
  • 4 indicates the number of bits used in the quantization process.
  • K refers to the use of k-means clustering in the quantization.
  • M represents the size of the model after quantization.
    (S = Small, M = Medium, L = Large).

Similarly, Q2_K also refers to a specific k-means quantization type. The naming convention is as follows:

  • Q stands for Quantization.
  • 2 indicates the number of bits used in the quantization process.
  • K refers to the use of k-means clustering in the quantization.

Quantization using Llama.cpp

In this part of the blog, I will describe how to download a model from Hugging Face, quantize it, and run some performance tests.

Step 1: Download a model from Hugging Face
Here, I am using the Hugging Face Hub to download the model. To access the model you must have a Hugging Face account; create an access token with read privileges on the Hugging Face website. Here, we download the Mistral-7B-Instruct-v0.3 model.

import os
from pathlib import Path

from dotenv import load_dotenv
from huggingface_hub import snapshot_download

load_dotenv()

access_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")  # read the API token from .env

dest_mistral_models_path = Path.home().joinpath('mistral_models', 'Mistral-7B-Instruct-v0.3')
dest_mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistralai/Mistral-7B-Instruct-v0.3", repo_type="model", local_dir=dest_mistral_models_path, token=access_token)

Step 2: Convert the model to GGUF FP16 format
GGML is a tensor library developed by Georgi Gerganov for machine learning that enables large models and high performance on commodity hardware. FP16 is "half precision" (FP32 is full precision). In this step, we convert the model and reduce the weight precision from 32-bit to 16-bit floats.

# !python llama.cpp/convert_hf_to_gguf.py source_model_file --outtype f16 --outfile dest_model_file
!python llama.cpp/convert_hf_to_gguf.py ./mistral_models/Mistral-7B-Instruct-v0.3/ --outtype f16 --outfile ./mistral_models/quantized_models/Mistral-7B-Instruct-v0.3-f16.gguf

Step 3: Quantize the model to n bits
Now we will use the Mistral-7B-Instruct-v0.3-f16.gguf model as the starting point for further quantization.

  • 4-bit quantization: Here we quantize all the parameters of the Mistral-7B-Instruct-v0.3 model to 4 bits, using Mistral-7B-Instruct-v0.3-f16.gguf as the base model.
!cd llama.cpp/build/bin && ./llama-quantize ../../../mistral_models/quantized_models/Mistral-7B-Instruct-v0.3-f16.gguf ../../../mistral_models/quantized_models/Mistral-7B-Instruct-v0.3_Q4_K_M.gguf Q4_K_M
  • 2-bit quantization: Let’s quantize all the parameters of the Mistral-7B-Instruct-v0.3 model to 2 bits, again using Mistral-7B-Instruct-v0.3-f16.gguf as the base model (a quick size check of the resulting files follows after these commands).
!cd llama.cpp/build/bin && ./llama-quantize ../../../mistral_models/quantized_models/Mistral-7B-Instruct-v0.3-f16.gguf ../../../mistral_models/quantized_models/Mistral-7B-Instruct-v0.3_Q2_K.gguf Q2_K
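
Once both commands finish, you can compare the on-disk sizes of the f16 base model and the quantized variants. This small sketch assumes the output directory used in the commands above:

from pathlib import Path

# Compare the on-disk size of the f16 base model and the quantized GGUF files.
# The path assumes the output locations used in the commands above.
model_dir = Path("./mistral_models/quantized_models")
for gguf in sorted(model_dir.glob("*.gguf")):
    print(f"{gguf.name}: {gguf.stat().st_size / 1024**3:.2f} GB")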

Inference on quantized model

Now it’s time to run inference on the quantized model. To test the model’s performance, we run inference from the CLI.

!./llama.cpp/build/bin/llama-cli -m ./mistral_models/quantized_models/Mistral-7B-Instruct-v0.3_Q4_K_M.gguf -cnv -p "Why self-attention needed in transformer?"

If you want to learn how to perform the inference using python, please go through following article.
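
For a quick taste, here is a minimal sketch of Python inference using the llama-cpp-python bindings (pip install llama-cpp-python); the model path and parameters below are assumptions based on the quantized file produced in step 3.

# Minimal sketch of Python inference with the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral_models/quantized_models/Mistral-7B-Instruct-v0.3_Q4_K_M.gguf",
    n_ctx=4096,    # context window
    n_threads=8,   # adjust for your CPU
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Why is self-attention needed in a transformer?"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])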

Evaluate the performance of the quantized model using Llama.cpp

  • llama-batched-bench: batched bench measures the batched decoding performance of llama.cpp. According to the author, there are two modes of operation:
  • prompt not shared - each batch has a separate prompt of size PP (i.e. N_KV = B*(PP + TG))
  • prompt is shared - there is a common prompt of size PP used by all batches (i.e. N_KV = PP + B*TG)

Let’s try out batched bench on the f16 version of the model.

!cd llama.cpp/build/bin && ./llama-batched-bench -m ../../../mistral_models/quantized_models/Mistral-7B-Instruct-v0.3-f16.gguf -c 16384 -b 2048 -ub 512 -ngl 99 -ntg 128,256 -npl 1,2,4,8,16,32

According to the documentation, it should produce the following evaluation metrics in tabular form:

  • PP - prompt tokens per batch
  • TG - generated tokens per batch
  • B - number of batches
  • N_KV - required KV cache size
  • T_PP - prompt processing time (i.e. time to first token)
  • S_PP - prompt processing speed ((B*PP)/T_PP or PP/T_PP)
  • T_TG - time to generate all batches
  • S_TG - text generation speed ((B*TG)/T_TG)
  • T - total time
  • S - total speed (i.e. all tokens / total time)
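
To make the formulas concrete, here is a quick worked example with made-up numbers showing how the speed and KV-cache columns are derived:

# Worked example with made-up numbers for the batched-bench columns.
B, PP, TG = 4, 512, 128        # 4 batches, 512 prompt tokens, 128 generated tokens each
T_PP, T_TG = 2.0, 16.0         # hypothetical processing times in seconds

S_PP = (B * PP) / T_PP         # prompt processing speed -> 1024.0 t/s
S_TG = (B * TG) / T_TG         # text generation speed   ->   32.0 t/s
N_KV = B * (PP + TG)           # KV cache size when the prompt is not shared -> 2560
print(S_PP, S_TG, N_KV)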

However, when I ran the command above, it produced the following output on my machine; the benchmark table was printed without any rows, which makes it hard to interpret.

llama_model_loader: loaded meta data with 37 key-value pairs and 291 tensors from ../../../mistral_models/quantized_models/Mistral-7B-Instruct-v0.3-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Mistral 7B Instruct v0.3
llama_model_loader: - kv 3: general.version str = v0.3
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Mistral
llama_model_loader: - kv 6: general.size_label str = 7B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Mistral 7B v0.3
llama_model_loader: - kv 10: general.base_model.0.version str = v0.3
llama_model_loader: - kv 11: general.base_model.0.organization str = Mistralai
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mist...
llama_model_loader: - kv 13: llama.block_count u32 = 32
llama_model_loader: - kv 14: llama.context_length u32 = 32768
llama_model_loader: - kv 15: llama.embedding_length u32 = 4096
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 17: llama.attention.head_count u32 = 32
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: general.file_type u32 = 1
llama_model_loader: - kv 22: llama.vocab_size u32 = 32768
llama_model_loader: - kv 23: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 24: tokenizer.ggml.add_space_prefix bool = true
llama_model_loader: - kv 25: tokenizer.ggml.model str = llama
llama_model_loader: - kv 26: tokenizer.ggml.pre str = default
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 28: tokenizer.ggml.scores arr[f32,32768] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,32768] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 32: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {%- if messages[0]["role"] == "system...
llama_model_loader: - kv 36: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 226 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 7.25 B
llm_load_print_meta: model size = 13.50 GiB (16.00 BPW)
llm_load_print_meta: general.name = Mistral 7B Instruct v0.3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 13825.02 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CPU output buffer size = 4.00 MiB
llama_new_context_with_model: CPU compute buffer size = 1088.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 0, n_gpu_layers = 99, n_threads = 8, n_threads_batch = 8

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|

llama_perf_context_print: load time = 2539.40 ms
llama_perf_context_print: prompt eval time = 804.10 ms / 16 tokens ( 50.26 ms per token, 19.90 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 2539.41 ms / 17 tokens
  • Perplexity: Perplexity is a crucial metric for evaluating language models. It measures how well the model predicts a given set of data, i.e., its ability to predict the next word or character from the context of the previous ones. A lower perplexity score indicates that the model is better at predicting the next word, while a higher score suggests the model is more uncertain, or “perplexed”, about it.
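
As a reminder of where the number comes from, perplexity is the exponential of the average negative log-likelihood the model assigns to the reference tokens. A toy sketch with made-up probabilities:

import math

# Toy illustration: perplexity = exp(mean negative log-likelihood of the correct tokens).
token_probs = [0.25, 0.10, 0.60, 0.05]      # made-up probabilities of the correct tokens
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(f"perplexity = {perplexity:.2f}")      # ~6.0; 1.0 would mean perfect prediction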

Let’s calculate the perplexity score of the 4-bit quantized model. Within llama.cpp, the perplexity of base models is used primarily to judge the quality loss from, e.g., quantized models vs. FP16. The convention among contributors is to use the Wikitext-2 test set unless noted otherwise (it can be obtained with scripts/get-wikitext-2.sh).

!cd llama.cpp/build/bin && ./llama-perplexity -m ../../../mistral_models/quantized_models/Mistral-7B-Instruct-v0.3_Q4_K_M.gguf -f ../wiki.test.raw

It may take a few hours to compute the perplexity score of the quantized model on the WikiText data.

The lowest possible perplexity is 1, which occurs when the model perfectly predicts the test data (i.e., it assigns probability 1 to the correct next token in every case). The score also depends on the dataset used to measure it, so the same model can have different perplexity scores on different datasets.

If you face any problems implementing the quantization steps above, please refer to the code snippets in my GitHub repository.

bitsandbytes

bitsandbytes is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model’s performance. 4-bit quantization compresses a model even further, and it is commonly used with QLoRA to finetune quantized LLMs.
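
For reference, a 4-bit (NF4) configuration of the kind typically paired with QLoRA looks roughly like the sketch below; the exact options are assumptions you should tune for your own model and hardware.

import torch
from transformers import BitsAndBytesConfig

# Sketch of a typical 4-bit (NF4) bitsandbytes configuration, as used with QLoRA.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls at runtime
)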

Quantization using bitsandbytes

Let’s install bitsandbytes before jumping into quantization. We will quantize the meta-llama/Meta-Llama-3-8B model to 8 bits.

!pip install transformers accelerate "bitsandbytes>0.37.0"

We also need transformers, because the quantization happens while loading the model and we use the transformers library to load Llama-3-8B. The following lines import the necessary packages and quantize the model to 8-bit.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_file = "meta-llama/Meta-Llama-3-8B"

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weight quantization
tokenizer = AutoTokenizer.from_pretrained(model_file)
# device_map="auto" places the quantized weights on the GPU; bitsandbytes-quantized
# models must not be moved afterwards with .to("cuda")
model = AutoModelForCausalLM.from_pretrained(model_file, quantization_config=bnb_config, device_map="auto")
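
To verify the effect of quantization, you can check the loaded model’s memory footprint (the exact number depends on your setup):

# Rough check of how much memory the quantized weights occupy.
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")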

Inference on quantized model using transformers

Now it’s time to run inference on the quantized model. You can try generation with any prompt you want.

prompt = "I am suffering from flu, give me home remedies?"

# Tokenizing input text for the model.
input_ids = tokenizer([prompt], return_tensors="pt").input_ids.to("cuda") # .to(model.device)


# Generating output based on the input_ids.
# You can adjust the max_length parameter as necessary for your use case.
generated_tokens = model.generate(input_ids, max_length=50)

# Decoding the generated tokens to produce readable text.
generated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(generated_text)

You can find the entire code snippet for quantization using bitsandbytes in the following GitHub repository.

Quanto

Quanto is a PyTorch quantization toolkit available as an Optimum backend (optimum-quanto). The quantization method used is linear quantization. It supports weight quantization to 8-bit float, 8-bit integer, 4-bit integer, and 2-bit integer, as well as activation quantization, and it also supports quantization-aware training (QAT).

It is also possible to quantize any model, regardless of modality, using Quanto. For example, we can quantize openai/whisper-large-v3 to 2/4/8 bits with the same technique. You can find more about Optimum Quanto here.
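
Outside of transformers, optimum-quanto also exposes a direct PyTorch API. The sketch below quantizes a toy nn.Module rather than Whisper, just to show the quantize/freeze workflow; the layer sizes are arbitrary.

import torch
from optimum.quanto import quantize, freeze, qint8

# Toy module standing in for any nn.Module (the same calls work on larger models).
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

quantize(model, weights=qint8)  # mark the Linear weights for 8-bit quantization
freeze(model)                   # replace the fp32 weights with their quantized versions

with torch.no_grad():
    out = model(torch.randn(1, 256))
print(out.shape)                # torch.Size([1, 10])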

Quantization using Quanto

Let’s install Quanto first:

!pip install transformers accelerate optimum-quanto

Quanto is seamlessly integrated into the Hugging Face transformers library: we can quantize any model by passing a QuantoConfig to from_pretrained. In this example, we quantize the model weights to 4-bit integers.

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantization_config = QuantoConfig(weights="int4")  # weight quantization
# quantization_config = QuantoConfig(activations="int8")  # activation quantization (int8/float8 only)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
)

Inference on quantized model using transformers

Now it’s time to run inference on the quantized model using the .generate() method.

prompt = "What is multi-head attention in context of transformer?"

# Tokenizing input text for the model.
input_ids = tokenizer([prompt], return_tensors="pt").input_ids.to(quantized_model.device)

# Generating output based on the input_ids.
# You can adjust the max_length parameter as necessary for your use case.
generated_tokens = quantized_model.generate(input_ids, max_length=50)

# Decoding the generated tokens to produce readable text.
generated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(generated_text)

You can find the entire code snippet for quantization using Quanto in the following GitHub repository.

In conclusion, this hands-on guide demonstrated quantizing LLMs with llama.cpp, bitsandbytes, and Quanto, walking through model download, format conversion, the quantization steps, and evaluation with batched-bench and perplexity.

As large language models (LLMs) keep advancing, quantization will be key to making them easier to use and deploy in different settings. Whether you are a researcher, practitioner, or NLP enthusiast, learning about quantization can be an important step in working with these powerful models.

I hope you find this blog helpful for quantizing LLMs and running them on your local machine regardless of resource constraints. I’d also love to hear your feedback and any tips or experiences you have to share. Don’t forget to clap if you found this useful, and please share it with others who might benefit. Your support and insights help me improve. See you in the next blog soon!

