How to run LLMs locally?

Netra Prasad Neupane
14 min read · May 24, 2024


Large language models (LLMs) are deep learning models trained on immense amounts of data, making them capable of understanding and generating natural language and other types of content to perform a wide range of tasks such as summarization, question answering, content creation, and language translation. Most LLMs use a transformer decoder architecture and contain billions of parameters.

As big tech giants like OpenAI, Meta, and Google race against each other in generative AI, many open-source LLMs have already been released by AI startups and by some of the giants themselves. Major software engineering companies are now focusing on delivering AI solutions built on generative AI. Currently, the leading LLMs, such as GPT-4, Mistral Large, and Claude 3, are available on a pay-per-use basis. But what if you don’t want to use those APIs, for example because of privacy concerns? In that case, you have two options:

  • Deploy open-source LLMs (or LLMs fine-tuned on your data) on your own server, or
  • Set up LLMs locally for testing in your local environment.
fig: Running LLMs locally

In this blog, I will explain how to run the following three open-source LLMs locally using llama.cpp.

  1. Mistral-7B
  2. Phi-3-Mini
  3. Gemma-2B

You can also run LLMs locally using other approaches such as PrivateGPT, Ollama, GPT4All, llamafile, etc. But for this tutorial, I will stick to llama-cpp-python.

What is llama-cpp-python?

llama-cpp-python is a Python binding for llama.cpp. llama.cpp is an open-source software library, written in C/C++, that performs inference on various large language models such as Llama. It is co-developed alongside the ggml library, a general-purpose tensor library [1].

Let’s start the implementation. It is best practice to create a virtual environment isolated from your existing system Python environment and install all the dependencies inside it. Please follow my blog on Python virtual environments if you need any help with the setup.

To install llama-cpp-python using the pip package manager, type the following command in the terminal (make sure that you have activated your virtual environment).

pip install llama-cpp-python

Alternatively, if you want to install it by building from source, clone the repository first,

git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git

And then, move inside the llama-cpp-python/ directory and build from source.

pip install .

Mistral-7B

Mistral 7B is an open-source transformer-based large language model developed by Mistral AI. It has 7 billion parameters and outperforms many LLMs with larger parameter counts. The model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length at a reduced inference cost [2].

Mistral AI provides two types of models: open-weight models (Mistral 7B, Mixtral 8x7B, Mixtral 8x22B) and optimized commercial models (Mistral Small, Mistral Medium, Mistral Large, and Mistral Embeddings) [6]. Let's discuss the Mistral AI open-weight models in more detail:

  • Mistral 7B: This model has 7 billion parameters. It is the first dense model released by Mistral AI, well suited for experimentation, customization, and quick iteration. At the time of its release, it matched the capabilities of models up to 30B parameters. It has a context length of 32k.
  • Mixtral 8x7B: This model has 45 billion parameters but uses only about 12 billion of them during inference. It is a sparse mixture-of-experts model, which leads to better inference throughput. It has a 32k context size, i.e. equal to the smaller Mistral 7B model.
  • Mixtral 8x22B: This is a bigger sparse mixture-of-experts model. It leverages up to 141B parameters but uses only about 39B during inference, leading to better inference throughput. It has a context length of 64k.

To run Mistral-7B locally, we first need to download the model. It is available on the Hugging Face Hub. Hugging Face provides access to thousands of pre-trained transformer models for natural language processing, computer vision, audio, and more; it also hosts datasets and applications. There are mainly two ways to download the model:

  • Download it directly from the Hugging Face Hub website after logging into your Hugging Face account.
  • Download it using the huggingface-cli / huggingface_hub library.

Here, I will show how to download the model using both the huggingface_hub library and huggingface-cli. So, let’s install the huggingface-hub library (with the CLI extra) using the pip package manager.

pip install "huggingface-hub[cli]"

Once you have installed huggingface-hub, we can download the model with a few lines of Python. First, we need a Hugging Face Hub API token, which you can get by logging into your Hugging Face account. Please store the token in a .env file placed in your working directory.
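For the code below to pick up the token via os.getenv, the variables in the .env file have to be loaded into the process environment first. Here is a minimal sketch, assuming the python-dotenv package is installed; the token value shown is just a placeholder.

# .env in your working directory contains a single line such as:
# HUGGINGFACEHUB_API_TOKEN=hf_xxxxxxxxxxxxxxxx   (placeholder value)

from dotenv import load_dotenv  # pip install python-dotenv
load_dotenv()  # loads the variables defined in .env into the environment

With the token loaded, the following code downloads the model into your working directory.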

import os
from huggingface_hub import hf_hub_download

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_file = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"

# get the Hugging Face Hub API token (assumes it has been loaded into the environment, e.g. via .env)
HF_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

model_path = hf_hub_download(model_name,
                             filename=model_file,
                             local_dir='.',
                             token=HF_TOKEN)

Alternatively, if you want to download the model using huggingface-cli, enter the following command in your terminal (here we download into the current directory so that the path matches the inference code below).

huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .

Here, we have downloaded a 4-bit quantized model. Quantization is a technique to reduce the computational and memory cost of running inference by representing the weights and activations with low-precision data types, such as 4-bit integers (int4), instead of the usual 32-bit floating point (float32).
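A rough back-of-the-envelope calculation shows why this matters for a 7B model (the numbers below are approximations, not exact file sizes; the real Q4_K_M file is somewhat larger because of quantization scales and a few higher-precision tensors):

num_params = 7e9                  # Mistral-7B has roughly 7 billion parameters
fp32_gb = num_params * 4 / 1e9    # float32 weights: 4 bytes each  -> ~28 GB
int4_gb = num_params * 0.5 / 1e9  # 4-bit weights: 0.5 bytes each  -> ~3.5 GB
print(f"float32: ~{fp32_gb:.0f} GB, 4-bit: ~{int4_gb:.1f} GB")

That difference is what makes it practical to load the model into the RAM of an ordinary laptop.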

fig: Mistral 7B inference

Now, it’s time to run inference on the model. Here, we define the query as "What are the application of the LLMs in healthcare?" and initialize the model using the Llama class from the llama_cpp module. We are using the CPU for inference, but you can offload work to a GPU if your machine has one. Then we perform inference by passing the query inside the prompt.

from llama_cpp import Llama

# initialize the model
llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # path to the GGUF file
    n_ctx=4096,      # the max sequence length to use - longer sequence lengths require more resources
    n_threads=8,     # the number of CPU threads to use, tailor to your system
    n_gpu_layers=0,  # the number of layers to offload to GPU; set to 0 if no GPU acceleration is available
    verbose=False,
)

# define query
query = "What are the application of the LLMs in healthcare?"

# inference
output = llm(
    prompt=f"<|user|>\n{query}<|end|>\n<|assistant|>",
    max_tokens=4096,   # generate up to 4096 tokens
    stop=["<|end|>"],  # stop generation at the end-of-turn marker
    echo=False,        # whether to echo the prompt
)
print(f"response: {output['choices'][0]['text']}")

The response of the model for the given query is

LLMs, or Master of Laws degrees with a specialization in Healthcare Law, have several applications within the healthcare industry. Here are some of them:

1. Legal Compliance and Risk Management: With an LLM in Healthcare Law, professionals can help ensure their organizations comply with various regulations such as HIPAA, Stark Law, and Anti-Kickback Statute. They can also manage risks associated with healthcare lawsuits, disputes, and investigations.

2. Contract Negotiation: In the healthcare industry, contracts are a crucial part of business operations. An LLM in Healthcare Law can prepare professionals to draft, review, and negotiate various types of contracts including service agreements, employment contracts, vendor contracts, and managed care contracts.

3. Intellectual Property Protection: Healthcare organizations often invest in research and development for new medical technologies or treatments. A healthcare law specialist can help protect these intellectual properties through patents, trademarks, and copyrights.

4. Clinical Trials and Regulatory Affairs: An LLM in Healthcare Law can be beneficial for those involved in clinical trials and regulatory affairs. They can help ensure compliance with the Food and Drug Administration (FDA) regulations, European Medicines Agency (EMA), and other international healthcare regulatory bodies.

5. Compliance Education and Training: Many organizations require their employees to undergo regular training on healthcare laws and regulations. A professional with an LLM in Healthcare Law can design, develop, and deliver effective compliance education programs for different audiences within the organization.

6. Policy Development and Advocacy: With a strong understanding of healthcare law, professionals can contribute to policy development at local, state, or national levels. They can also advocate for changes in laws and regulations that benefit their organizations or patients.

7. Healthcare Administration: In executive roles within the healthcare industry, an LLM in Healthcare Law can add value by providing a deep understanding of legal aspects related to patient care, billing and reimbursement, compliance, risk management, and other operational areas.

Interestingly, the model interpreted "LLMs" as Master of Laws degrees rather than large language models, so the response, while coherent and well structured, is not about the LLMs we had in mind. With a short, ambiguous query like this, that kind of misreading is not surprising.
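One likely reason is the prompt format: the <|user|> ... <|end|> markers used above are not the template that the Mistral instruct models were fine-tuned on; they expect the [INST] ... [/INST] format instead. A sketch like the following (same model, just a different prompt wrapper) usually steers the model toward the intended meaning; spelling out "large language models (LLMs)" in the query helps as well.

# wrap the query in Mistral-Instruct's [INST] ... [/INST] chat format
prompt = f"[INST] {query} [/INST]"

output = llm(
    prompt=prompt,
    max_tokens=4096,
    echo=False,
)
print(output["choices"][0]["text"])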

That’s it! We have set up an LLM and run it on our own machine. If you have a low-end CPU, inference might take a while. Try running the model on a machine with a GPU if you want faster inference.
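If your llama-cpp-python build has GPU support (for example a CUDA or Metal build), offloading is just a matter of changing one argument when loading the model. A minimal sketch; how many layers you can offload depends on your VRAM:

llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU; use a smaller number if VRAM is limited
    verbose=False,
)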

Phi-3-Mini

Phi-3 is a family of open-source transformer-based small language models developed by Microsoft. Phi-3 models are among the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and even larger sizes across a variety of language, reasoning, coding, and math benchmarks [3]. The Phi-3 family has multiple models:

fig: Phi-3 vs other SLMs ( source: news.microsoft.com/)
  • Phi-3-mini: It has 3.8 billion parameters. This is the smallest and most versatile model, well suited for deployment on devices with limited resources or for cost-sensitive applications. It comes in two variants: a 4k-context-length model, ideal for tasks requiring shorter text inputs and faster response times, and a 128k-context-length model with a significantly longer context window, enabling it to handle and reason over larger pieces of text such as documents or code.
  • Phi-3-small: It has 7 billion parameters. This model had not been released at the time of writing. Phi-3-small offers a balance between performance and resource efficiency.
  • Phi-3-medium: It has 14 billion parameters. This upcoming model pushes the boundaries of Phi-3’s capabilities, targeting tasks requiring the highest level of performance.

I hope you have already installed huggingface-hub. Now it’s time to download Phi-3-mini from Hugging Face.

import os
from huggingface_hub import hf_hub_download

# download model
model_name = "microsoft/Phi-3-mini-4k-instruct-gguf"
model_file = "Phi-3-mini-4k-instruct-q4.gguf"

# get the Hugging Face Hub API token
HF_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
model_path = hf_hub_download(model_name,
                             filename=model_file,
                             local_dir='.',
                             token=HF_TOKEN)

Here, we have downloaded a 4-bit quantized model with a 4k context length into our working directory. Now, it’s time to run inference. For testing purposes, we use the same simple prompt as above, "What are the application of the LLMs in healthcare?", and initialize the model using the Llama class from llama_cpp. Again, we are using the CPU for inference, but you can switch to a GPU if your machine has one. Then we perform inference by passing the query inside the prompt. Note that the <|user|> ... <|end|> ... <|assistant|> wrapper used below is the chat format Phi-3 was trained on, so it fits this model well.

from llama_cpp import Llama

# initialize the model
llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # path to the GGUF file downloaded above
    n_ctx=4096,      # the max sequence length to use - longer sequence lengths require more resources
    n_threads=8,     # the number of CPU threads to use, tailor to your system
    n_gpu_layers=0,  # the number of layers to offload to GPU; set to 0 if no GPU acceleration is available
    verbose=False,
)

# define query
query = "What are the application of the LLMs in healthcare?"

# inference
output = llm(
    prompt=f"<|user|>\n{query}<|end|>\n<|assistant|>",
    max_tokens=4096,   # generate up to 4096 tokens
    stop=["<|end|>"],  # stop generation at Phi-3's end-of-turn marker
    echo=False,        # whether to echo the prompt
)

print(f"response: {output['choices'][0]['text']}")

The response of the model for the given query is

 Large Language Models (LLMs) like GPT-3 have numerous potential applications in the field of healthcare, offering promising benefits for various sectors. Here are some key areas where these models can be applied:

1. Medical documentation and coding: LLMs can help automate medical record keeping by processing clinical notes and extracting relevant information to populate electronic health records (EHR) and billing systems efficiently, reducing the burden on administrative staff.

2. Clinical decision support: By analyzing vast amounts of medical literature and research papers, LLMs can provide evidence-based recommendations for diagnosis, treatment plans, and drug interactions to assist clinicians in making informed decisions quickly and accurately.

3. Medical language processing: LLMs can be utilized for natural language understanding tasks such as extracting relevant information from unstructured data (e.g., patient reports, radiology images), enabling better insights into patients' conditions and supporting clinicians in their decision-making process.

4. Patient communication and engagement: LLMs can enhance the patient experience by providing more personalized responses to frequently asked questions on medical websites or through chatbots during telehealth appointments, offering support for appointment scheduling, medication reminders, and health education.

5. Medical research assistance: By analyzing scientific literature and data sets, LLMs can help researchers identify trends and potential new drug candidates more efficiently, ultimately accelerating the pace of medical breakthroughs.

6. Clinical trials management: LLMs can assist in streamlining clinical trial procedures by automating tasks such as patient screening, consent generation, monitoring adverse events, or data analysis.

7. Healthcare policy and regulation: LLMs can analyze large volumes of legislative documents to provide insights into the impact of new policies on healthcare systems, helping regulators make informed decisions and policymakers develop better-informed strategies.

8. Medical education and training: LLMs can be used as educational tools for medical students by simulating patient interactions or providing realistic case studies to enhance learning experiences and improve clinical skills.

9. Healthcare fraud detection: By analyzing large amounts of data, including insurance claims and billing information, LLMs may help identify patterns indicative of healthcare fraud and abuse, enabling early intervention and prevention.

10. Personalized medicine: With the growing availability of genomic and other omics data, LLMs can help analyze patient genetic profiles to tailor treatment plans according to individual needs for better outcomes.

It's important to note that while LLMs hold significant potential in healthcare applications, ethical considerations must be taken into account when designing systems using these models. This includes ensuring the privacy and security of patient data, as well as avoiding bias in decision-making processes based on incomplete or unrepresentative training data.

As expected, the model’s response to the given prompt is relevant and well structured.
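Hand-crafting a prompt template for every model quickly gets error-prone. llama-cpp-python also exposes a chat-style API, create_chat_completion, which applies the chat template stored in the GGUF metadata when the model file ships one, so the same calling code can be reused across models. A minimal sketch:

# chat-style API: the chat template is read from the GGUF metadata when available
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What are the application of the LLMs in healthcare?"}
    ],
    max_tokens=1024,
)
print(output["choices"][0]["message"]["content"])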

Gemma-2B

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well suited for a variety of text-generation tasks, including question answering, summarization, and reasoning [4]. Gemma has two variants:

fig: MMLU Benchmark, Gemma and other SLMs(source: ai.google.dev)
  • Gemma-2B: a smaller text-to-text open model with 2 billion parameters.
  • Gemma-7B: a larger text-to-text open model with 7 billion parameters.

I hope you have already installed huggingface-hub, so let’s download Gemma-2B from Hugging Face. Note that google/gemma-2b-it is a gated repository: you need to accept Google’s terms on the model page and authenticate with your token before the download will work.

from huggingface_hub import hf_hub_download
import os

# download model
model_name = "google/gemma-2b-it"
model_file = "gemma-2b-it.gguf"

HF_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
model_path = hf_hub_download(model_name,
                             filename=model_file,
                             local_dir='.',
                             token=HF_TOKEN)

Here, we have downloaded the gemma-2b-it model into our working directory. It is an instruction fine-tuned variant of the foundational gemma-2b model. Now, it’s time to run inference. For testing purposes, we use the same simple prompt as above, again on a CPU machine.
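In the code below, the question is passed to the model as a raw prompt, without any chat template. Gemma’s instruction-tuned models were trained with a turn-based format (documented by Google), so if the raw prompt gives weaker results, wrapping the query in the turn markers is worth trying. A sketch of that wrapper:

# Gemma-it chat format: a user turn followed by the start of the model turn
query = "What are the application of the LLMs in healthcare?"
prompt = f"<start_of_turn>user\n{query}<end_of_turn>\n<start_of_turn>model\n"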

from llama_cpp import Llama

# initialize the model
llm = Llama(
    model_path="./gemma-2b-it.gguf",  # path to the GGUF file
    n_ctx=4096,      # the max sequence length to use - longer sequence lengths require more resources
    n_threads=8,     # the number of CPU threads to use, tailor to your system
    n_gpu_layers=0,  # the number of layers to offload to GPU; set to 0 if no GPU acceleration is available
    verbose=False,
)

# define query
query = "What are the application of the LLMs in healthcare?"

# inference
output = llm(
    prompt=query,
    max_tokens=4096,  # generate up to 4096 tokens
    echo=False,       # whether to echo the prompt
)
# strip stray '?' characters from the edges of the completion
print(f"response: {output['choices'][0]['text'].strip('?')}")

The response of the model for the given query is

Large Language Models (LLMs) have emerged as a transformative technology with vast potential to revolutionize various industries, including healthcare. Here are some key applications of LLMs in healthcare:

**1. Natural Language Processing (NLP):**
- **Medical Text Summarization:** LLMs can be trained to summarize complex medical documents and reports, making them easier for healthcare professionals to digest and understand.
- **Medical Dialogue Systems:** LLMs can facilitate natural and engaging communication between patients and healthcare providers, enabling more effective information exchange.

**2. Drug Discovery:**
- **Drug Target Identification:** LLMs can analyze vast datasets of scientific literature and genomic information to identify novel drug targets, potentially leading to the development of new therapeutic approaches.
- **Personalized Medicine:** LLMs can help tailor treatment plans based on an individual's genetic makeup and medical history.

**3. Disease Diagnosis and Prediction:**
- **Medical Image Analysis:** LLMs can analyze medical images (e.g., X-rays, CT scans, MRIs) to assist in disease diagnosis, prognosis, and risk assessment.
- **Patient Risk Assessment:** LLMs can predict patients' risk of developing specific diseases based on their medical history and genetic data.

**4. Drug Safety:**
- **Safety Assessment:** LLMs can analyze vast datasets of drug safety reports and conduct thorough toxicity evaluations to identify potential risks associated with new drugs.
- **Drug Repurposing:** LLMs can help identify potential uses for existing drugs that may be beneficial in treating different diseases.

**5. Public Health:**
- **Disease Surveillance:** LLMs can monitor and analyze large datasets of health data, detecting outbreaks and identifying emerging infectious diseases more effectively.
- **Health Education and Awareness:** LLMs can generate personalized health education content tailored to individual patient needs.

**6. Patient Engagement and Empowerment:**
- **Chatbots and Virtual Assistants:** LLMs can create chatbots that provide personalized health information, symptom tracking, and support to patients.
- **Patient Communication:** LLMs can facilitate communication between patients and healthcare providers, allowing for more active participation in their healthcare journey.

Overall, LLMs have the potential to significantly enhance healthcare by improving disease diagnosis, personalized treatment, drug discovery, safety assessment, public health surveillance, and patient engagement.

The response of Gemma-2B looks illuminating despite its small number of parameters. On my machine, it took comparatively longer than the previous models, likely because the gemma-2b-it.gguf file we downloaded is not quantized. How is the model performing on your machine? Please don’t forget to share your feedback; it will help other readers as well.

In summary, llama.cpp supports local inference for many LLM families such as Mistral, Gemma, Phi, Qwen, Llama, TinyLlama, etc., and offers both flexibility and privacy. In addition, it supports both CPU and GPU environments. You can now run many of the models available on the Hugging Face Hub by downloading a GGUF file to your local machine and loading it with llama.cpp.
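One last practical tip: instead of waiting for the whole completion, llama-cpp-python can stream the output token by token, which feels much more responsive for long answers. A minimal sketch, reusing the llm object and query defined above:

# stream the completion chunk by chunk instead of returning it all at once
for chunk in llm(prompt=query, max_tokens=512, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()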

I hope this guide on running large language models locally has been helpful. I’d love to hear your feedback and any tips or experiences you have to share. Don’t forget to clap if you found this useful, and please share it with others who might benefit. Your support and insights help me improve!

References

  1. https://en.wikipedia.org/wiki/Llama.cpp
  2. Albert Q. Jiang et al., Mistral 7B
  3. https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/
  4. https://ai.google.dev/gemma/docs
  5. https://llama-cpp-python.readthedocs.io/en/latest/
  6. https://docs.mistral.ai/getting-started/models/


Netra Prasad Neupane

Machine Learning Engineer with expertise in Computer Vision, Deep Learning, NLP and Generative AI. https://www.linkedin.com/in/netraneupane/