Invoice Extraction Bot using Streamlit and Gemini

Netra Prasad Neupane
7 min read · Feb 16, 2024


Large language models (LLMs) have been a very hot topic since the release of ChatGPT in late 2022. More than one new LLM is released every month, often with better performance than its predecessors. A few months ago, Google released its multimodal LLM, Gemini. At the release, Google claimed that Gemini is the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), one of the most popular benchmarks for testing the knowledge and problem-solving abilities of AI models. According to Google’s claims, Gemini outperforms GPT-4 on most text, image, and video benchmarks, but not in audio processing. Gemini comes in the following three sizes:

  • Gemini-nano: Gemini-nano is a small model for on-device tasks. It can handle tasks such as summarizing text, suggesting smart replies in context, and checking grammar. It can run on smartphones such as the Google Pixel 8.
  • Gemini-pro: According to Google, Gemini-pro is the best model for scaling across a wide range of tasks. It is more powerful than Gemini-nano and can process text, images, code, audio, and video. Gemini-pro is currently available through an API and can also be accessed from Bard (which was recently renamed Gemini). In this blog, we will be using gemini-pro-vision for processing images.
  • Gemini-ultra: Gemini-ultra is the largest and most capable model, meant for highly complex tasks, and is said to be available for enterprises and data centres.
fig: invoice extraction bot using Gemini

That’s enough about Gemini, right? Let’s talk about our problem: information extraction. Extracting information from document images is a very challenging task. The image quality, the orientation and language of the text, and the performance of the OCR and information extraction engines all affect the accuracy of the extraction. If you come from an AI/ML background, you might have worked with different OCR engines to extract information from images after properly aligning them. However, the extraction results depend on the quality of the image and the performance of the OCR. If the OCR engine doesn’t perform well, we can’t expect proper extraction results. So, many modules need to work well together to get accurate results from the extraction system. The following figure illustrates the components of a traditional pipeline for extracting information from document images.

fig: invoice data extraction using OCR (reference: https://www.docsumo.com/blog/ocr-invoicing)
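
To make that fragility concrete, here is a minimal sketch of the last stage of such a pipeline: a hand-written rule applied to OCR text. The OCR text, the restaurant name, and the names `TOTAL_PATTERN` and `extract_total` are all made up for illustration; a real pipeline would have many such rules plus the OCR and alignment stages in front of them.

```python
import re
from typing import Optional

# Hypothetical OCR output for an invoice; real OCR text is usually noisier.
CLEAN_OCR_TEXT = """ACME RESTAURANT
Date: 29-AUG-2011
GST 314.65
TOTAL 2874.65
"""

# Same text, but the OCR engine misread the letter "O" as the digit "0".
NOISY_OCR_TEXT = CLEAN_OCR_TEXT.replace("TOTAL", "T0TAL")

# A hand-written rule: the word TOTAL followed by an amount on the same line.
TOTAL_PATTERN = re.compile(r"^TOTAL\s+(\d+(?:\.\d{2})?)\s*$", re.MULTILINE)


def extract_total(ocr_text: str) -> Optional[str]:
    """Return the total amount as a string, or None if the rule does not match."""
    match = TOTAL_PATTERN.search(ocr_text)
    return match.group(1) if match else None


print(extract_total(CLEAN_OCR_TEXT))  # 2874.65
print(extract_total(NOISY_OCR_TEXT))  # None
```

A single misread character is enough to make the rule miss the field, which is why every stage ahead of it has to work well.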

But multimodal LLMs help us get rid of the lengthy information extraction process above. A few days ago, I tested gemini-pro-vision on invoice document extraction and was surprised by its performance on the task. It works better than expected, and the accuracy of the model is also good. But how does it work? You might have the same question. Let me explain it below:

  • Multimodal LLMs are trained on a diverse range of image-text pairs during pretraining, enabling them to understand the relationship between textual and visual information without explicitly parsing text from images.
  • These models learn to generate or classify text based on the visual content of the image and vice versa. While they can interpret text in images as part of their overall understanding, they don’t rely on traditional OCR methods, as their approach is more holistic and integrated.

Now, let’s dive deep into the practical implementation. Here, we will be using the gemini-pro-vision model through the Gemini API. You can get a Gemini API key at https://makersuite.google.com/app/apikey. Currently, you can make 60 queries per minute to gemini-pro-vision using an API key obtained from the above location. See the following image for more insight into pricing.

fig: Gemini-Pro pricing (reference: https://ai.google.dev/pricing)
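
If you plan to send many invoices in a loop, it may help to stay under the 60 queries per minute limit on the client side. The following is only a sketch under that assumption: `RateLimiter` is a hypothetical helper written for this post, not part of the google-generativeai library, and the `clock` and `sleep` parameters exist only so it can be tried without really waiting.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls within any sliding window of `period` seconds."""

    def __init__(self, max_calls=60, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of the most recent calls

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Drop timestamps that have already left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)
```

One would create a single `limiter = RateLimiter()` and call `limiter.wait()` right before each request to the model.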

Here, I am using the google-generativeai Python library to access the Gemini models and Streamlit to build the invoice extraction bot UI. Streamlit is an open-source Python framework for quickly prototyping, sharing, and deploying data-driven machine learning and data science applications without needing expertise in web development. We will use the Pillow library to open the uploaded image and the python-dotenv library to read variables from the environment (.env) file.

First, we import the required dependencies (Streamlit, the Gemini client, Pillow, and dotenv) and then configure the Gemini API.

# import dependencies
import os
import streamlit as st
import google.generativeai as genai
from PIL import Image
from dotenv import load_dotenv

# configure API by loading key from .env file
# load environment variables
load_dotenv()
# configure key
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
genai.configure(api_key=GEMINI_API_KEY)
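
Note that os.getenv returns None when the variable is missing, so a forgotten .env entry may only surface later as a less obvious error. As a hedged sketch, a small helper (the name `require_env` is my own, not part of any library) can fail early with a clearer message:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line such as {name}=<your key> to your .env file."
        )
    return value
```

With this in place, the key could be loaded with `GEMINI_API_KEY = require_env("GEMINI_API_KEY")` after the load_dotenv() call.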

Great, we have configured the Gemini API. It’s time to initialize the model. Currently, Gemini-pro comes in two variants: gemini-pro for text-to-text processing and gemini-pro-vision for image-and-text-to-text processing. For invoice extraction, let’s create a function to initialize the gemini-pro-vision model.


# initialize the gemini-pro-vision model
def initialize_model(model_name="gemini-pro-vision"):
    model = genai.GenerativeModel(model_name)
    return model

Once the model is initialized, we can use it to query invoices. For this, let’s create the Streamlit UI with both a prompt box and an image upload interface.

# create the streamlit ui and get prompt along with image
st.set_page_config("Invoice Extractor")
st.header("Invoice Extractor")
# read the prompt from a text box
prompt = st.text_input("Enter your prompt", key="prompt")
# interface to upload image
uploaded_image = st.file_uploader("Choose an image", type=["jpg", "png", "jpeg"])

After running the above code to generate the Streamlit UI, you will end up with the following interface for querying invoices.

fig: Invoice Extractor UI

Now, let’s call the Gemini model initialized above and pass both the image and the prompt to it. Here, we need to convert the uploaded image to bytes and feed those bytes to the model, which we do with the get_image_bytes() function.

def get_image_bytes(uploaded_image):
    if uploaded_image is not None:
        # read the uploaded image in bytes
        image_bytes = uploaded_image.getvalue()

        # wrap the bytes together with the MIME type reported by Streamlit
        image_info = [
            {
                "mime_type": uploaded_image.type,
                "data": image_bytes
            }
        ]
        return image_info
    else:
        raise FileNotFoundError("Upload Valid image file!")
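
To see the shape of the payload without starting Streamlit, here is a small sketch that feeds the function a stand-in object. `FakeUpload` is hypothetical and only mimics the two members the function touches (`.type` and `.getvalue()`); the bytes are not a real image.

```python
from dataclasses import dataclass


@dataclass
class FakeUpload:
    """Hypothetical stand-in for Streamlit's uploaded file object."""
    type: str
    content: bytes

    def getvalue(self) -> bytes:
        return self.content


def get_image_bytes(uploaded_image):
    """Same logic as the function above, repeated so this sketch runs on its own."""
    if uploaded_image is None:
        raise FileNotFoundError("Upload Valid image file!")
    return [{"mime_type": uploaded_image.type, "data": uploaded_image.getvalue()}]


fake = FakeUpload(type="image/png", content=b"\x89PNG fake bytes")
image_info = get_image_bytes(fake)
print(image_info[0]["mime_type"])  # image/png
```

The result is a one-element list, which is why the model call later indexes it with image[0].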

Now, let’s complete our remaining workflow from model initialization to prompting on the image.

def get_response(model, model_behavior, image, prompt):
    # send the behavior instruction, the image and the user prompt together
    response = model.generate_content([model_behavior, image[0], prompt])
    return response.text


# initialize the gemini-pro-vision
model = initialize_model("gemini-pro-vision")

if uploaded_image is not None:  # file upload handling
    image = Image.open(uploaded_image)
    # display the invoice image
    st.image(image, caption="Your image", use_column_width=True)

# create submit button, to submit image along with prompt
submit = st.button("submit")

# set the model behavior
model_behavior = """
You are an expert who understands the overall structure of invoices and has deep knowledge of them.
We will upload the invoice image and you have to answer the question based on the information
present in the invoice image.
"""

# if user pressed submit button
if submit or prompt:
    if len(prompt) > 0:
        # get uploaded image file in bytes
        image_info = get_image_bytes(uploaded_image)
        response = get_response(model, model_behavior, image_info, prompt)
        st.write(response)
    else:
        raise ValueError("Please Enter Valid prompt!")

In the above code, we set the behaviour of the model (a short instruction passed ahead of the image and the prompt) to get the best performance in invoice extraction. We also raise exceptions for invalid input, such as a missing image or an empty prompt. That completes the implementation. You can find all the code snippets on my Github. Now it’s time to test the performance of the model on invoices. Here, I am testing on two invoices with three questions each.
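
A side note on the error handling: a raised exception shows up in the Streamlit page as a traceback. If you prefer friendlier messages, one possible alternative is to collect the problems first. This is only a sketch; `validate_inputs` is a hypothetical helper, not something the app above uses.

```python
from typing import List


def validate_inputs(prompt: str, has_image: bool) -> List[str]:
    """Return a list of human-readable problems; an empty list means the inputs are fine."""
    problems = []
    if not has_image:
        problems.append("Please upload an invoice image (jpg, jpeg or png).")
    if not prompt.strip():
        problems.append("Please enter a prompt.")
    return problems


print(validate_inputs("What is the total?", has_image=True))  # []
```

In the app, each returned message could then be shown with st.error instead of raising.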

Let’s query the following invoice image using the bot. You just need to upload the image and write a query about it.

fig: invoice sample 1

The following image shows the response from our Bot once we upload an image and prompt on it.

fig: Invoice Extraction Bot Demo Image

Query 1: Give me a table for particulars and price lists.

Response:

| Particulars        | Price   |
|--------------------|---------|
| 8 PDR MENU         | 2800.00 |
| 8 TERRINE          | 360.00  |
| 1 P BASS RVV 09    | 150.00  |
| 3 250 WINE PAIRING | 750.00  |
| DEPOSIT RDM        | 1500.00 |
| GST                | 314.65  |
| TTL PAYMENT        | 2874.65 |

Query 2: What is the total cost of food?

Response: 3160.00

Query 3: When the invoice is billed?

Response: Invoice is billed on 29-AUG-2011 07:10 PM
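
As a sanity check, the answers can be cross-checked against the table from Query 1 with a few lines of arithmetic. The sketch below makes two assumptions of mine: that "food" means the menu and terrine lines, and that the deposit is subtracted and GST added to reach the payment total. `parse_markdown_table` is a hypothetical helper written for this check.

```python
from decimal import Decimal

RESPONSE_TABLE = """
| Particulars | Price |
|-------------|-------|
| 8 PDR MENU | 2800.00 |
| 8 TERRINE | 360.00 |
| 1 P BASS RVV 09 | 150.00 |
| 3 250 WINE PAIRING | 750.00 |
| DEPOSIT RDM | 1500.00 |
| GST | 314.65 |
| TTL PAYMENT | 2874.65 |
"""


def parse_markdown_table(text):
    """Parse a two-column Markdown table into a {name: Decimal price} dict."""
    prices = {}
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header row and the |---|---| separator row.
        if len(cells) != 2 or cells[0] == "Particulars" or set(cells[0]) <= {"-", ":"}:
            continue
        prices[cells[0]] = Decimal(cells[1])
    return prices


prices = parse_markdown_table(RESPONSE_TABLE)

# Query 2: assume "food" means the menu and terrine lines.
food = prices["8 PDR MENU"] + prices["8 TERRINE"]
print(food)  # 3160.00

# Assume the deposit is subtracted and GST is added to reach the payment total.
items = food + prices["1 P BASS RVV 09"] + prices["3 250 WINE PAIRING"]
expected_total = items - prices["DEPOSIT RDM"] + prices["GST"]
print(expected_total == prices["TTL PAYMENT"])  # True
```

Under those assumptions, the extracted numbers are consistent with each other: 2800.00 + 360.00 = 3160.00, and 4060.00 - 1500.00 + 314.65 = 2874.65.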

Fantastic! It works perfectly fine. I am totally satisfied with the performance of gemini-pro-vision in the invoice extraction application. Now, let’s query the following invoice image as well.

fig: invoice sample

Query 1: How much does the veggie app cost?

Response: The veggie app costs $0.00.

Query 2: How much is alcohol tax?

Response: The alcohol tax is $4.88.

Query 3: what is the total bill amount?

Response: The total bill amount is $95.50.

In conclusion, gemini-pro-vision works well when invoices are well structured (all the related information sits in the same row). In the invoice samples above, the bot answered all the questions correctly despite average lighting and low print quality. After some experiments, I found that it sometimes fails when invoices are not fully structured. Please don’t forget to share feedback on how it works on your data. I am very excited to hear your thoughts.

Finally, I hope that after reading this blog, you can easily use this multimodal LLM for information extraction tasks. If you have any questions, I am happy to answer them where I can. If you liked it, please don’t forget to clap and share it with your friends. See you in the next blog!


Netra Prasad Neupane

Machine Learning Engineer with expertise in Computer Vision, Deep Learning, NLP and Generative AI. https://www.linkedin.com/in/netraneupane/