Extracting and Highlighting Key-Value Pairs in PDF Documents Using Python
Information extraction is the process of pulling structured information out of unstructured or semi-structured text documents. With the massive growth in internet usage, the volume of data has grown exponentially: by a commonly quoted estimate, almost 90% of the world's data was created in the last two years, and the total volume doubles roughly every two years. Most of this data is unstructured (images, PDFs, video, and so on) rather than structured (relational tables), and extracting information from unstructured data is considerably harder than from structured data. Doing it well often calls for techniques such as machine learning. In this blog, I will explain a very simple method to extract key-value pairs from unstructured PDFs.
PDF stands for “Portable Document Format”. It is typically used for files that should not be modified but still need to be easy to share and print, and it is still the most widely used format for storing documents in business settings. Here, I will use three Python libraries: pdfplumber for extraction (and for a first attempt at highlighting), PyMuPDF for highlighting the PDF, and pandas to show the extracted information as a table. Everything could be done with PyMuPDF alone, but I am using pdfplumber for the extraction. All three libraries can be installed with the pip package manager.
!pip install pdfplumber
!pip install PyMuPDF
!pip install pandas
For simplicity, we are using a simple PDF that contains minimal information. You can find this PDF on my GitHub.
Now, let’s read the PDF using pdfplumber. Here I extract every word from the PDF along with its bounding box.
import pdfplumber
input_pdf = "./resources/personal_information.pdf"
with pdfplumber.open(input_pdf) as pdf:
    document_info = dict()
    for i, page in enumerate(pdf.pages):
        word_bboxes = []
        words = page.extract_words()
        for word in words:
            text = word["text"]
            bbox = [word["x0"], word["top"], word["x1"], word["bottom"]]  # Bounding box: (left, top, right, bottom)
            word_bboxes.append({"text": text, "bbox": bbox})
        document_info[i] = {"document": word_bboxes, "dimension": [page.width, page.height]}
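If you have not used pdfplumber before, the following sketch may help to picture the data that the loop above builds. The word dictionaries are written by hand to mimic entries returned by `extract_words()`: only the keys (`text`, `x0`, `x1`, `top`, `bottom`) are taken from the code above, while the coordinate values, the page size, and the names `sample_words`, `to_record`, and `sample_document_info` are made up for illustration.

```python
# Hand-written stand-ins for entries returned by page.extract_words().
# The coordinate values are invented; only the keys match the code above.
sample_words = [
    {"text": "Name", "x0": 72.0, "x1": 104.5, "top": 90.0, "bottom": 102.0},
    {"text": "John", "x0": 250.0, "x1": 276.0, "top": 90.0, "bottom": 102.0},
]


def to_record(word):
    """Convert a pdfplumber-style word dict to the {"text", "bbox"} form (hypothetical helper)."""
    return {
        "text": word["text"],
        # (left, top, right, bottom); "top" and "bottom" are measured from the top of the page
        "bbox": [word["x0"], word["top"], word["x1"], word["bottom"]],
    }


records = [to_record(w) for w in sample_words]

# One entry per page index. 612 x 792 points is US Letter, used here as an assumed page size.
sample_document_info = {0: {"document": records, "dimension": [612, 792]}}

print(sample_document_info[0]["document"][0])
```

Each page therefore ends up as a flat list of word records plus the page dimensions, which is all the grouping functions need.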
I have written three utility functions:
- group_bboxes_by_lines() groups the words that lie on the same line.
- group_nearest_words() groups neighboring words on a line so that they can share one bounding box.
- get_common_bbox_for_neighbors() returns a single bounding box (and the joined text) for each group of words.
def group_bboxes_by_lines(bboxes, threshold=10):
    # bboxes must already be sorted from top to bottom
    if not bboxes:  # a page without words has no lines
        return []
    grouped_lines = []
    current_line = [bboxes[0]]
    for bbox in bboxes[1:]:
        # same line if the top edge is within `threshold` of the previous word's top edge
        if abs(bbox["bbox"][1] - current_line[-1]["bbox"][1]) < threshold:
            current_line.append(bbox)
        else:
            grouped_lines.append(current_line)
            current_line = [bbox]
    grouped_lines.append(current_line)
    return grouped_lines
# Group neighboring words of each line together
def group_nearest_words(line_bboxes, h_threshold=15):
    formatted_line_bboxes = []
    for line in line_bboxes:
        # sort left to right; words on one line can have slightly different top values
        line = sorted(line, key=lambda w: w["bbox"][0])
        formatted_line = []
        current_words = [line[0]]
        for word in line[1:]:
            # a word joins the current group if it starts close to where the previous word ends
            if abs(word["bbox"][0] - current_words[-1]["bbox"][2]) < h_threshold:
                current_words.append(word)
            else:
                formatted_line.append(current_words)
                current_words = [word]
        # always flush the last group, including on single-word lines
        formatted_line.append(current_words)
        formatted_line_bboxes.append(formatted_line)
    return formatted_line_bboxes
# Get a common bbox for each group of neighboring words
def get_common_bbox_for_neighbors(formatted_line_bboxes):
    formatted_lines = []
    for line in formatted_line_bboxes:
        formatted_line = []
        for neighbor_words in line:
            text = neighbor_words[0]["text"]
            x_min, y_min, x_max, y_max = neighbor_words[0]["bbox"]
            for word in neighbor_words[1:]:
                x1, y1, x2, y2 = word["bbox"]
                # grow the common bbox so that it also covers this word
                x_min, y_min = min(x1, x_min), min(y1, y_min)
                x_max, y_max = max(x2, x_max), max(y2, y_max)
                text += " " + word["text"]
            formatted_line.append({"text": text, "bbox": [x_min, y_min, x_max, y_max]})
        formatted_lines.append(formatted_line)
    return formatted_lines
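To see what these three steps do together, here is a condensed, stand-alone sketch that runs the same idea on a few hand-made word boxes. It is not the code from this blog: `group_lines` and `merge_neighbors` are hypothetical, shortened re-implementations written so that the example runs on its own, and the words and coordinates in `toy_words` are invented.

```python
def group_lines(words, v_threshold=10):
    """Group words whose top edge is within v_threshold of the previous word's top edge."""
    lines = []
    # sort top to bottom, then left to right, as the blog does before grouping
    for word in sorted(words, key=lambda w: (w["bbox"][1], w["bbox"][0])):
        if lines and abs(word["bbox"][1] - lines[-1][-1]["bbox"][1]) < v_threshold:
            lines[-1].append(word)
        else:
            lines.append([word])
    return lines


def merge_neighbors(line, h_threshold=15):
    """Merge horizontally close words of one line into phrases with a common bbox."""
    phrases = []
    for word in sorted(line, key=lambda w: w["bbox"][0]):
        x0, top, x1, bottom = word["bbox"]
        if phrases and abs(x0 - phrases[-1]["bbox"][2]) < h_threshold:
            last = phrases[-1]
            lx0, ltop, lx1, lbottom = last["bbox"]
            last["text"] += " " + word["text"]
            last["bbox"] = [min(lx0, x0), min(ltop, top), max(lx1, x1), max(lbottom, bottom)]
        else:
            phrases.append({"text": word["text"], "bbox": [x0, top, x1, bottom]})
    return phrases


# Invented word boxes (left, top, right, bottom), deliberately out of order.
toy_words = [
    {"text": "Doe", "bbox": [280, 100, 300, 112]},
    {"text": "29", "bbox": [250, 130, 262, 142]},
    {"text": "Full", "bbox": [72, 100, 95, 112]},
    {"text": "John", "bbox": [250, 100, 276, 112]},
    {"text": "Age", "bbox": [72, 130, 92, 142]},
    {"text": "Name", "bbox": [99, 100, 130, 112]},
]

merged = [merge_neighbors(line) for line in group_lines(toy_words)]
print([[phrase["text"] for phrase in line] for line in merged])
# [['Full Name', 'John Doe'], ['Age', '29']]
```

Words that are 4 points apart ("Full" and "Name") end up in one phrase, while the 120-point gap between "Full Name" and "John Doe" keeps them separate. That wide gap is what the key-value rule in the next step relies on.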
Now, let’s call the above functions and separate the keys and values in the extracted information.
import pandas as pd
for j in range(len(document_info)):  # one entry per page
    width, height = document_info[j]["dimension"]
    page_document_info = document_info[j]["document"]
    # sort the words top to bottom, then left to right
    page_document_info_sorted = sorted(page_document_info, key=lambda z: (z["bbox"][1], z["bbox"][0]))
    line_bboxes = group_bboxes_by_lines(page_document_info_sorted)
    formatted_line_bboxes = group_nearest_words(line_bboxes)
    formatted_lines = get_common_bbox_for_neighbors(formatted_line_bboxes)
    # In the sample PDF each key-value line now holds two bboxes, so the rule is simple:
    # a key and its value are separated by a wide horizontal gap.
    # (The values also start at least 1/3 of the page width from the left,
    # but only the gap is checked here.)
    kv_threshold = 50
    extracted_infos = []
    for line in formatted_lines:
        if len(line) < 2:  # there is no key-value pair in this line
            continue
        if abs(line[1]["bbox"][0] - line[0]["bbox"][2]) > kv_threshold:
            extracted_infos.append({"key": line[0], "value": line[1]})
    # For simplicity, let's write the extracted information to a DataFrame
    df = pd.DataFrame(columns=["key", "value"])
    for info in extracted_infos:
        df.loc[len(df)] = [info["key"]["text"], info["value"]["text"]]
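The key-value rule is easier to check when it is isolated from the PDF handling. The sketch below applies the same gap test to a few invented lines of already merged phrases; `split_key_value` and `toy_lines` are hypothetical names, and the only value taken from the code above is the threshold of 50.

```python
def split_key_value(line, kv_threshold=50):
    """Return (key_text, value_text) if the first two phrases are far enough apart, else None."""
    if len(line) < 2:  # a heading or a free-standing phrase has no value
        return None
    key, value = line[0], line[1]
    # phrases are sorted left to right, so the gap is the value's left edge
    # minus the key's right edge
    gap = value["bbox"][0] - key["bbox"][2]
    if gap > kv_threshold:
        return key["text"], value["text"]
    return None


# Invented lines of merged phrases, each with a (left, top, right, bottom) bbox.
toy_lines = [
    # a heading: one phrase only
    [{"text": "Personal Information", "bbox": [72, 60, 210, 74]}],
    # gap of 120 between key and value
    [{"text": "Full Name", "bbox": [72, 100, 130, 112]},
     {"text": "John Doe", "bbox": [250, 100, 300, 112]}],
    # gap of only 20: two phrases, but not a key-value pair
    [{"text": "see", "bbox": [72, 130, 90, 142]},
     {"text": "note", "bbox": [110, 130, 135, 142]}],
]

pairs = [kv for kv in map(split_key_value, toy_lines) if kv is not None]
print(pairs)
# [('Full Name', 'John Doe')]
```

Only the second line passes the test. The right threshold depends on the layout of your documents, so treat 50 as a starting point for the sample PDF rather than a general rule.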
The extracted information looks as follows:
Now we can highlight the PDF using pdfplumber, by converting each page of the PDF to an image.
import pdfplumber
import pandas as pd
color = (255, 255, 0, 120)  # semi-transparent yellow (RGBA)
input_pdf = "./resources/personal_information.pdf"
with pdfplumber.open(input_pdf) as pdf:
    document_info = dict()
    for i, page in enumerate(pdf.pages):
        word_bboxes = []
        words = page.extract_words()
        for word in words:
            text = word["text"]
            bbox = [word["x0"], word["top"], word["x1"], word["bottom"]]  # Bounding box: (x0, top, x1, bottom)
            word_bboxes.append({"text": text, "bbox": bbox})
        document_info[i] = {"document": word_bboxes, "dimension": [page.width, page.height]}
        width, height = document_info[i]["dimension"]
        page_document_info = document_info[i]["document"]
        # sort the words top to bottom, then left to right
        page_document_info_sorted = sorted(page_document_info, key=lambda z: (z["bbox"][1], z["bbox"][0]))
        line_bboxes = group_bboxes_by_lines(page_document_info_sorted)
        formatted_line_bboxes = group_nearest_words(line_bboxes)
        formatted_lines = get_common_bbox_for_neighbors(formatted_line_bboxes)
        # In the sample PDF each key-value line now holds two bboxes, so the rule is simple:
        # a key and its value are separated by a wide horizontal gap.
        # (The values also start at least 1/3 of the page width from the left,
        # but only the gap is checked here.)
        kv_threshold = 50
        extracted_infos = []
        for line in formatted_lines:
            if len(line) < 2:  # there is no key-value pair in this line
                continue
            if abs(line[1]["bbox"][0] - line[0]["bbox"][2]) > kv_threshold:
                extracted_infos.append({"key": line[0], "value": line[1]})
        df = pd.DataFrame(columns=["key", "value"])
        for info in extracted_infos:
            df.loc[len(df)] = [info["key"]["text"], info["value"]["text"]]
        im = page.to_image()  # create an image of the page
        key_val_rects = [val["key"]["bbox"] for val in extracted_infos] + [val["value"]["bbox"] for val in extracted_infos]
        im.draw_rects(key_val_rects, fill=color, stroke=None)
        # save each page as an image
        im.save(f"./resources/outputs/personal_information_highlighted_{i}.png", format="PNG", quantize=True, colors=256, bits=8)
The saved image for the first page of the PDF looks as follows:
This highlighting is not great. Another drawback of highlighting with pdfplumber is that each page has to be converted to an image, so we lose the overall structure of the PDF, which is undesirable in most cases. To preserve the PDF structure and highlight it properly, let’s use PyMuPDF instead of pdfplumber for highlighting.
Here, we add a few lines of code to highlight the PDF, using the bounding boxes from the key-value information extracted above.
# Now it is time to highlight the PDF
import fitz  # PyMuPDF
def add_highlights(page, bbox):
    rectangle = fitz.Rect(bbox)
    highlight = page.add_highlight_annot(rectangle)
    highlight.set_colors({"stroke": (1, 1, 0)})  # yellow, RGB components in the range 0-1
    highlight.update()  # re-render the annotation so the colour change takes effect
    return page
def highlight_pdf(pdf_file, output_file, page_num, extracted_info):
    # Open the PDF document
    pdf_document = fitz.open(pdf_file)
    # Get the page you want to work with (for example, page index 0)
    page = pdf_document[page_num]
    # Loop over each extracted key-value pair
    for key_value_pairs in extracted_info:
        add_highlights(page, key_value_pairs["key"]["bbox"])  # highlight the key
        add_highlights(page, key_value_pairs["value"]["bbox"])  # highlight the value
    # Save the PDF with the new annotations
    pdf_document.save(output_file)
    # Close the PDF document
    pdf_document.close()
# Experiment here
output_pdf = "./resources/outputs/personal_information_highlighted.pdf"
page_num = 0  # Index of the page whose key-value pairs were extracted
highlight_pdf(input_pdf, output_pdf, page_num, extracted_infos)
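One optional refinement, which is not part of the code above, is to pad each bounding box slightly before highlighting so that the highlight does not hug the text too tightly. The page dimensions stored in `document_info` could be used to keep the padded box inside the page. The helper `pad_bbox` and its default `margin` of 2 points are assumptions made for this sketch.

```python
def pad_bbox(bbox, page_width, page_height, margin=2.0):
    """Grow a (left, top, right, bottom) box by `margin` on every side, clamped to the page."""
    x0, top, x1, bottom = bbox
    return (
        max(0.0, x0 - margin),
        max(0.0, top - margin),
        min(float(page_width), x1 + margin),
        min(float(page_height), bottom + margin),
    )


# Invented example: a key box on a 612 x 792 point page (US Letter, assumed size).
padded = pad_bbox([72.0, 100.0, 130.0, 112.0], 612, 792)
print(padded)
# (70.0, 98.0, 132.0, 114.0)
```

The returned tuple has the same (left, top, right, bottom) order as the boxes used throughout this blog, so it could be passed to `add_highlights()` in place of the original box.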
The PDF is now highlighted and saved. It looks as follows:
As you can see, this highlighting looks much better, and it preserves the structure of the PDF as well. If you want to download the full source code and run it yourself, you can find it in my GitHub repository called blog_code_snippets.
I hope this blog helps you extract information from PDFs and highlight it for quick visualization and quality control. If it was helpful, please don’t forget to clap and share it with your friends. See you in the next blog…