In this article, I want to show you how to create a Python wrapper that solves a basic problem of the Tesseract engine: slow recognition of documents with many pages.
The basic idea is to split a document into separate pages and use Python's built-in multiprocessing features to run several Tesseract instances that recognize the pages in parallel.
By default, Tesseract uses only one CPU core to recognize an image. In the average case that is enough, but a "heavy" document with many sheets is processed one page at a time, so recognition becomes very slow.
Sometimes you cannot copy the content of a PDF document. PDFs come in different kinds, and one of them is a scanned PDF, which is just an image of the actual text. In that case, we need software that recognizes text from an image.
This technology is called OCR (Optical Character Recognition). One of the most popular free OCR engines is Tesseract. Tesseract was originally developed at Hewlett-Packard between 1985 and 1994 and was released as open source in 2005. From 2006, Google sponsored its development. Today, Tesseract is maintained for free by a community of enthusiasts.
To develop our Python wrapper, we need a recent version of Python (at the time of writing, it is 3.11, download link: https://www.python.org/downloads/release/python-3110/) and pipenv (for the virtual environment). When the Python installer has finished downloading, run it on your computer. Don't forget to select "Add python.exe to PATH" so that you can run Python from PowerShell or cmd.exe.
Next, install pipenv with the pip install pipenv command in cmd.exe. At the end, you will see "Successfully installed". We also need to install pdf2image (pip install pdf2image) and pytesseract (pip install pytesseract), and to download Poppler, which pdf2image uses to render PDF pages.
When pipenv creates the virtual environment for the project (for example, after running pipenv shell in the project folder), you will see "Successfully created virtual environment" and its full path. Remember the full path, because you will use it in the VS Code IDE. Open VS Code, press Ctrl + Shift + P, and select "Python: Select Interpreter". Then select our newly created virtual environment from the menu.
Next, we need to download the Tesseract engine (v5.2.0.20220712) and put it inside our project folder (create a tesseract folder inside the project folder).
After installation, we are ready to go.
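Before running any OCR code, it can help to confirm that both external tools are where we expect them. The sketch below is only an illustration: missing_prerequisites is a hypothetical helper name, and it assumes that pdf2image needs Poppler's pdftoppm binary in the folder we pass as poppler_path.

```python
import os
import shutil


def missing_prerequisites(tesseract_cmd, poppler_path):
    """Return a list of problems; an empty list means both tools were found.

    Hypothetical helper: `tesseract_cmd` is the full path to the tesseract
    executable, `poppler_path` is the Poppler bin folder given to pdf2image.
    """
    problems = []
    # The tesseract executable must exist at the exact path we configure.
    if not os.path.isfile(tesseract_cmd):
        problems.append(f"tesseract executable not found: {tesseract_cmd}")
    # Look for pdftoppm only inside the given Poppler folder, not on PATH.
    if shutil.which("pdftoppm", path=poppler_path) is None:
        problems.append(f"pdftoppm not found in: {poppler_path}")
    return problems
```

Call it with the same two paths that main.py uses and print the returned list; if it is empty, the environment is ready.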
Create a main.py file and paste this code.
import tempfile
from pdf2image import convert_from_path
import pytesseract
import time

# Full path to the tesseract executable inside the project folder
pytesseract.pytesseract.tesseract_cmd = r'C:/Users/Nuriq/Desktop/python_wrapper/tesseract/tesseract.exe'


def start_ocr(pdf_path, poppler_path):
    full_raw_text = ""
    with tempfile.TemporaryDirectory() as path:
        # Render every PDF page to a JPEG file and get the file paths back
        images_from_path = convert_from_path(
            pdf_path=pdf_path,
            output_folder=path,
            paths_only=True,
            fmt="jpeg",
            poppler_path=poppler_path,
            dpi=250,
            grayscale=True
        )
        # Recognize the pages one by one while the temporary files still exist
        for img_path in images_from_path:
            full_raw_text += pytesseract.image_to_string(img_path)
    return full_raw_text


if __name__ == '__main__':
    start_time = time.time()
    full_raw_text = start_ocr(
        'C:/Users/Nuriq/Desktop/python_wrapper/PublicWaterMassMailing.pdf',
        'C:/Users/Nuriq/Desktop/python_wrapper/poppler-0.68.0/bin'
    )
    print(full_raw_text)
    end_time = time.time()
    print(f"it took: {end_time-start_time}")
The pytesseract.pytesseract.tesseract_cmd line sets the full path to our tesseract.exe engine. We use Python's TemporaryDirectory to store the intermediate files. It creates a real folder on disk (HDD/SSD) and deletes it together with its contents when the with block exits, so we don't have to clean up ourselves. To start using Tesseract, we need to convert the PDF document to images (TIFF/JPEG files), where 1 page = 1 image. pdf2image stores them in the temporary folder, and for each image we append the result of pytesseract.image_to_string(img_path) to full_raw_text, where img_path is the full path to the image.
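To make the role of the temporary folder concrete, here is a minimal sketch of how tempfile.TemporaryDirectory behaves. The file name page-01.jpg and its content are placeholders chosen for illustration, not what pdf2image actually writes.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as path:
    # Hypothetical page image; any file created in the folder behaves the same.
    page_file = os.path.join(path, "page-01.jpg")
    with open(page_file, "wb") as f:
        f.write(b"placeholder image bytes")
    # Inside the with block the file is a normal file on disk.
    existed_inside = os.path.exists(page_file)

# After the block, the folder and everything in it has been removed.
exists_after = os.path.exists(page_file)
print(existed_inside, exists_after)  # True False
```

This is why every call that reads the page images has to happen inside the with block: once the block exits, the image paths point to files that no longer exist.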
In the end, it took 51 seconds to process all 32 pages (pdf2image took 8 of those seconds to convert the PDF to images). Next, write the following code.
import tempfile
from pdf2image import convert_from_path
import pytesseract
import os
import time
import concurrent.futures
from concurrent.futures import ProcessPoolExecutor

# Full path to the tesseract executable inside the project folder
pytesseract.pytesseract.tesseract_cmd = r'C:/Users/Nuriq/Desktop/python_wrapper/tesseract/tesseract.exe'


def start_ocr(pdf_path, poppler_path):
    # The temporary folder must stay alive until every page has been read,
    # so the executor runs inside the same with block.
    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(
            pdf_path=pdf_path,
            output_folder=path,
            paths_only=True,
            fmt="jpeg",
            poppler_path=poppler_path,
            dpi=250,
            grayscale=True
        )
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
            # Tag every task with its page number. The paths returned by
            # convert_from_path are assumed to be in page order. A single
            # character of the file name is not enough once there are
            # ten or more pages.
            tasks = {
                executor.submit(pytesseract.image_to_string, img_path): page_number
                for page_number, img_path in enumerate(images_from_path)
            }
            # Results arrive in completion order, not in page order
            for future in concurrent.futures.as_completed(tasks):
                page_number = tasks[future]
                data = future.result(), page_number
                yield data


def sort_text(text):
    # Sort the (page_text, page_number) pairs by page number
    return sorted(text, key=lambda x: x[1])


def pdf_to_string(pdf_path, poppler_path):
    full_raw_text = start_ocr(
        pdf_path,
        poppler_path
    )
    full_text = ""
    text = sort_text(full_raw_text)
    for page_text, _ in text:
        full_text += page_text
    return full_text


if __name__ == '__main__':
    start_time = time.time()
    text = pdf_to_string(
        'C:/Users/Nuriq/Desktop/python_wrapper/PublicWaterMassMailing.pdf',
        'C:/Users/Nuriq/Desktop/python_wrapper/poppler-0.68.0/bin'
    )
    print(text)
    end_time = time.time()
    print(f"it took: {end_time-start_time}")
Run it, and this time we get 7 seconds (on a Core i7 12700), which is about 7.3x faster than the previous code sample.
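If you want to reproduce these measurements on your own machine, it is useful to time each stage separately and to compute the speedup explicitly. The following is a small sketch; timed and speedup are hypothetical helper names, and the numbers 51 and 7 are simply the timings quoted in this article.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(sequential_seconds, parallel_seconds):
    """How many times faster the parallel run is than the sequential one."""
    return sequential_seconds / parallel_seconds


print(round(speedup(51, 7), 1))  # 7.3
```

Wrapping the convert_from_path call and the recognition loop in two separate timed blocks (for example with labels "convert" and "ocr") shows how much of the total is spent in each stage.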
Using ProcessPoolExecutor with max_workers=os.cpu_count(), we run the Tesseract instances in parallel (19 of them on my machine), one task per image on the list. As each process finishes, we save its result together with the page number. At the end, the sort_text() function sorts the results by page number, and we merge them into one text.
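As a possible simplification, the explicit sorting step can be avoided with executor.map, which returns results in the order of its inputs even when the tasks finish out of order. This is a sketch of an alternative, not the code used for the timings above; ocr_pages_in_order is a hypothetical name, and the executor_cls parameter is an assumption added only so the helper can be tried with a different executor.

```python
from concurrent.futures import ProcessPoolExecutor


def ocr_pages_in_order(image_paths, ocr, max_workers=None,
                       executor_cls=ProcessPoolExecutor):
    """Apply `ocr` to every path in parallel and join the page texts.

    `ocr` is any callable that takes a path and returns a string, for
    example pytesseract.image_to_string. With ProcessPoolExecutor it must
    be a picklable, module-level function.
    """
    with executor_cls(max_workers=max_workers) as executor:
        # map() yields results in input order, so no page numbers are needed.
        return "".join(executor.map(ocr, image_paths))
```

Inside start_ocr, this would be called as ocr_pages_in_order(images_from_path, pytesseract.image_to_string, max_workers=os.cpu_count()), still within the TemporaryDirectory block so that the image files exist while they are read.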