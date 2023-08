Too Long; Didn't Read

OCR (Optical Character Recognition) is the technology that extracts text from images, and Tesseract is one of the most popular OCR engines; it is free and open source. Tesseract uses one core to recognize an image. In average cases that is enough, but “heavy” documents with many pages are recognized very slowly. The basic idea is to use Python’s built-in multiprocessing features to split a document into separate pages and run multiple Tesseract engine instances that recognize the pages in parallel.
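The idea can be sketched in a few lines. This is a minimal illustration rather than the article's actual implementation: it assumes the document has already been split into one image file per page, that the `tesseract` command-line tool is installed and on the `PATH`, and that English (`eng`) is the right language. The helper names `tesseract_command`, `ocr_page`, and `recognize_pages` are hypothetical.

```python
import multiprocessing
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the CLI call that prints one page's recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_page(image_path):
    """Recognize a single page image in its own Tesseract process."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout


def recognize_pages(image_paths, workers=None):
    """Recognize pages in parallel and return their text in page order."""
    if not image_paths:
        return []
    # Assumption: one worker per page, capped at the number of CPU cores.
    workers = workers or min(len(image_paths), multiprocessing.cpu_count())
    with multiprocessing.Pool(processes=workers) as pool:
        # map() preserves input order, so result i belongs to page i.
        return pool.map(ocr_page, image_paths)
```

A call such as `recognize_pages(["page-1.png", "page-2.png"])` (placeholder file names) should be made from under an `if __name__ == "__main__":` guard, because on some platforms `multiprocessing` re-imports the main module in each worker process.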