How did I analyze 5 million PDFs using Tesseract OCR on EC2?
Before going into the project details, I would like to explain what Tesseract is.
When I heard about Tesseract for the first time, I thought it would be something related to the Marvel Universe 🔨 but it was not. Tesseract is an optical character recognition (OCR) engine that runs on various operating systems. It is free software, released under the Apache License.
Today, extracting information from scanned documents such as letters, write-ups, and invoices has become an integral part of many business processes. To accomplish this task, you need to set up OCR software that extracts the information from these scanned documents or PDFs.
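To make the idea concrete, here is a minimal sketch of calling the Tesseract command-line tool from Python on a single scanned page. It assumes the `tesseract` binary is already installed and on the PATH, and that the page is already an image file (Tesseract reads images, so a PDF page would have to be converted to an image first). The function names `build_tesseract_command` and `ocr_image` and the file name `page.png` are made up for illustration; this is not necessarily how the project itself called Tesseract.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the CLI call: tesseract <image> stdout -l <lang>.

    Passing "stdout" as the output base makes Tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    # Assumption: the tesseract binary is installed and on the PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a hypothetical scanned page used only as an example.
    print(build_tesseract_command("page.png"))
```

With Tesseract installed, `ocr_image("page.png")` would return the text of that page as a string, which you can then search or parse for the fields you need.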
While doing this project, I faced multiple challenges, such as:
- How to use Tesseract 5.0 when the default installation comes with version 4.x.x?
- How to download millions of files on EC2?
- How to reuse the files that I had already downloaded while working on my local machine? 😛
- How to choose a cost-effective EC2 instance?
Install Tesseract 5.0 on an Ubuntu machine:
- For this project, I used Amazon Web Services. If you don’t have an AWS account, then create one before moving on to the next steps.