How did I analyze 5 million PDFs using Tesseract OCR on EC2?
Before getting into the project details, I'd like to tell you a bit about Tesseract.
When I heard about Tesseract for the first time, I thought it would be something related to the Marvel Universe 🔨, but it was not. Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License.
Today, extracting information from scanned documents such as letters, write-ups, invoices, etc. has become an integral part of many business processes. To accomplish this task, you need to set up OCR software that pulls the information out of these scanned documents or PDFs.
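To make that concrete, here is a minimal sketch of such an extraction step. It shells out to the `tesseract` CLI plus poppler's `pdftoppm` to render PDF pages to images first (Tesseract itself reads images, not PDFs); both tools are assumed to be installed, and the function names and work directory are my own choices, not the author's script:

```python
import subprocess
from pathlib import Path

def render_cmd(pdf_path: str, prefix: str, dpi: int = 300) -> list[str]:
    # pdftoppm ships with poppler-utils; writes prefix-1.png, prefix-2.png, ...
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]

def ocr_cmd(image_path: str, lang: str = "eng") -> list[str]:
    # "stdout" tells tesseract to print the recognized text instead of writing a file
    return ["tesseract", image_path, "stdout", "-l", lang]

def pdf_to_text(pdf_path: str, workdir: str = "/tmp/ocr") -> str:
    """Render each page of a scanned PDF to PNG, then OCR every page."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    subprocess.run(render_cmd(pdf_path, str(out / stem)), check=True)
    pages = sorted(out.glob(stem + "-*.png"))
    return "\n".join(
        subprocess.run(ocr_cmd(str(p)), capture_output=True, text=True,
                       check=True).stdout
        for p in pages
    )
```

Calling `pdf_to_text("invoice-0001.pdf")` (hypothetical file name) returns the recognized text of all pages joined together.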
While doing this project I faced multiple challenges, such as:
- How to use Tesseract 5.0, when Ubuntu ships the 4.x.x version by default?
- How to download millions of files onto EC2?
- How to reuse the files I had already downloaded while testing on my local machine? 😛
- How to choose a cost-effective EC2 instance?
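Most of these came down to bookkeeping. The "reuse what I already downloaded" problem, for example, can be handled by diffing the wanted file list against what is already on disk before downloading anything, so a job restarted on EC2 skips everything finished on the laptop. This is a sketch under my own assumptions (flat directory, files identified by name), not the author's actual script:

```python
from pathlib import Path

def remaining_downloads(wanted_names, local_dir):
    """Return the names from `wanted_names` not yet present in `local_dir`,
    so a restarted (or relocated) download job only fetches missing files."""
    have = {p.name for p in Path(local_dir).iterdir() if p.is_file()}
    return [name for name in wanted_names if name not in have]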
Install Tesseract 5.0 on an Ubuntu machine:
- For this project, I used Amazon Web Services. If you don't have an AWS account, create one before moving on to the next steps.
- Go to EC2 → click Launch instance → keep the default configuration. For this I used a t4g.micro (arm64 architecture).
- Install Tesseract 5.0 on the Ubuntu machine.
- Install the CloudWatch agent for monitoring CPU and memory.
At this point, the server is ready to analyze PDFs.
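Since the stock Ubuntu repositories ship Tesseract 4.x, the 5.0 install has to come from a newer source (a third-party PPA or a source build; I'm assuming the route here, the post doesn't say). Either way, before kicking off millions of PDFs it's worth verifying that the binary on the PATH really is 5.x:

```python
import re
import subprocess

def parse_major(version_output: str) -> int:
    # The banner of `tesseract --version` starts like "tesseract 5.3.0"
    # (or "tesseract 4.1.1" on a stock Ubuntu install).
    m = re.search(r"tesseract\s+v?(\d+)", version_output)
    if not m:
        raise ValueError("unrecognized `tesseract --version` output")
    return int(m.group(1))

def check_tesseract_is_5() -> None:
    res = subprocess.run(["tesseract", "--version"],
                         capture_output=True, text=True)
    # Older builds print the version banner to stderr, newer ones to stdout.
    major = parse_major(res.stdout + res.stderr)
    if major < 5:
        raise RuntimeError(f"expected Tesseract 5.x, found {major}.x")
```

Running `check_tesseract_is_5()` right after the install would have caught a silent fallback to the distro's 4.x package.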
For the SFTP client I used Cyberduck, because it's easy to use:
https://cyberduck.io
Run Python code on the Ubuntu machine:
- Make sure to run your Python script in the background; otherwise, you could lose your process to an unwanted shell termination 😜.
nohup python3 test17.py &
- After analyzing 50 PDFs on the t4g machine, I realized it was taking far too long for my requirements 😴.
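The author's test17.py isn't shown, but one likely reason a single script feels slow is that it OCRs one PDF at a time while the other vCPUs sit idle. An OCR worker that shells out to tesseract spends its time in a subprocess (which releases the GIL), so a plain thread pool is enough to keep every core busy. A sketch, with `worker` standing in for whatever per-PDF function the script uses:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(paths, worker, max_workers=2):
    """Apply `worker` (e.g. a function that shells out to tesseract for one
    PDF) to every path, `max_workers` at a time. Sized to the instance's
    vCPU count this keeps all cores busy, even though the workers are
    threads, because the subprocess calls release the GIL."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))
```

Usage would look like `run_batch(pdf_paths, my_ocr_function, max_workers=2)` on a 2-vCPU machine.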
Changed the EC2 machine to a highly CPU-optimized Spot EC2 machine:
- As our server was taking a very long time to process 50 PDFs, I decided to use a c6g.large machine with 2 cores, 4 GiB of RAM, and a 50 GiB NVMe EBS volume.
- I took an AMI from the previous machine and launched another EC2 instance with the same configuration as a Spot instance. Keep in mind to make the request Persistent; otherwise, you can lose your data if the spot price goes higher than your bid price.
- After changing the EC2 machine I got slightly better performance compared to the previous t4g.micro. This time I analyzed around 100 PDFs of 4–5 MB each within 1–2 hours.
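That throughput is worth projecting forward, because it makes the case for the bigger instance chosen below. At roughly 100 PDFs per 1.5 hours (my midpoint of the observed 1–2 hour range), a single c6g.large would need years to get through 5 million files:

```python
# Back-of-the-envelope projection at the observed c6g.large rate.
pdfs_done = 100
hours_taken = 1.5                       # midpoint of the 1-2 hour range
total_pdfs = 5_000_000

rate = pdfs_done / hours_taken          # ~66.7 PDFs per hour
total_hours = total_pdfs / rate         # 75,000 hours
total_years = total_hours / (24 * 365)  # roughly 8.6 years on one machine
print(f"{total_hours:,.0f} hours ≈ {total_years:.1f} years")
```

So at this rate, one machine is nowhere near enough; scaling up the instance (and parallelizing the work) was unavoidable.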
- Approximate cost that I paid for running one c6g.large Spot EC2 instance for 2 days.
- After comparing the t4g.micro vs. the c6g.large, I chose a c6g.2xlarge for parsing the data from 5 million PDFs into one Excel file.
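One caveat on the "one Excel file" goal: a single .xlsx sheet tops out at 1,048,576 rows, so 5 million per-PDF results need multiple sheets or a plainer format. Here is a sketch of the aggregation step using the stdlib `csv` module (which Excel opens directly); the two-column layout is my assumption, not the author's:

```python
import csv

def append_results(rows, out_path="results.csv"):
    """Append (pdf_name, extracted_text) pairs to one growing CSV, so a
    crashed or restarted run never has to rewrite earlier results."""
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

Each batch of OCR output can then be flushed with `append_results(batch)` as soon as it finishes.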