Extracting Information from Scanned Documents with LLM Vision Models

Dhaval Nagar / CEO

We recently experimented with a project to extract specific information from scanned documents. By leveraging the latest Large Language Models (LLMs) with Vision capabilities, we were able to extract meaningful information efficiently and cost-effectively. This approach demonstrates the growing potential of LLM Vision Models in transforming traditional paper-based processes into more accessible digital formats.

Electricity bill scanned in the local Gujarati language and converted to JSON format

With advanced LLMs, businesses can process large volumes of scanned documents with fewer resources. These models improve the accuracy of data extraction and reduce the costs associated with manual processing, providing a scalable solution for various industries.

Tech Stack

The Claude 3 family of models has reasonably good vision capabilities, solid performance, and attractive pricing, so we decided to use Claude 3 Haiku for the initial testing. Depending on the use case, we prefer Serverless services to keep the solution scalable yet cost-efficient (a minimal invocation sketch follows the list below).

  • LLM Model: Amazon Bedrock with Claude 3 Haiku
  • Cloud Storage: Amazon S3
  • Database: Aurora MySQL
  • Compute: Lambda with Step Functions
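
To make the flow concrete, here is a minimal sketch, not our production code, of how a Lambda function could call Claude 3 Haiku through Amazon Bedrock with a scanned image. The region, model ID, and media type are assumptions; adjust them to your environment.

```python
import base64
import json
import boto3

# Bedrock runtime client; the region is an assumption, use the one where Bedrock is enabled for you.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def extract_fields(image_path: str, prompt: str) -> dict:
    """Send a scanned image plus an extraction prompt to Claude 3 Haiku
    and return the JSON the model produces."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # assumption: mobile photos as JPEG
                            "data": image_b64,
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

    response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    result = json.loads(response["body"].read())
    # The prompt instructs the model to reply with JSON only, so parse its text block directly.
    return json.loads(result["content"][0]["text"])
```

In the actual pipeline, a Step Functions workflow would trigger this logic when a new scan lands in S3 and write the parsed fields to Aurora MySQL.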

Output

Let's look at a few of the example scans and their output. The input resolution of each of the following images is adjusted so the image consumes roughly 1,000 tokens, which makes it easy to calculate the processing cost per image; a resizing sketch follows the note below.

Note: Some of these images were taken from Google Image search.
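
The resizing itself is straightforward. The sketch below assumes Pillow and Anthropic's published approximation of roughly one vision token per 750 pixels; exact token counts vary by model, so treat the budget as an estimate.

```python
from PIL import Image

TARGET_TOKENS = 1000
PIXELS_PER_TOKEN = 750  # Anthropic's rough estimate: tokens ~= (width * height) / 750

def resize_for_budget(path_in: str, path_out: str, target_tokens: int = TARGET_TOKENS) -> None:
    """Downscale an image so its estimated vision-token usage stays near the target."""
    img = Image.open(path_in)
    current_tokens = (img.width * img.height) / PIXELS_PER_TOKEN
    if current_tokens > target_tokens:
        scale = (target_tokens / current_tokens) ** 0.5  # shrink both dimensions equally
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    img.save(path_out)

# At Claude 3 Haiku's input pricing, a ~1000-token image costs roughly $0.00025
# to process, before counting the text prompt and output tokens.
```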

Electricity Bill in local Gujarati Language

A copy of the bill is uploaded as a mobile photo for further processing. The input prompt is adjusted to extract only the meaningful fields and to do a bit of math as well: instead of just parsing the values, the model also sums them to derive a new field, NetPayableAmount.
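
A prompt along these lines drives the extraction. Apart from NetPayableAmount, the field names are illustrative assumptions rather than the exact ones we used, and extract_fields refers to the Bedrock sketch above.

```python
# Illustrative prompt; only NetPayableAmount comes from the actual use case,
# the other field names are assumptions.
ELECTRICITY_BILL_PROMPT = """
The attached image is an electricity bill printed in Gujarati.
Extract the following fields and respond with JSON only, no extra text:
ConsumerName, ConsumerNumber, BillingPeriod, EnergyCharges, FixedCharges, Taxes.
Also add a derived field NetPayableAmount equal to the sum of
EnergyCharges, FixedCharges and Taxes.
Translate Gujarati labels and values to English where applicable.
"""

bill_data = extract_fields("electricity_bill.jpg", ELECTRICITY_BILL_PROMPT)
print(bill_data["NetPayableAmount"])
```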

Permanent Account Number (PAN) Photo Copy

This is a photo taken with a camera; the image is of my father's PAN card. Across several tries, the model detected the same information consistently.

FedEx Receipt Copy

This is a sample copy of a FedEx transfer receipt. The image is fairly clean and contains limited information, but the handwriting is difficult to read. Across repeated attempts, the output was quite consistent.

Fax Copy

This is a copy of a financial transaction, a very old one. The form and the scanned image are well structured and, most importantly, the content is typed rather than handwritten. From the extraction point of view, however, we had to adjust the prompt so that the model treats the entire form as a multi-page form and generates the JSON accordingly.
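
The exact wording we used is not reproduced here, but an instruction of roughly this shape conveys the multi-page treatment; the "pages" grouping is an assumed output structure.

```python
# Assumed phrasing: tells the model to treat one scan as a multi-page form.
FAX_FORM_PROMPT = """
The attached scan contains a financial transaction form that spans multiple
logical pages within a single image. Treat it as a multi-page form: group the
extracted fields under a "pages" array, one object per page, and respond with
JSON only.
"""

fax_data = extract_fields("fax_copy.jpg", FAX_FORM_PROMPT)
```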

Pilot Paperwork Sheet Copy

This was by far the most complicated scan. The page had medium resolution and the text is handwritten, difficult to read even for a person. Instead of extracting from the whole image, we only wanted certain fields, so the prompt was adjusted to focus on and extract only those values. Depending on the handwriting, the model made some consistent parsing errors, mostly in the descriptive text.
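
For targeted extraction, the prompt simply enumerates the wanted fields and asks for null where a value is unreadable. The field names below are hypothetical; they are not the actual fields from this scan.

```python
# Hypothetical field list for illustration only.
PILOT_SHEET_FIELDS = ["FlightDate", "AircraftRegistration", "DepartureAirport",
                      "ArrivalAirport", "TotalFlightTime"]

PILOT_SHEET_PROMPT = f"""
The attached scan is a handwritten pilot paperwork sheet. Ignore everything
except these fields: {", ".join(PILOT_SHEET_FIELDS)}.
If a value is unreadable, return null for that field. Respond with JSON only.
"""

sheet_data = extract_fields("pilot_sheet.jpg", PILOT_SHEET_PROMPT)
```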

Cost and Performance

Newer LLMs now support multi-modal inputs such as text, audio, and images. This opens up a new category of applications that previously needed dedicated processing pipelines. The idea of this use case was to highlight how efficiently we can extract complex information without standing up infrastructure and services like OCR engines and pre-trained ML models.

Anthropic Claude 3 Haiku is currently priced at $0.00025 per 1,000 input tokens. In comparison, the Google Gemini 1.0 Pro Vision model charges $0.0025 per image. For use cases like this, where the expected output is largely the same across different models, cost per use and performance become the critical parameters for scaling.
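
Ignoring the text prompt and output tokens, the per-image arithmetic at a ~1,000-token input budget works out roughly as follows.

```python
# Back-of-the-envelope cost per scanned image at ~1000 input tokens.
HAIKU_PRICE_PER_1K_INPUT_TOKENS = 0.00025   # USD, from the figures above
GEMINI_1_0_PRO_VISION_PER_IMAGE = 0.0025    # USD, from the figures above

image_tokens = 1000
haiku_cost_per_image = (image_tokens / 1000) * HAIKU_PRICE_PER_1K_INPUT_TOKENS

print(f"Claude 3 Haiku:        ${haiku_cost_per_image:.5f} per image")          # $0.00025
print(f"Gemini 1.0 Pro Vision: ${GEMINI_1_0_PRO_VISION_PER_IMAGE:.4f} per image")  # $0.0025
print(f"Ratio: {GEMINI_1_0_PRO_VISION_PER_IMAGE / haiku_cost_per_image:.0f}x")     # ~10x
```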

Summary

We are still scratching the surface of what can be done with the latest LLM models. I believe the number of applications will grow multi-fold in the near future, as traditionally complex processes become much easier to do with just an API call.

Reach out to us if you want to explore similar use cases.

