Extracting Information from Scanned Documents with LLM Vision Models

Dhaval Nagar / CEO

We recently experimented with a project to extract specific information from scanned documents. By leveraging the latest Large Language Models (LLMs) with Vision capabilities, we were able to extract meaningful information efficiently and cost-effectively. This approach demonstrates the growing potential of LLM Vision Models in transforming traditional paper-based processes into more accessible digital formats.

Electricity bill scanned in the local Gujarati language and converted to JSON format

With advanced LLMs, businesses can process large volumes of scanned documents with fewer resources. These models improve the accuracy of data extraction and reduce the costs associated with manual processing, providing a scalable solution for various industries.

Tech Stack

The Claude 3 family of models has reasonably good vision capabilities, solid performance, and attractive pricing, so we decided to use Claude 3 Haiku for the initial testing. Depending on the use case, we prefer Serverless services to keep the solution scalable yet cost-efficient (a minimal invocation sketch follows the list below).

  • LLM Model: Amazon Bedrock with Claude 3 Haiku
  • Cloud Storage: Amazon S3
  • Database: Aurora MySQL
  • Compute: Lambda with Step Functions
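
To make the flow concrete, here is a minimal sketch, not our production code, of how a Lambda function could call Claude 3 Haiku through Amazon Bedrock with a scanned image. The region, model ID, and media type are assumptions; adjust them to your environment.

```python
import base64
import json
import boto3

# Bedrock runtime client; the region is an assumption, use the one where Bedrock is enabled for you.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def extract_fields(image_path: str, prompt: str) -> dict:
    """Send a scanned image plus an extraction prompt to Claude 3 Haiku
    and return the JSON the model produces."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",  # assumption: mobile photos as JPEG
                            "data": image_b64,
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

    response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    result = json.loads(response["body"].read())
    # The prompt instructs the model to reply with JSON only, so parse its text block directly.
    return json.loads(result["content"][0]["text"])
```

In the actual pipeline, a Step Functions workflow would trigger this logic when a new scan lands in S3 and write the parsed fields to Aurora MySQL.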

Output

Let's look at a few of the example scans and their output. The input resolution of each of the following images is adjusted so the image consumes roughly 1,000 tokens, which makes it easy to calculate the processing cost per image; a resizing sketch follows the note below.

Note: Some of these images were taken from Google Image search.
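
The resizing itself is straightforward. The sketch below assumes Pillow and Anthropic's published approximation of roughly one vision token per 750 pixels; exact token counts vary by model, so treat the budget as an estimate.

```python
from PIL import Image

TARGET_TOKENS = 1000
PIXELS_PER_TOKEN = 750  # Anthropic's rough estimate: tokens ~= (width * height) / 750

def resize_for_budget(path_in: str, path_out: str, target_tokens: int = TARGET_TOKENS) -> None:
    """Downscale an image so its estimated vision-token usage stays near the target."""
    img = Image.open(path_in)
    current_tokens = (img.width * img.height) / PIXELS_PER_TOKEN
    if current_tokens > target_tokens:
        scale = (target_tokens / current_tokens) ** 0.5  # shrink both dimensions equally
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    img.save(path_out)

# At Claude 3 Haiku's input pricing, a ~1000-token image costs roughly $0.00025
# to process, before counting the text prompt and output tokens.
```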

Electricity Bill in local Gujarati Language

A copy of the bill is uploaded as a mobile photo for further processing. The input prompt is adjusted to extract only the meaningful fields and to do a bit of math as well: instead of just parsing the values, the model also sums them to derive a new field, NetPayableAmount.
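
A prompt along these lines drives the extraction. Apart from NetPayableAmount, the field names are illustrative assumptions rather than the exact ones we used, and extract_fields refers to the Bedrock sketch above.

```python
# Illustrative prompt; only NetPayableAmount comes from the actual use case,
# the other field names are assumptions.
ELECTRICITY_BILL_PROMPT = """
The attached image is an electricity bill printed in Gujarati.
Extract the following fields and respond with JSON only, no extra text:
ConsumerName, ConsumerNumber, BillingPeriod, EnergyCharges, FixedCharges, Taxes.
Also add a derived field NetPayableAmount equal to the sum of
EnergyCharges, FixedCharges and Taxes.
Translate Gujarati labels and values to English where applicable.
"""

bill_data = extract_fields("electricity_bill.jpg", ELECTRICITY_BILL_PROMPT)
print(bill_data["NetPayableAmount"])
```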

Permanent Account Number (PAN) Photo Copy

This is a photo taken with a camera; the image is of my father's PAN card. Across several tries, the model detected the same information consistently.

FedEx Receipt Copy

This is a sample copy of a FedEx transfer receipt. The image is fairly clean and contains limited information, but the handwriting is difficult to read. Across repeated attempts, the output was quite consistent.

Fax Copy

This is a copy of a financial transaction, a very old one. The form and the scanned image are well structured and, most importantly, the content is typed rather than handwritten. From the extraction point of view, however, we had to adjust the prompt so that the model treats the entire form as a multi-page form and generates the JSON accordingly.
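
The exact wording we used is not reproduced here, but an instruction of roughly this shape conveys the multi-page treatment; the "pages" grouping is an assumed output structure.

```python
# Assumed phrasing: tells the model to treat one scan as a multi-page form.
FAX_FORM_PROMPT = """
The attached scan contains a financial transaction form that spans multiple
logical pages within a single image. Treat it as a multi-page form: group the
extracted fields under a "pages" array, one object per page, and respond with
JSON only.
"""

fax_data = extract_fields("fax_copy.jpg", FAX_FORM_PROMPT)
```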

Pilot Paperwork Sheet Copy

This was by far the most complicated scan. The page had medium resolution and the text is handwritten, difficult to read even for a person. Instead of extracting from the whole image, we only wanted certain fields, so the prompt was adjusted to focus on and extract only those values. Depending on the handwriting, the model made some consistent parsing errors, mostly in the descriptive text.
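
For targeted extraction, the prompt simply enumerates the wanted fields and asks for null where a value is unreadable. The field names below are hypothetical; they are not the actual fields from this scan.

```python
# Hypothetical field list for illustration only.
PILOT_SHEET_FIELDS = ["FlightDate", "AircraftRegistration", "DepartureAirport",
                      "ArrivalAirport", "TotalFlightTime"]

PILOT_SHEET_PROMPT = f"""
The attached scan is a handwritten pilot paperwork sheet. Ignore everything
except these fields: {", ".join(PILOT_SHEET_FIELDS)}.
If a value is unreadable, return null for that field. Respond with JSON only.
"""

sheet_data = extract_fields("pilot_sheet.jpg", PILOT_SHEET_PROMPT)
```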

Cost and Performance

Newer LLMs now support multi-modal inputs such as text, audio, and images. This opens up a new category of applications that previously needed dedicated processing pipelines. The idea of this use case was to highlight how efficiently we can extract complex information without standing up infrastructure and services like OCR engines and pre-trained ML models.

Anthropic Claude 3 Haiku is currently priced at $0.00025 per 1,000 input tokens. In comparison, the Google Gemini 1.0 Pro Vision model charges $0.0025 per image. For use cases like this, where the expected output is largely the same across different models, cost per use and performance become the critical parameters for scaling.
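
Ignoring the text prompt and output tokens, the per-image arithmetic at a ~1,000-token input budget works out roughly as follows.

```python
# Back-of-the-envelope cost per scanned image at ~1000 input tokens.
HAIKU_PRICE_PER_1K_INPUT_TOKENS = 0.00025   # USD, from the figures above
GEMINI_1_0_PRO_VISION_PER_IMAGE = 0.0025    # USD, from the figures above

image_tokens = 1000
haiku_cost_per_image = (image_tokens / 1000) * HAIKU_PRICE_PER_1K_INPUT_TOKENS

print(f"Claude 3 Haiku:        ${haiku_cost_per_image:.5f} per image")          # $0.00025
print(f"Gemini 1.0 Pro Vision: ${GEMINI_1_0_PRO_VISION_PER_IMAGE:.4f} per image")  # $0.0025
print(f"Ratio: {GEMINI_1_0_PRO_VISION_PER_IMAGE / haiku_cost_per_image:.0f}x")     # ~10x
```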

Summary

We are still scratching the surface of what can be done with the latest LLM models. I believe the number of applications will grow multi-fold in the near future, as traditionally complex processes become much easier to do with just an API call.

Reach out to us if you want to explore similar use cases.

