Businesses large and small often deal with high volumes of invoices, which can be time-consuming and error-prone to process manually. Advancements in AI are transforming the way companies handle repetitive tasks like invoice processing. Enter Llama 3.2-11B, a powerful multimodal model that can extract and understand text from images, making it well suited to automating bulk invoice processing.
This guide will walk you through the process of using Llama 3.2-11B for bulk invoice processing. Whether you’re new to AI or looking to implement it into your workflow, we’ll cover everything from setting up the system to extracting data and generating results in just a few simple steps. By the end of this guide, you'll be able to automate invoice processing efficiently, saving time and reducing the risk of manual errors.
The USP of Llama 3.2-11B
Llama 3.2-11B is a state-of-the-art multimodal large language model developed by Meta, featuring 11 billion parameters. This model is part of the Llama 3.2 series, which integrates both text and image processing capabilities, enabling it to perform complex tasks that involve reasoning over visual and textual data.
Key Features
- Multimodal Capabilities: Llama 3.2-11B can process both images and text, allowing it to handle tasks such as image captioning, visual question answering, and document understanding. This makes it suitable for applications that require comprehension of visual content alongside textual information.
- High-Resolution Image Processing: The model is designed to work with high-resolution images, leveraging a unique architecture that includes an image encoder integrated through cross-attention layers. This integration allows the model to effectively bridge the gap between visual inputs and language outputs.
- Extended Context Length: It supports a context length of up to 128,000 tokens, significantly enhancing its ability to manage extensive data inputs.
Performance and Applications
Llama 3.2-11B has been optimized for various tasks, including:
- Visual Recognition: Identifying objects and scenes in images.
- Image Reasoning: Answering questions about images based on their content.
- Caption Generation: Creating descriptive text for images.
- Document Visual Question Answering: Interpreting documents with visual elements like graphs or charts.
The model has been trained on a large dataset of approximately 6 billion image-text pairs, which enhances its performance across diverse benchmarks, outperforming many existing multimodal models.
How to Get Started with TIR AI Platform
You can get started with the TIR AI / ML Platform here. Here are some screenshots to help you navigate through the platform.
Go to the Nodes option on the left side of the screen and open the dropdown menu to create a new node.
Select a disk size for the node: 50GB works just fine for this use case, though you may need more (for example, 100GB) if your workload grows.
Hit Launch to get started with your TIR Node.
When the Node is ready to be used, it will show the Jupyter Lab logo. Click the logo to activate your workspace.
Select the Python 3 (ipykernel) kernel, then choose the Notebook option to get your Jupyter Notebook ready. Now you are ready to start coding.
Let’s Code
Step 1: Install Required Libraries
Begin by installing the libraries needed for the project.
These libraries include:
- sentence-transformers: For text embeddings.
- transformers: Provides pre-trained models for various NLP tasks.
- qdrant-client: Used for vector similarity search with Qdrant.
- gradio: For building web interfaces to interact with your models.
- Pillow: For handling image data.
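A minimal install cell for the TIR notebook might look like the following sketch; requests is added here as well, since it is used later to fetch images from URLs:

```python
# Install the project dependencies from inside the Jupyter notebook
!pip install -q sentence-transformers transformers qdrant-client gradio Pillow requests
```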
Step 2: Import Required Libraries
Once the libraries are installed, import them into your environment:
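The exact import list depends on your implementation; a typical set for this pipeline looks like this:

```python
import torch
import requests
from PIL import Image

from transformers import MllamaForConditionalGeneration, MllamaProcessor
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance

import gradio as gr
```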
Step 3: Initialize the Models and Services
This block of code is responsible for setting up and initializing the various models and services required for the bulk invoice processing system.
- Loading the Llama Model: The first part of the code initializes the core of the project, the Llama-3.2-11B Vision model. The MllamaForConditionalGeneration class is used to load the pre-trained Llama model. This model is capable of understanding and processing visual data (in this case, images of invoices) and can generate textual responses based on that input. The device_map="auto" ensures that the model is mapped to the available hardware (such as GPUs) for faster processing, while torch_dtype=torch.bfloat16 helps in optimizing memory usage without sacrificing precision.
- Setting Up the Processor: The MllamaProcessor is another key component, responsible for preparing the input data (invoices or images) for the model. It processes images, converts them into a format that the model can understand, and adds any necessary prompts or templates required for generating responses. This processor bridges the gap between the raw data and the Llama model, ensuring smooth communication.
- Initializing the Embedder: The SentenceTransformer from the sentence-transformers library is loaded next. Specifically, the "all-mpnet-base-v2" model is used to generate high-quality text embeddings. This model converts text into a fixed-size vector representation, making it easier to compare, search, and analyze the text in a mathematical way. These embeddings are critical for efficiently storing and retrieving relevant information from the Qdrant database.
- Setting Up Qdrant Client: The final part of the block initializes a Qdrant client in memory (:memory:). Qdrant is a vector database designed to handle the embeddings generated from the text. It allows the system to perform fast and efficient searches, making it possible to retrieve relevant data quickly. For the sake of this example, the database is stored in memory, but it can be configured to persist data for larger-scale operations.
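Putting these pieces together, a minimal initialization sketch (using the imports from Step 2) might look like this. The model ID shown is the public Llama-3.2-11B-Vision-Instruct checkpoint, and gated access to it on Hugging Face is assumed:

```python
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint; requires gated access

# Load the vision-language model, sharding it across available GPUs in bfloat16
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Processor that converts images plus prompts into model-ready tensors
processor = MllamaProcessor.from_pretrained(model_id)

# Sentence embedder used to turn extracted invoice text into 768-dimensional vectors
embedder = SentenceTransformer("all-mpnet-base-v2")

# In-memory Qdrant instance; replace ":memory:" with a server URL to persist data
qdrant_client = QdrantClient(":memory:")
```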
Step 4: Extract Text from Images
The function convert_url_to_text(url) is designed to take an image URL and extract any text present in the image. It works by first fetching the image from the URL using the requests library and then opening it with the help of the Python Imaging Library (PIL). Once the image is loaded, it is passed through the Llama 3.2-11B model, which has been pre-trained to recognize and extract text from images. The text extracted is then returned for further use. This part of the code is vital because it serves as the foundation of the entire invoice processing pipeline—turning visual information into readable, machine-usable text.
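A sketch of this function is shown below; the prompt wording and generation parameters are assumptions rather than the article's exact values:

```python
def convert_url_to_text(url: str) -> str:
    """Fetch an invoice image from a URL and ask the model to read its text."""
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    # Chat-style prompt asking the model to transcribe the invoice
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all the text from this invoice image."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=512)
    generated = output[0][inputs["input_ids"].shape[-1]:]  # keep only the newly generated tokens
    return processor.decode(generated, skip_special_tokens=True)
```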
Step 5: Generate Embeddings from Extracted Text
The generate_embeddings function plays a crucial role in converting the extracted text from images into a format that can be effectively processed by the system. It takes the raw text as input and generates embeddings, which are numerical representations (vectors) of the text.
The core of this function lies in the embedder.encode() method, where the actual transformation happens. The embedder is a pre-trained model that understands the meaning and context of the text, turning it into a vector that reflects its semantic content. By using these embeddings, the system can perform similarity searches, compare different texts, and retrieve related information much faster and more efficiently.
This embedding step is foundational for any advanced query-based system. It ensures that the extracted invoice text is stored in a way that the system can search and retrieve relevant information accurately, making the process of bulk invoice handling much more streamlined and scalable.
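Assuming the embedder initialized earlier, the function can be as short as this sketch:

```python
def generate_embeddings(text: str):
    """Convert raw invoice text into a fixed-size vector (768 dimensions for all-mpnet-base-v2)."""
    return embedder.encode(text)
```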
Step 6: Save Embeddings to the Qdrant Vector Database
This block of code is responsible for storing the text embeddings, along with the associated text, into the Qdrant vector database. Here's how it works:
- Function Purpose: The function save_embedding_to_qdrant takes three parameters:
- embedding: This is the vector representation of the extracted text (created by the embedding model).
- text: The original text that was processed from the image.
- vector_id: A unique identifier for the entry, which helps in distinguishing between different pieces of data in the Qdrant database.
- Upserting the Data: Inside the function, the qdrant_client.upsert() method is used to add or update data in the Qdrant vector database. The method ensures that the data is inserted into a specific collection, in this case, "image_text_collection". A collection in Qdrant is similar to a table in a traditional database — it stores related data points.
- Structuring the Data: The PointStruct object is used to organize the data before it’s inserted into the database. Each point has:
- An id (which is the vector_id in this case) to uniquely identify it.
- A vector, which is the embedding created from the extracted text. This vector is a fixed-size representation that allows for quick similarity searches.
- A payload, which contains the original text that was processed from the image, so that the text can be easily retrieved along with its embedding.
By saving the text embeddings in the Qdrant database, this function allows the system to later query and retrieve relevant data efficiently. Storing embeddings in this format enables the system to perform vector-based searches, which are crucial for comparing and retrieving similar texts when processing bulk invoices.
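A sketch of the collection setup and the upsert function, assuming cosine distance, a recent qdrant-client, and the 768-dimensional embeddings produced above:

```python
COLLECTION = "image_text_collection"

# Create the collection once; the vector size must match the embedder's output dimension
if not qdrant_client.collection_exists(COLLECTION):
    qdrant_client.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )

def save_embedding_to_qdrant(embedding, text, vector_id):
    """Insert or update one embedding, together with its source text, in Qdrant."""
    qdrant_client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(
                id=vector_id,               # unique integer ID for this entry
                vector=embedding.tolist(),  # the embedding created from the extracted text
                payload={"text": text},     # keep the original text retrievable with the vector
            )
        ],
    )
```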
Step 7: Retrieve the Relevant Texts
Once the text is indexed, we need a way to retrieve relevant information based on a query. This is where the retrieve_relevant_text(query) function comes in. Given a user’s query, it generates an embedding for the query text using the same sentence-transformer embedder, then searches the Qdrant database for the most similar vectors, i.e., the texts most relevant to the query. This provides a fast and accurate way to find specific information within the large dataset of invoice text, much like a search engine tailored to the data stored in Qdrant.
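A sketch of the retrieval function, where the number of results returned (top_k) is an assumed parameter:

```python
def retrieve_relevant_text(query: str, top_k: int = 3):
    """Embed the query and return the most similar stored invoice texts from Qdrant."""
    query_vector = generate_embeddings(query)
    hits = qdrant_client.search(
        collection_name=COLLECTION,
        query_vector=query_vector.tolist(),
        limit=top_k,
    )
    # Each hit carries the payload stored at upsert time
    return [(hit.id, hit.payload["text"]) for hit in hits]
```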
Step 8: Generate the Final Answer
The final step in the process is executed by the generate_answer(context, query) function. After the relevant texts have been retrieved, this function utilizes the Llama 3.2-11B Vision model to generate a coherent and detailed response. The function takes the retrieved context and the user's query as inputs and formulates a human-friendly answer. Using a structured message template, it prompts the model to provide a clear, readable response. The output is then decoded, refined, and returned to the user, offering a precise answer based on the uploaded invoices and the user’s query.
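A sketch of the answer-generation step; the prompt template is an assumption, and it feeds the model a text-only prompt built from the retrieved context:

```python
def generate_answer(context: str, query: str) -> str:
    """Ask the model to answer the user's question using the retrieved invoice text."""
    messages = [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": f"Here is text extracted from one or more invoices:\n{context}\n\n"
                    f"Answer the following question clearly and concisely: {query}",
        }],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=512)
    generated = output[0][inputs["input_ids"].shape[-1]:]  # decode only the model's answer
    return processor.decode(generated, skip_special_tokens=True)
```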
Step 9: Main Function to Process Images and Query the System
This block of code is the core function that ties together the various components of the system, processing images, extracting text, generating embeddings, and querying the database to provide meaningful responses. Here's how it works:
- Function Overview: The process function takes in two inputs:
- images: A list of uploaded image files, which contain the data to be processed.
- query: A text-based query that will be used to search through the extracted text for relevant information.
- ID Tracking for Qdrant: The function starts by initializing a vector_id at 1, which is used to uniquely identify and store each embedding into the Qdrant database. Additionally, it maintains a list all_contexts to collect the extracted text from all the images.
- Processing Each Image: For every image in the uploaded list:
- The image.name is extracted to keep track of the image file being processed.
- Text Extraction: It calls the extract_text_from_image function to extract the text content from the image, which is then stored in all_contexts.
- Generate Embeddings: The extracted text is passed through the generate_embeddings function to create a vector representation.
- Save to Qdrant: The generated embedding, along with the extracted text and a unique ID (vector_id), is saved into the Qdrant vector database using the save_embedding_to_qdrant function.
- Combining Contexts: Once all the images are processed, the texts collected from each image are combined into a single string (combined_context). This is used to provide a more comprehensive context for generating answers.
- Handling the Query:
- Embedding the Query: The query text is also passed through the generate_embeddings function to create a vector representation, similar to the image text.
- Querying Qdrant: The function queries the Qdrant database using the query’s embedding to retrieve any matching or relevant text entries that were stored from the images.
- Formatting Retrieved Results: The retrieved results are structured in a readable format, showing the ID and text of each matching result.
- Generating a Final Answer: The generate_answer function is used to produce a final answer by using the combined context from all images and the provided query. This ensures that the response is relevant and detailed, based on all available information.
- Return Values: The function returns the combined context (the extracted text from all images) and the final answer to the query. These are displayed to the user via the Gradio interface.
This function acts as the main engine for processing images and generating responses, orchestrating the flow between extracting text, embedding data, querying the database, and providing a coherent answer based on the user’s input.
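A condensed sketch of this flow is below. The implementation of extract_text_from_image is an assumption here: it is treated as the local-file counterpart of the URL-based extraction shown in Step 4.

```python
def extract_text_from_image(path: str) -> str:
    """Local-file counterpart of convert_url_to_text: read an invoice from disk and extract its text."""
    image = Image.open(path).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all the text from this invoice image."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    generated = output[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)

def process(images, query):
    """Extract text from each uploaded invoice, index it in Qdrant, then answer the query."""
    vector_id = 1
    all_contexts = []

    for image in images:
        # Gradio file uploads expose .name, the path of the temporary file on disk
        text = extract_text_from_image(image.name)
        all_contexts.append(text)

        # Embed the extracted text and store it with a unique ID
        embedding = generate_embeddings(text)
        save_embedding_to_qdrant(embedding, text, vector_id)
        vector_id += 1

    # Combine the texts from all images into one context string
    combined_context = "\n\n".join(all_contexts)

    # Query Qdrant for the stored entries most relevant to the question and format them for inspection
    retrieved = retrieve_relevant_text(query)
    print("\n".join(f"ID {pid}: {text}" for pid, text in retrieved))

    # Generate the final answer from the combined context and the user's question
    answer = generate_answer(combined_context, query)
    return combined_context, answer
```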
Step 10: Gradio Interface for User Interaction
Finally, the gradio_interface() function brings everything together into a user-friendly interface using Gradio. It allows users to upload images (invoices) and ask queries about the content. When the user provides a query, the interface calls the process() function, which processes the images, extracts the text, stores it in Qdrant, and generates the final answer. The Gradio interface simplifies interaction by allowing users to directly upload files and input queries through a web-based UI, making the bulk invoice processing workflow accessible to non-technical users.
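A minimal Gradio sketch along these lines, where the labels and layout are assumptions:

```python
def gradio_interface():
    """Web UI: upload invoice images, ask a question, and see the extracted text plus the answer."""
    demo = gr.Interface(
        fn=process,
        inputs=[
            gr.File(file_count="multiple", label="Upload invoice images"),
            gr.Textbox(label="Your question about the invoices"),
        ],
        outputs=[
            gr.Textbox(label="Extracted invoice text"),
            gr.Textbox(label="Answer"),
        ],
        title="Bulk Invoice Processing with Llama 3.2-11B",
    )
    demo.launch()

gradio_interface()
```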
Results
Let’s take a look at the results:
- First result:
- Here’s the second one:
- The third:
- The fourth and final result:
Summary
In conclusion, this guide provides a streamlined approach to using Llama 3.2-11B for bulk invoice processing. Whether you're new to AI or looking to integrate it into your workflow, we've covered the essential steps, from system setup to data extraction and result generation. By following this guide, you’ll be equipped to automate invoice processing efficiently, reducing manual errors and saving valuable time.
To get started with your own project, sign up for E2E Cloud today and launch a cloud GPU node, or head to TIR. E2E Cloud offers the most price-performant cloud GPUs in the Indian market, and enables developers to use advanced GPUs like H200, H100, and A100 for application development.