Introduction
Extracting structured data from unstructured documents has long been a central challenge in document analysis. The Table Transformer, an adaptation of the DETR (DEtection TRansformer) model developed by Microsoft Research and available in the Hugging Face Transformers library, marks a significant advance: by pairing a convolutional backbone with an encoder-decoder Transformer, it delivers strong performance on both table detection and table structure recognition in documents.
Overview
The Table Transformer model was introduced in the research paper ‘PubTables-1M: Towards comprehensive table extraction from unstructured documents’ by Brandon Smock, Rohith Pesala, and Robin Abraham. The paper presents a novel dataset, PubTables-1M, intended as a benchmark for table extraction from unstructured documents, covering table detection, table structure recognition, and functional analysis. The authors trained two DETR models within the Table Transformer framework: one for table detection and another for table structure recognition.
What Is DETR?
The groundbreaking DETR model was introduced in the paper ‘End-to-End Object Detection with Transformers’ by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. DETR comprises a convolutional backbone followed by an encoder-decoder Transformer and can be trained end-to-end for object detection. It dispenses with much of the complexity inherent in models such as Faster R-CNN and Mask R-CNN, which rely on techniques like region proposals, non-maximum suppression, and anchor generation. Furthermore, DETR can be extended to panoptic segmentation by adding a mask head on top of the decoder outputs.
Abstract of DETR
This novel method frames object detection as a direct set prediction problem, a departure from conventional approaches. By streamlining the detection pipeline, the model eliminates the need for many hand-crafted components, such as non-maximum suppression and anchor generation, that typically encode task-specific prior knowledge. At the core of the framework, named DEtection TRansformer (DETR), lie a set-based global loss that enforces unique predictions through bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed set of learned object queries, DETR reasons about the relations of objects and the global image context to output the final set of predictions in parallel. The model is conceptually simple and, unlike many contemporary detectors, does not require a specialized library. On the challenging COCO object detection dataset, DETR demonstrates accuracy and runtime performance comparable to the well-established and highly optimized Faster R-CNN baseline. Moreover, DETR generalizes straightforwardly to panoptic segmentation, significantly outperforming competitive baselines.
Abstract of TATR
The paper's abstract highlights recent progress in using machine learning for inferring and extracting table structures from unstructured documents. It acknowledges the challenge of creating large-scale datasets with accurate ground truth and introduces PubTables-1M as a solution. This dataset includes nearly one million tables from scientific articles, supporting various input formats and offering detailed header and location information for table structures. To enhance accuracy, it addresses ground truth inconsistencies observed in previous datasets by employing a novel canonicalization procedure. The research demonstrates the dataset's improvements in training and evaluating model performance for table structure recognition. Moreover, Transformer-based object detection models trained on PubTables-1M exhibit outstanding results across detection, structure recognition, and functional analysis without requiring specialized customization for these tasks.
Understanding the Table Transformer
Central to its design, the DETR model, originally conceived for object detection and panoptic segmentation, relies on a convolutional backbone such as ResNet-50 or ResNet-101, followed by an encoder-decoder Transformer. What sets it apart is its streamlined methodology: in contrast to predecessors like Faster R-CNN or Mask R-CNN, which depend on intricate mechanisms such as region proposals and anchor generation, DETR is trained end-to-end. This simplicity rests on its bipartite matching loss, which makes training and fine-tuning as straightforward as for BERT-style models.
Advantages Over OCR
For a long time, Optical Character Recognition (OCR) has served as the conventional means of document analysis. Nevertheless, the Table Transformer offers distinct benefits:
- Structure Recognition: While OCR is proficient at text extraction, the Table Transformer goes further: it not only extracts text but also recognizes and reconstructs table structures, preserving the relational context of the data.
- End-to-End Training: Traditional OCR pipelines often require multiple pre-processing steps and domain-specific adjustments. The Table Transformer's end-to-end training streamlines the workflow and reduces the need for complex preprocessing.
- Reduced Dependence on Templates: OCR-based extraction pipelines often rely on predetermined templates, which makes them rigid in the face of varying document layouts. The Table Transformer adapts to diverse document structures, which significantly improves its robustness.
Tutorial - Using TATR on E2E Cloud
If you require extra GPU resources for the tutorial ahead, you can explore the offerings on E2E Cloud. E2E provides a diverse selection of GPUs, making it a suitable choice for more advanced LLM-based applications.
To get one, head over to MyAccount and sign up. Then launch a GPU node from the dashboard.
Make sure you add your SSH keys during launch, or through the security tab after launching.
Once you have launched a node, you can use VSCode Remote Explorer to SSH into the node and use it as a local development environment.
Tutorial: Table Detection and Extraction with Table Transformer (TATR) and OCR
Set Up the Environment
Let's start by installing Hugging Face Transformers and EasyOCR (an open-source OCR engine).
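A minimal install, assuming a notebook environment (drop the leading ‘!’ in a shell). The timm package is included because some Table Transformer checkpoints use a timm backbone:

```python
# Install the libraries used in this tutorial
!pip install -q transformers easyocr timm torch torchvision
```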
Next, we load a Table Transformer pre-trained for table detection. We use the ‘no_timm’ version here to load the checkpoint with a Transformers-native backbone.
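A sketch of this step, using the ‘microsoft/table-transformer-detection’ checkpoint from the Hugging Face Hub together with its paired image processor:

```python
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# The 'no_timm' revision ships a Transformers-native (non-timm) backbone
image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection", revision="no_timm"
)
```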
We move the model to a GPU if it's available (predictions will be faster).
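For example:

```python
import torch

# Use a CUDA device when present, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```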
Next, we can load a PDF image.
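A sketch, assuming the PDF page has already been rendered to an image file; the filename below is a placeholder (a tool such as pdf2image can produce such an image from a PDF):

```python
from PIL import Image

# 'example_pdf_page.png' is a placeholder path; substitute your own rendered page
image = Image.open("example_pdf_page.png").convert("RGB")
```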
Preparing the image for the model can be done as follows:
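Using the processor loaded earlier:

```python
# The image processor resizes and normalizes the image and returns PyTorch tensors
encoding = image_processor(image, return_tensors="pt").to(device)
```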
Next, we forward the pixel values through the model. The model outputs logits of shape (batch_size, num_queries, num_labels + 1). The +1 is for the ‘no object’ class.
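For example:

```python
with torch.no_grad():
    outputs = model(**encoding)

# logits: (batch_size, num_queries, num_labels + 1); the extra class is 'no object'
print(outputs.logits.shape)
```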
Next, we take the prediction that has an actual class (i.e. not ‘no object’).
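One convenient way to do this (an assumption here; filtering the logits manually works too) is the processor's post_process_object_detection helper, which rescales boxes to the original image size and keeps only predictions whose best non-‘no object’ score clears a threshold:

```python
# Target size is (height, width) of the original image
target_sizes = torch.tensor([image.size[::-1]], device=device)
results = image_processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
print(results["scores"], results["labels"], results["boxes"])
```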
We can visualize the detection on the image.
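A sketch using PIL (output filename is arbitrary):

```python
from PIL import ImageDraw

detected = image.copy()
draw = ImageDraw.Draw(detected)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = box.tolist()
    draw.rectangle(box, outline="red", width=3)
    label_name = model.config.id2label[label.item()]
    draw.text((box[0], box[1] - 10), f"{label_name}: {score.item():.2f}", fill="red")
detected.save("detected_tables.png")
```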
Next, we crop the table out of the image.
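For example, cropping the first detected table with a small margin (the padding value is an arbitrary choice):

```python
box = results["boxes"][0].tolist()
padding = 10  # small margin so the table borders are not clipped
table_image = image.crop((
    max(int(box[0]) - padding, 0),
    max(int(box[1]) - padding, 0),
    min(int(box[2]) + padding, image.width),
    min(int(box[3]) + padding, image.height),
))
```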
Next, we load a Table Transformer pre-trained for table structure recognition.
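A sketch, assuming the structure-recognition checkpoint also publishes a ‘no_timm’ revision (if it does not, drop the revision argument and keep timm installed):

```python
structure_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition", revision="no_timm"
).to(device)
```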
We prepare the cropped table image for the model, and perform a forward pass.
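For simplicity this sketch reuses the detection image processor; strictly speaking, the structure model was trained with a slightly different resize configuration:

```python
structure_encoding = image_processor(table_image, return_tensors="pt").to(device)
with torch.no_grad():
    structure_outputs = structure_model(**structure_encoding)
```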
Next, we get the predicted detections.
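Again via post_process_object_detection (the lower threshold here is an arbitrary choice):

```python
structure_target_sizes = torch.tensor([table_image.size[::-1]], device=device)
cells = image_processor.post_process_object_detection(
    structure_outputs, threshold=0.6, target_sizes=structure_target_sizes
)[0]
# Labels include 'table row', 'table column', 'table column header', etc.
print(structure_model.config.id2label)
```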
We can visualize all recognized cells using PIL's ImageDraw module.
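For example:

```python
visualized = table_image.copy()
draw = ImageDraw.Draw(visualized)
for label, box in zip(cells["labels"], cells["boxes"]):
    draw.rectangle(box.tolist(), outline="red", width=2)
visualized.save("table_structure.png")
```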
An alternative way of plotting is to select one class to visualize, like ‘table row’:
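For example, drawing only the boxes whose label is ‘table row’:

```python
row_label_id = structure_model.config.label2id["table row"]
rows_only = table_image.copy()
draw = ImageDraw.Draw(rows_only)
for label, box in zip(cells["labels"], cells["boxes"]):
    if label.item() == row_label_id:
        draw.rectangle(box.tolist(), outline="blue", width=2)
rows_only.save("table_rows.png")
```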
Apply OCR Row by Row
First, we get the coordinates of the individual cells, row by row, by looking at the intersection of the rows and columns. Next, we apply OCR on each individual cell, row-by-row.
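A minimal sketch of both steps, running EasyOCR on each cell crop; the sorting keys and intersection logic are plain geometry, and no special API is assumed:

```python
import easyocr
import numpy as np

reader = easyocr.Reader(["en"])

def boxes_for(label_name):
    # Collect the boxes of one structure class as plain Python lists
    return [
        box.tolist()
        for label, box in zip(cells["labels"], cells["boxes"])
        if structure_model.config.id2label[label.item()] == label_name
    ]

rows = sorted(boxes_for("table row"), key=lambda b: b[1])        # top to bottom
columns = sorted(boxes_for("table column"), key=lambda b: b[0])  # left to right

data = []
for row in rows:
    row_text = []
    for col in columns:
        # A cell is the intersection of a row box and a column box
        cell = table_image.crop((int(col[0]), int(row[1]), int(col[2]), int(row[3])))
        texts = reader.readtext(np.array(cell), detail=0)
        row_text.append(" ".join(texts))
    data.append(row_text)
```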
Alternatively, one could also do OCR column by column, and so on.
We end up with a CSV file containing the data.
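For example, with the standard library's csv module (the output filename is arbitrary):

```python
import csv

with open("table.csv", "w", newline="") as f:
    csv.writer(f).writerows(data)
```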
Conclusion
In summary, the Table Transformer represents a significant advance in document analysis, particularly for PDFs containing intricate tables, and it changes how structured information is extracted from such documents.
Built on the DETR framework, the model does more than read text in PDFs: it identifies tables, reconstructs their detailed structures, and preserves them during extraction. Through its combination of a convolutional backbone and an encoder-decoder Transformer, it excels at both detecting tables and recognizing their structures.
Its advantages over traditional Optical Character Recognition (OCR) are manifold: it discerns and rebuilds table layouts, and its end-to-end training simplifies workflows and reduces reliance on inflexible templates. Its adaptability to varied document structures underscores its strength and flexibility.
As document analysis progresses, the Table Transformer promises efficiency, precision, and a more streamlined approach to extracting structured information from the vast collections of PDFs that hold valuable tabular data. Its impact extends beyond document processing, opening possibilities for broader applications in domains that depend on comprehensive information extraction from complex documents.
References
TATR Repo: https://github.com/microsoft/table-transformer
Research Paper: PubTables-1M: Towards comprehensive table extraction from unstructured documents
Research Paper: End-to-End Object Detection with Transformers