Introduction
Extracting structured data from unstructured documents has long been a central challenge in document analysis. The Table Transformer, an adaptation of the DETR (DEtection TRansformer) model developed by Microsoft Research and available in the Hugging Face Transformers library, marks a significant advance: by pairing a convolutional backbone with an encoder-decoder Transformer, it delivers strong performance on both table detection and table structure recognition in documents.
Overview
The Table Transformer model was introduced in the research paper ‘PubTables-1M: Towards comprehensive table extraction from unstructured documents’ by Brandon Smock, Rohith Pesala, and Robin Abraham. The paper presents a novel dataset, PubTables-1M, intended as a benchmark for table extraction from unstructured documents, covering table detection, table structure recognition, and functional analysis. The authors trained two DETR models within the Table Transformer framework: one for table detection and another for table structure recognition.
What Is DETR?
The groundbreaking DETR model was introduced in the paper ‘End-to-End Object Detection with Transformers’ by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. DETR comprises a convolutional backbone followed by an encoder-decoder Transformer and can be trained end-to-end for object detection. It dispenses with much of the complexity inherent in models such as Faster R-CNN and Mask R-CNN, which rely on techniques like region proposals, non-maximum suppression, and anchor generation. Furthermore, DETR can be extended to panoptic segmentation by adding a mask head on top of the decoder outputs.
Abstract of DETR
This novel method frames object detection as a direct set prediction problem, a departure from conventional approaches. By streamlining the detection pipeline, the model eliminates the need for many hand-crafted components, such as non-maximum suppression and anchor generation, that typically encode task-specific prior knowledge. At the core of the framework, named DEtection TRansformer (DETR), lie a set-based global loss that enforces unique predictions through bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed set of learned object queries, DETR reasons about the relations of objects and the global image context to output the final set of predictions in parallel. The model is conceptually simple and, unlike many contemporary detectors, does not require a specialized library. On the challenging COCO object detection dataset, DETR demonstrates accuracy and runtime performance comparable to the well-established and highly optimized Faster R-CNN baseline. Moreover, DETR generalizes straightforwardly to panoptic segmentation, significantly outperforming competitive baselines.
Abstract of TATR
The paper's abstract highlights recent progress in using machine learning for inferring and extracting table structures from unstructured documents. It acknowledges the challenge of creating large-scale datasets with accurate ground truth and introduces PubTables-1M as a solution. This dataset includes nearly one million tables from scientific articles, supporting various input formats and offering detailed header and location information for table structures. To enhance accuracy, it addresses ground truth inconsistencies observed in previous datasets by employing a novel canonicalization procedure. The research demonstrates the dataset's improvements in training and evaluating model performance for table structure recognition. Moreover, Transformer-based object detection models trained on PubTables-1M exhibit outstanding results across detection, structure recognition, and functional analysis without requiring specialized customization for these tasks.
Understanding the Table Transformer
Central to its design, the DETR model, originally conceived for object detection and panoptic segmentation, relies on a convolutional backbone such as ResNet-50 or ResNet-101, followed by an encoder-decoder Transformer. What sets it apart is its streamlined methodology: in contrast to predecessors like Faster R-CNN or Mask R-CNN, which depend on intricate mechanisms such as region proposals and anchor generation, DETR is trained end-to-end. This simplicity rests on its bipartite matching loss, which makes training and fine-tuning as straightforward as for BERT-style models.
Advantages Over OCR
For a long time, Optical Character Recognition (OCR) has served as the conventional means of document analysis. Nevertheless, the Table Transformer offers distinct benefits:
- Structure Recognition: While OCR is proficient at text extraction, the Table Transformer goes further: it not only extracts text but also recognizes and reconstructs table structures, preserving the relational context of the data.
- End-to-End Training: Traditional OCR pipelines often require multiple pre-processing steps and domain-specific adjustments. The Table Transformer's end-to-end training streamlines the workflow and reduces the need for complex preprocessing.
- Reduced Dependence on Templates: OCR-based extraction pipelines often rely on predetermined templates, which makes them rigid in the face of varying document layouts. The Table Transformer adapts to diverse document structures, which significantly improves its robustness.
Tutorial - Using TATR on E2E Cloud
If you require extra GPU resources for the tutorial ahead, you can explore the offerings on E2E Cloud. E2E provides a diverse selection of GPUs, making it a suitable choice for more advanced LLM-based applications.
To get one, head over to MyAccount and sign up. Then launch a GPU node from the dashboard.
Make sure you add your SSH keys during launch, or through the security tab after launching.
Once you have launched a node, you can use VSCode Remote Explorer to SSH into the node and use it as a local development environment.
Tutorial: Table Detection and Extraction with Table Transformer (TATR) and OCR
Set Up the Environment
Let's start by installing Hugging Face Transformers and EasyOCR (an open-source OCR engine).
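A minimal install, assuming a notebook environment (drop the leading ‘!’ in a shell). The timm package is included because some Table Transformer checkpoints use a timm backbone:

```python
# Install the libraries used in this tutorial
!pip install -q transformers easyocr timm torch torchvision
```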
Next, we load a Table Transformer pre-trained for table detection. We use the ‘no_timm’ version here to load the checkpoint with a Transformers-native backbone.
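A sketch of this step, using the ‘microsoft/table-transformer-detection’ checkpoint from the Hugging Face Hub together with its paired image processor:

```python
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# The 'no_timm' revision ships a Transformers-native (non-timm) backbone
image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection", revision="no_timm"
)
```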
We move the model to a GPU if it's available (predictions will be faster).
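For example:

```python
import torch

# Use a CUDA device when present, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```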
Next, we can load a PDF image.
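A sketch, assuming the PDF page has already been rendered to an image file; the filename below is a placeholder (a tool such as pdf2image can produce such an image from a PDF):

```python
from PIL import Image

# 'example_pdf_page.png' is a placeholder path; substitute your own rendered page
image = Image.open("example_pdf_page.png").convert("RGB")
```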
Preparing the image for the model can be done as follows:
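Using the processor loaded earlier:

```python
# The image processor resizes and normalizes the image and returns PyTorch tensors
encoding = image_processor(image, return_tensors="pt").to(device)
```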
Next, we forward the pixel values through the model. The model outputs logits of shape (batch_size, num_queries, num_labels + 1). The +1 is for the ‘no object’ class.
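For example:

```python
with torch.no_grad():
    outputs = model(**encoding)

# logits: (batch_size, num_queries, num_labels + 1); the extra class is 'no object'
print(outputs.logits.shape)
```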
Next, we take the prediction that has an actual class (i.e. not ‘no object’).
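One convenient way to do this (an assumption here; filtering the logits manually works too) is the processor's post_process_object_detection helper, which rescales boxes to the original image size and keeps only predictions whose best non-‘no object’ score clears a threshold:

```python
# Target size is (height, width) of the original image
target_sizes = torch.tensor([image.size[::-1]], device=device)
results = image_processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
print(results["scores"], results["labels"], results["boxes"])
```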
We can visualize the detection on the image.
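A sketch using PIL (output filename is arbitrary):

```python
from PIL import ImageDraw

detected = image.copy()
draw = ImageDraw.Draw(detected)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = box.tolist()
    draw.rectangle(box, outline="red", width=3)
    label_name = model.config.id2label[label.item()]
    draw.text((box[0], box[1] - 10), f"{label_name}: {score.item():.2f}", fill="red")
detected.save("detected_tables.png")
```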
Next, we crop the table out of the image.
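For example, cropping the first detected table with a small margin (the padding value is an arbitrary choice):

```python
box = results["boxes"][0].tolist()
padding = 10  # small margin so the table borders are not clipped
table_image = image.crop((
    max(int(box[0]) - padding, 0),
    max(int(box[1]) - padding, 0),
    min(int(box[2]) + padding, image.width),
    min(int(box[3]) + padding, image.height),
))
```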
Next, we load a Table Transformer pre-trained for table structure recognition.
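A sketch, assuming the structure-recognition checkpoint also publishes a ‘no_timm’ revision (if it does not, drop the revision argument and keep timm installed):

```python
structure_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition", revision="no_timm"
).to(device)
```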
We prepare the cropped table image for the model, and perform a forward pass.
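For simplicity this sketch reuses the detection image processor; strictly speaking, the structure model was trained with a slightly different resize configuration:

```python
structure_encoding = image_processor(table_image, return_tensors="pt").to(device)
with torch.no_grad():
    structure_outputs = structure_model(**structure_encoding)
```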
Next, we get the predicted detections.
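Again via post_process_object_detection (the lower threshold here is an arbitrary choice):

```python
structure_target_sizes = torch.tensor([table_image.size[::-1]], device=device)
cells = image_processor.post_process_object_detection(
    structure_outputs, threshold=0.6, target_sizes=structure_target_sizes
)[0]
# Labels include 'table row', 'table column', 'table column header', etc.
print(structure_model.config.id2label)
```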
We can visualize all recognized cells using PIL's ImageDraw module.
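For example:

```python
visualized = table_image.copy()
draw = ImageDraw.Draw(visualized)
for label, box in zip(cells["labels"], cells["boxes"]):
    draw.rectangle(box.tolist(), outline="red", width=2)
visualized.save("table_structure.png")
```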
An alternative way of plotting is to select one class to visualize, like ‘table row’:
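For example, drawing only the boxes whose label is ‘table row’:

```python
row_label_id = structure_model.config.label2id["table row"]
rows_only = table_image.copy()
draw = ImageDraw.Draw(rows_only)
for label, box in zip(cells["labels"], cells["boxes"]):
    if label.item() == row_label_id:
        draw.rectangle(box.tolist(), outline="blue", width=2)
rows_only.save("table_rows.png")
```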
Apply OCR Row by Row
First, we get the coordinates of the individual cells, row by row, by looking at the intersection of the rows and columns. Next, we apply OCR on each individual cell, row-by-row.
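A minimal sketch of both steps, running EasyOCR on each cell crop; the sorting keys and intersection logic are plain geometry, and no special API is assumed:

```python
import easyocr
import numpy as np

reader = easyocr.Reader(["en"])

def boxes_for(label_name):
    # Collect the boxes of one structure class as plain Python lists
    return [
        box.tolist()
        for label, box in zip(cells["labels"], cells["boxes"])
        if structure_model.config.id2label[label.item()] == label_name
    ]

rows = sorted(boxes_for("table row"), key=lambda b: b[1])        # top to bottom
columns = sorted(boxes_for("table column"), key=lambda b: b[0])  # left to right

data = []
for row in rows:
    row_text = []
    for col in columns:
        # A cell is the intersection of a row box and a column box
        cell = table_image.crop((int(col[0]), int(row[1]), int(col[2]), int(row[3])))
        texts = reader.readtext(np.array(cell), detail=0)
        row_text.append(" ".join(texts))
    data.append(row_text)
```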
Alternatively, one could also do OCR column by column, and so on.
We end up with a CSV file containing the data.
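For example, with the standard library's csv module (the output filename is arbitrary):

```python
import csv

with open("table.csv", "w", newline="") as f:
    csv.writer(f).writerows(data)
```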
Conclusion
In summary, the Table Transformer represents a significant advance in document analysis, particularly for PDFs containing intricate tables, and it changes how structured information is extracted from such documents.
Built on the DETR framework, the model does more than read text in PDFs: it identifies tables, reconstructs their detailed structures, and preserves them during extraction. Through its combination of a convolutional backbone and an encoder-decoder Transformer, it excels at both detecting tables and recognizing their structures.
Its advantages over traditional Optical Character Recognition (OCR) are manifold: it discerns and rebuilds table layouts, and its end-to-end training simplifies workflows and reduces reliance on inflexible templates. Its adaptability to varied document structures underscores its strength and flexibility.
As document analysis progresses, the Table Transformer promises efficiency, precision, and a more streamlined approach to extracting structured information from the vast collections of PDFs that hold valuable tabular data. Its impact extends beyond document processing, opening possibilities for broader applications in domains that depend on comprehensive information extraction from complex documents.
References
TATR Repo: https://github.com/microsoft/table-transformer
Research Paper: PubTables-1M: Towards comprehensive table extraction from unstructured documents
Research Paper: End-to-End Object Detection with Transformers