## Optimized Table Tokenization for Table Structure

Maksym Lysak

and Peter Staar

IBM Research

{mly,ahn,nli,cau,taa}@zurich.ibm.com

Abstract.

Keywords:

## 1

Tables are ubiquitous in documents such as scientific papers, patents, reports,

In modern document understanding systems [1,15], table extraction is typi-

Fig. 1.

<!-- image -->

today,

Recently

While the majority of research in TSR is currently focused on the develop-

The main contribution of this paper is the introduction of a new optimised ta-

The paper is structured as follows. In section 2, we give an overview of the

## 2

Approaches to formalize the logical structure and layout of tables in electronic

Other work [20] aims at predicting a grid for each table and deciding which cells

Within the

Im2Seq approaches have been shown to be well suited for the TSR task and allow a

## 3

All known Im2Seq based models for TSR fundamentally work in similar ways.

ulary and can be interpreted as a table structure. For example, with the HTML

Fig. 2.

<!-- image -->

Obviously, HTML and other general-purpose markup languages were not de-

Additionally, it would be desirable for the representation to easily allow

In a valid HTML table, the token sequence must describe a 2D grid of table

generation. Implicitly, this also means that Im2Seq models need to learn these
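The grid constraint can be made concrete with a small sketch. It assumes a simplified, hypothetical representation in which each row is a list of `(rowspan, colspan)` pairs rather than raw HTML tokens. The function name `html_grid_is_consistent` is illustrative and not taken from the paper. The check passes only if the spans tile a rectangle without overlaps or gaps.

```python
def html_grid_is_consistent(rows):
    """Return True if the (rowspan, colspan) cells tile a rectangular grid.

    `rows` is a list of table rows; each row is a list of (rowspan, colspan)
    tuples in reading order. This is an assumed stand-in for HTML tokens.
    """
    occupied = set()  # (row, col) positions already covered by some cell
    n_rows = len(rows)
    for r, row in enumerate(rows):
        c = 0
        for rowspan, colspan in row:
            # Skip positions already covered by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    pos = (r + dr, c + dc)
                    # Reject overlapping cells and spans past the last row.
                    if pos in occupied or r + dr >= n_rows:
                        return False
                    occupied.add(pos)
            c += colspan
    if not occupied:
        return False  # an empty table is treated as invalid here
    width = max(col for _, col in occupied) + 1
    # Every position of the n_rows x width rectangle must be covered.
    return len(occupied) == n_rows * width
```

For example, `html_grid_is_consistent([[(1, 2)], [(1, 1), (1, 1)]])` accepts a header cell spanning two columns above two ordinary cells, while a second row with only one cell is rejected because it leaves a gap.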

In practice, we observe two major issues with prediction quality when train-

## 4

To mitigate the issues with HTML in Im2Seq-based TSR models laid out before,

## 4.1

In Figure 3, we illustrate how the OTSL is defined. In essence, the OTSL defines

The OTSL vocabulary comprises the following tokens:

- -
- -
- -
- -
- -

A notable attribute of OTSL is that it has the capability of achieving lossless

Fig. 3.

<!-- image -->

## 4.2

The OTSL representation follows these syntax rules:

- 1.
- 2.

- 3.

- 4.
- 5.
- 6.
- The left neighbour of an "X" cell must be either another "X" cell or a "U"
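The left-neighbour constraint on "X" cells stated above can be checked mechanically. The following is a minimal sketch, assuming the table has already been split into rows of token strings. The function name is hypothetical, the "C" token in the usage example is only a placeholder for any other token, and none of the remaining rules are covered.

```python
def x_left_neighbour_ok(grid):
    """Check that every "X" token has an "X" or "U" token directly to its left.

    `grid` is a list of rows, each a list of token strings (assumed layout).
    Assumption: an "X" in the first column has no left neighbour and
    is therefore treated as a violation.
    """
    for row in grid:
        for col, token in enumerate(row):
            if token != "X":
                continue
            if col == 0 or row[col - 1] not in ("X", "U"):
                return False
    return True
```

With this sketch, `x_left_neighbour_ok([["C", "C"], ["U", "X"]])` passes, whereas `x_left_neighbour_ok([["C", "X"]])` fails because the "X" cell sits to the right of a token that is neither "X" nor "U".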

The application of these rules gives OTSL a set of unique properties. First

These characteristics can be easily learned by sequence generator networks,

reduces significantly the column drift seen in the HTML based models (see Fig-

## 4.3

The design of OTSL makes it easy to validate a table structure on an unfinished

## 5

To evaluate the impact of OTSL on prediction accuracy and inference times, we

Fig. 4.

<!-- image -->

We rely on standard metrics such as Tree Edit Distance score (TEDs) for

order to compute the TED score. Inference timing results for all experiments
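For reference, the TEDS metric is commonly defined as a normalized tree edit distance between a predicted table tree and its ground truth. The symbols $T_a$, $T_b$ and $\operatorname{EditDist}$ are notation chosen here for illustration and do not appear in this text:

$$
\operatorname{TEDS}(T_a, T_b) = 1 - \frac{\operatorname{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
$$

Here $|T|$ denotes the number of nodes in tree $T$, so a score of 1 corresponds to identical trees and lower scores indicate larger structural differences.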

## 5.1

We have chosen the PubTabNet data set to perform HPO, since it includes a

Table

## 5.2

We picked the model parameter configuration that produced the best prediction

Additionally,

Table 2.

## 5.3

To illustrate

Fig. 5.

<!-- image -->

Fig. 6.

<!-- image -->

## 6

We demonstrated that representing tables in HTML for the task of table struc-

First and foremost, given the same network configuration, inference time for

Secondly, OTSL has more inherent structure and a significantly restricted vo-

## References

- 1.
- 2.
- 3.
- 4.

- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.

- 18.
- 19.
- 20.
- 21.
- 22.
- 23.