Optimized Table Tokenization for Table Structure
Maksym Lysak
and Peter Staar
IBM Research
{mly,ahn,nli,cau,taa}@zurich.ibm.com
Abstract.
Keywords:
1
Tables are ubiquitous in documents such as scientific papers, patents, reports,
In modern document understanding systems [1,15], table extraction is typi-
Fig. 1.
today,
Recently
While the majority of research in TSR is currently focused on the develop-
The main contribution of this paper is the introduction of a new optimised ta-
The paper is structured as follows. In Section 2, we give an overview of the
2
Approaches to formalize the logical structure and layout of tables in electronic
Other work [20] aims at predicting a grid for each table and deciding which cells
Within the
Im2Seq approaches have been shown to be well-suited for the TSR task and allow a
3
All known Im2Seq-based models for TSR fundamentally work in similar ways.
ulary and can be interpreted as a table structure. For example, with the HTML
Fig. 2.
Obviously, HTML and other general-purpose markup languages were not de-
Additionally, it would be desirable for the representation to easily allow
In a valid HTML table, the token sequence must describe a 2D grid of table
generation. Implicitly, this also means that Im2Seq models need to learn these
In practice, we observe two major issues with prediction quality when train-
4
To mitigate the issues with HTML in Im2Seq-based TSR models laid out before,
4.1
In Figure 3, we illustrate how the OTSL is defined. In essence, the OTSL defines
The OTSL vocabulary comprises the following tokens:
A notable attribute of OTSL is that it can achieve lossless
Fig. 3.
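As an illustration of how an OTSL sequence maps back to HTML, the following minimal sketch converts a token list into a bare HTML table. It assumes the five-token vocabulary "C", "L", "U", "X" and "NL" (the token list is not reproduced in this text, so treat these names as an assumption), a rectangular and syntactically valid input, and empty cell content. The function name `otsl_to_html` is hypothetical and not taken from any published implementation.

```python
from typing import List


def otsl_to_html(tokens: List[str]) -> str:
    """Convert a valid, rectangular OTSL token sequence into bare HTML.

    Assumed token meanings: "C" starts a cell, "L" extends the cell to its
    left, "U" extends the cell above, "X" fills the interior of a 2D span,
    and "NL" ends a row.
    """
    # Split the flat sequence into rows at every "NL" token.
    rows, current = [], []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        else:
            current.append(tok)

    html = ["<table>"]
    for r, row in enumerate(rows):
        html.append("<tr>")
        for c, tok in enumerate(row):
            if tok != "C":
                continue  # "L", "U" and "X" belong to a cell emitted earlier
            # Count "L" tokens to the right to obtain the column span.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "L":
                colspan += 1
            # Count "U" tokens below to obtain the row span.
            rowspan = 1
            while r + rowspan < len(rows) and rows[r + rowspan][c] == "U":
                rowspan += 1
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            html.append(f"<td{attrs}></td>")
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)
```

For example, `otsl_to_html("C L NL C C NL".split())` yields a two-row table whose first row is a single cell spanning two columns.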
4.2
The OTSL representation follows these syntax rules:
3.
-
-
-
-
The left neighbour of an "X" cell must be either another "X" cell or a "U"
The application of these rules gives OTSL a set of unique properties. First
These characteristics can be easily learned by sequence generator networks,
significantly reduces the column drift seen in the HTML-based models (see Fig-
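The neighbour-based syntax rules lend themselves to a compact checker. The sketch below is one possible formulation, not the authors' implementation: it assumes the tokens "C", "L", "U", "X" and "NL", and it assumes that an "L" needs an "L" or "C" to its left, a "U" needs a "U" or "C" above, and an "X" needs an "X" or "U" to its left and an "X" or "L" above. Only the left-neighbour condition for "X" is stated in the surviving text; the other conditions are assumptions here. The name `validate_otsl` is hypothetical.

```python
from typing import List

CELL_TOKENS = {"C", "L", "U", "X"}


def validate_otsl(tokens: List[str]) -> List[str]:
    """Return a list of rule violations; an empty list means the table is valid."""
    # Split into rows; every row must be terminated by "NL".
    rows, current = [], []
    for tok in tokens:
        if tok == "NL":
            rows.append(current)
            current = []
        elif tok in CELL_TOKENS:
            current.append(tok)
        else:
            return [f"unknown token {tok!r}"]
    if current:
        return ["sequence does not end with NL"]
    if not rows or not rows[0]:
        return ["empty table"]

    # Rectangularity: every row must have the same number of cell tokens.
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        return ["rows have unequal length"]

    errors = []
    for r, row in enumerate(rows):
        for c, tok in enumerate(row):
            left = row[c - 1] if c > 0 else None      # None in the first column
            up = rows[r - 1][c] if r > 0 else None    # None in the first row
            if tok == "L" and left not in ("L", "C"):
                errors.append(f"({r},{c}): bad left neighbour for L")
            elif tok == "U" and up not in ("U", "C"):
                errors.append(f"({r},{c}): bad upper neighbour for U")
            elif tok == "X":
                if left not in ("X", "U"):
                    errors.append(f"({r},{c}): bad left neighbour for X")
                if up not in ("X", "L"):
                    errors.append(f"({r},{c}): bad upper neighbour for X")
    return errors
```

Because a missing neighbour is treated as a violation, the checker also rejects "U" and "X" in the first row and "L" and "X" in the first column without separate cases.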
4.3
The design of OTSL makes it easy to validate a table structure on an unfinished
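One way to exploit validation on an unfinished sequence is to compute, at each decoding step, which tokens could legally come next. The sketch below is a hypothetical helper (`allowed_next_tokens` is not a name from the paper) built on the same assumed vocabulary and neighbour rules as above, and it expects the prefix itself to be valid. A decoder could, for instance, compare its predicted token against this set to flag a likely prediction mistake.

```python
from typing import List, Set


def allowed_next_tokens(prefix: List[str]) -> Set[str]:
    """Return the tokens that may legally follow a valid OTSL prefix."""
    # Rebuild completed rows and the row currently being generated.
    rows, current = [], []
    for tok in prefix:
        if tok == "NL":
            rows.append(current)
            current = []
        else:
            current.append(tok)

    col = len(current)
    left = current[-1] if current else None

    if not rows:
        # First row: the table width is not fixed yet, and "U"/"X" are
        # excluded because they would need a neighbour above.
        allowed = {"C"}
        if left in ("L", "C"):
            allowed.add("L")
        if col > 0:
            allowed.add("NL")
        return allowed

    width = len(rows[0])
    if col > width:
        raise ValueError("prefix is longer than the table width")
    if col == width:
        return {"NL"}  # the row is complete and must be closed

    up = rows[-1][col]
    allowed = {"C"}
    if left in ("L", "C"):
        allowed.add("L")
    if up in ("U", "C"):
        allowed.add("U")
    if left in ("X", "U") and up in ("X", "L"):
        allowed.add("X")
    return allowed
```

Note that a prefix passing this check is only syntactically valid; it does not guarantee that the predicted structure matches the table image.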
5
To evaluate the impact of OTSL on prediction accuracy and inference times, we
Fig. 4.
We rely on standard metrics such as Tree Edit Distance score (TEDs) for
order to compute the TED score. Inference timing results for all experiments
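For reference, the tree-edit-distance-based similarity is commonly defined as shown below, following the definition introduced with the PubTabNet data set rather than anything restated in this text. Here $T_a$ and $T_b$ are the predicted and ground-truth tables represented as HTML trees, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in tree $T$. A score of 1 indicates identical trees.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```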
5.1
We have chosen the PubTabNet data set to perform HPO, since it includes a
Table
5.2
We picked the model parameter configuration that produced the best prediction
Additionally,
Table 2.
5.3
To illustrate
Fig. 5.
Fig. 6.
6
We demonstrated that representing tables in HTML for the task of table struc-
First and foremost, given the same network configuration, inference time for
Secondly, OTSL has more inherent structure and a significantly restricted vo-