## Optimized Table Tokenization for Table Structure

Maksym Lysak

and Peter Staar

IBM Research

{mly,ahn,nli,cau,taa}@zurich.ibm.com

Abstract.

Keywords:

## 1

Tables are ubiquitous in documents such as scientific papers, patents, reports,

In modern document understanding systems [1,15], table extraction is typi-

Fig. 1.

<!-- image -->

today,

Recently

While the majority of research in TSR is currently focused on the develop-

The main contribution of this paper is the introduction of a new optimised ta-

The paper is structured as follows. In section 2, we give an overview of the

## 2

Approaches to formalize the logical structure and layout of tables in electronic

Other work [20] aims at predicting a grid for each table and deciding which cells

Within the

Im2Seq approaches have shown to be well-suited for the TSR task and allow a

## 3

All known Im2Seq-based models for TSR fundamentally work in similar ways.

ulary and can be interpreted as a table structure. For example, with the HTML

Fig. 2.

<!-- image -->

Obviously, HTML and other general-purpose markup languages were not de-

Additionally, it would be desirable if the representation would easily allow

In a valid HTML table, the token sequence must describe a 2D grid of table

generation. Implicitly, this also means that Im2Seq models need to learn these

In practice, we observe two major issues with prediction quality when train-

## 4

To mitigate the issues with HTML in Im2Seq-based TSR models laid out before,

## 4.1

In Figure 3, we illustrate how the OTSL is defined. In essence, the OTSL defines

The OTSL vocabulary comprises the following tokens:

- -

A notable attribute of OTSL is that it has the capability of achieving lossless

Fig. 3.

<!-- image -->

## 4.2

The OTSL representation follows these syntax rules:

- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- The left neighbour of an "X" cell must be either another "X" cell or a "U"
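
The one rule that is fully legible above, the left-neighbour constraint on "X" cells, can be checked mechanically on a table written as a grid of tokens. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name and the list-of-lists grid layout are hypothetical, the tokens "C" and "L" in the sample grids are placeholders used only for illustration, and treating an "X" in the first column as a violation is this sketch's reading of the rule.

```python
from typing import List, Tuple

# "X" and "U" are taken from the rule quoted above. "C" and "L" in the sample
# grids below are assumed placeholder tokens, used only for illustration.
ALLOWED_LEFT_OF_X = {"X", "U"}


def left_neighbour_violations(grid: List[List[str]]) -> List[Tuple[int, int]]:
    """Return (row, column) positions of "X" cells whose left neighbour is invalid.

    Assumption of this sketch: an "X" in the first column has no left
    neighbour at all, so it is reported as a violation too.
    """
    violations = []
    for r, row in enumerate(grid):
        for c, token in enumerate(row):
            if token != "X":
                continue
            if c == 0 or row[c - 1] not in ALLOWED_LEFT_OF_X:
                violations.append((r, c))
    return violations


# Hypothetical sample grids: in the first, the "X" sits to the right of a "U";
# in the second, it sits to the right of a "C".
valid_grid = [
    ["C", "C", "L"],
    ["C", "U", "X"],
]
invalid_grid = [
    ["C", "C", "L"],
    ["C", "C", "X"],
]
```

Calling `left_neighbour_violations(invalid_grid)` reports the offending position, which a caller could use to reject or repair a predicted structure. The remaining rules are not legible in this copy and are therefore not covered.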

The application of these rules gives OTSL a set of unique properties. First

These characteristics can be easily learned by sequence generator networks,

significantly reduces the column drift seen in the HTML-based models (see Fig-

## 4.3

The design of OTSL makes it easy to validate a table structure on an unfinished
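
Validation on an unfinished sequence can be pictured as a check that runs token by token while the sequence is still being generated. The sketch below is illustrative only and rests on assumptions: it enforces just the left-neighbour rule for "X" cells quoted in Section 4.2, the function name is hypothetical, and `row_end="NL"` is an assumed row-separator token chosen for this example.

```python
from typing import Iterable, Optional


def first_violation(tokens: Iterable[str], row_end: str = "NL") -> Optional[int]:
    """Return the index of the first token that breaks the left-neighbour rule
    for "X" cells, or None if the (possibly unfinished) sequence is fine so far.

    `row_end` is an assumed row-separator token, chosen for this sketch.
    """
    previous = None  # last token seen in the current row; None at a row start
    for index, token in enumerate(tokens):
        if token == row_end:
            previous = None  # a new row begins, so there is no left neighbour yet
            continue
        if token == "X" and previous not in ("X", "U"):
            return index
        previous = token
    return None
```

Because the check needs only the previous token of the current row, it can be applied after every decoding step, and a violation can be flagged at the exact position where it occurs rather than after the whole table has been emitted.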
## 5

To evaluate the impact of OTSL on prediction accuracy and inference times, we

Fig. 4.

<!-- image -->

We rely on standard metrics such as Tree Edit Distance score (TEDs) for

order to compute the TED score. Inference timing results for all experiments

## 5.1

We have chosen the PubTabNet data set to perform HPO, since it includes a

Table

## 5.2

We picked the model parameter configuration that produced the best prediction

Additionally,

Table 2.

## 5.3

To illustrate

Fig. 5.

<!-- image -->

Fig. 6.

<!-- image -->

## 6

We demonstrated that representing tables in HTML for the task of table struc-

First and foremost, given the same network configuration, inference time for

Secondly, OTSL has more inherent structure and a significantly restricted vo-

## References

- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.