docling/tests/data/groundtruth/docling_v1/2305.03393v1.md
Christoph Auer e00f362405
Update tests, use TextCell.from_ocr property
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-13 16:04:08 +01:00


## Optimized Table Tokenization for Table Structure
Maksym Lysak
and Peter Staar
IBM Research
{mly,ahn,nli,cau,taa}@zurich.ibm.com
Abstract.
Keywords:
## 1
Tables are ubiquitous in documents such as scientific papers, patents, reports,
In modern document understanding systems [1,15], table extraction is typi-
Fig. 1.
<!-- image -->
today,
Recently
While the majority of research in TSR is currently focused on the develop-
The main contribution of this paper is the introduction of a new optimised ta-
The paper is structured as follows. In section 2, we give an overview of the
## 2
Approaches to formalize the logical structure and layout of tables in electronic
Other work [20] aims at predicting a grid for each table and deciding which cells
Within the
Im2Seq approaches have shown to be well-suited for the TSR task and allow a
## 3
All known Im2Seq based models for TSR fundamentally work in similar ways.
ulary and can be interpreted as a table structure. For example, with the HTML
Fig. 2.
<!-- image -->
Obviously, HTML and other general-purpose markup languages were not de-
Additionally, it would be desirable if the representation would easily allow
In a valid HTML table, the token sequence must describe a 2D grid of table
generation. Implicitly, this also means that Im2Seq models need to learn these
In practice, we observe two major issues with prediction quality when train-
## 4
To mitigate the issues with HTML in Im2Seq-based TSR models laid out before,
## 4.1
In Figure 3, we illustrate how the OTSL is defined. In essence, the OTSL defines
The OTSL vocabulary is comprised of the following tokens:
- -
A notable attribute of OTSL is that it has the capability of achieving lossless
Fig. 3.
<!-- image -->
## 4.2
The OTSL representation follows these syntax rules:
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- The left neighbour of an "X" cell must be either another "X" cell or a "U"
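The left-neighbour rule in the last item lends itself to a mechanical check. The following is a minimal sketch, assuming a table is held as a list of rows of token strings. The function name `x_left_neighbours_ok` and the placeholder token `"C"` are hypothetical, and only the part of the rule stated above is checked. An "X" cell in the first column has no left neighbour and is skipped here, which is also an assumption.

```python
def x_left_neighbours_ok(grid):
    """Check the left-neighbour rule for "X" cells on a grid of tokens.

    `grid` is a list of rows, each row a list of token strings.
    Returns True if every "X" cell that has a left neighbour
    has either "X" or "U" in that position.
    """
    for row in grid:
        for col, token in enumerate(row):
            if token != "X" or col == 0:
                # Not an "X" cell, or no left neighbour to inspect (assumption).
                continue
            if row[col - 1] not in ("X", "U"):
                return False
    return True
```

Because the check looks only at a cell and the cell immediately to its left, it runs in a single pass over the grid and needs no lookahead.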
The application of these rules gives OTSL a set of unique properties. First
These characteristics can be easily learned by sequence generator networks,
reduces significantly the column drift seen in the HTML based models (see Fig-
## 4.3
The design of OTSL allows to validate a table structure easily on an unfinished
## 5
To evaluate the impact of OTSL on prediction accuracy and inference times, we
Fig. 4.
<!-- image -->
We rely on standard metrics such as Tree Edit Distance score (TEDs) for
order to compute the TED score. Inference timing results for all experiments
## 5.1
We have chosen the PubTabNet data set to perform HPO, since it includes a
Table
## 5.2
We picked the model parameter configuration that produced the best prediction
Additionally,
Table 2.
## 5.3
To illustrate
Fig. 5.
<!-- image -->
Fig. 6.
<!-- image -->
## 6
We demonstrated that representing tables in HTML for the task of table struc-
First and foremost, given the same network configuration, inference time for
Secondly, OTSL has more inherent structure and a significantly restricted vo-
## References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.