Optimized Table Tokenization for Table Structure

Maksym Lysak

and Peter Staar

IBM Research

{mly,ahn,nli,cau,taa}@zurich.ibm.com

Abstract.

Keywords:

1

Tables are ubiquitous in documents such as scientific papers, patents, reports,

In modern document understanding systems [1,15], table extraction is typi-

Fig. 1.

today,

Recently

While the majority of research in TSR is currently focused on the develop-

The main contribution of this paper is the introduction of a new optimised ta-

The paper is structured as follows. In section 2, we give an overview of the

2

Approaches to formalize the logical structure and layout of tables in electronic

Other work [20] aims at predicting a grid for each table and deciding which cells

Within the

Im2Seq approaches have been shown to be well suited for the TSR task and allow a

3

All known Im2Seq-based models for TSR fundamentally work in similar ways.

ulary and can be interpreted as a table structure. For example, with the HTML

Fig. 2.

Obviously, HTML and other general-purpose markup languages were not de-

Additionally, it would be desirable if the representation easily allowed

In a valid HTML table, the token sequence must describe a 2D grid of table

generation. Implicitly, this also means that Im2Seq models need to learn these

In practice, we observe two major issues with prediction quality when train-

4

To mitigate the issues with HTML in Im2Seq-based TSR models laid out before,

4.1

In Figure 3, we illustrate how the OTSL is defined. In essence, the OTSL defines

The OTSL vocabulary comprises the following tokens:

A notable attribute of OTSL is that it has the capability of achieving lossless

Fig. 3.

4.2

The OTSL representation follows these syntax rules:

3.

The left neighbour of an "X" cell must be either another "X" cell or a "U"
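The neighbour constraint stated above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes the table is already available as a rectangular grid of token strings, and the function name `cross_rule_violations` is hypothetical. Only the "X" and "U" tokens are taken from the rule text; the other tokens in the example grids ("C", "L") are placeholders for illustration. Treating an "X" in the first column as a violation is also an assumption.

```python
from typing import List, Tuple


def cross_rule_violations(grid: List[List[str]]) -> List[Tuple[int, int]]:
    """Return (row, col) positions of "X" cells whose left neighbour
    is neither "X" nor "U"."""
    violations = []
    for r, row in enumerate(grid):
        for c, token in enumerate(row):
            if token != "X":
                continue
            # Assumption: an "X" in the first column has no left
            # neighbour at all and is therefore reported as well.
            left = row[c - 1] if c > 0 else None
            if left not in ("X", "U"):
                violations.append((r, c))
    return violations


# Placeholder grids: "C" and "L" are illustrative tokens only.
ok_grid = [["C", "L"], ["U", "X"]]
bad_grid = [["C", "C"], ["C", "X"]]
print(cross_rule_violations(ok_grid))
print(cross_rule_violations(bad_grid))
```

Returning the offending positions, rather than a single boolean, makes it easier to see where a predicted structure breaks the rule.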

The application of these rules gives OTSL a set of unique properties. First

These characteristics can be easily learned by sequence generator networks,

significantly reduces the column drift seen in the HTML-based models (see Fig-

4.3

The design of OTSL makes it easy to validate a table structure on an unfinished
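To illustrate what validation on an unfinished sequence could look like, here is a hedged sketch that scans a token stream as it is produced and reports the index of the first token that breaks the left-neighbour rule for "X" cells. It is an illustration under stated assumptions, not the method used in the paper: the row delimiter `ROW_END = "NL"` and the function name `first_violation` are hypothetical, and only one rule is checked.

```python
from typing import Iterable, Optional

# Hypothetical row delimiter, chosen for this sketch only.
ROW_END = "NL"


def first_violation(tokens: Iterable[str]) -> Optional[int]:
    """Return the index of the first "X" token whose left neighbour in
    the same row is neither "X" nor "U", or None if the prefix is clean."""
    prev = None  # previous token in the current row, None at row start
    for i, tok in enumerate(tokens):
        if tok == ROW_END:
            prev = None  # the next token starts a new row
            continue
        if tok == "X" and prev not in ("X", "U"):
            return i
        prev = tok
    return None


# Works on a partial sequence: the second row is not finished yet.
partial = ["C", "L", ROW_END, "U", "X"]
print(first_violation(partial))
```

Because this check only needs the previous token of the current row, it can run after every generated token without waiting for the table to be complete.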

5

To evaluate the impact of OTSL on prediction accuracy and inference times, we

Fig. 4.

We rely on standard metrics such as Tree Edit Distance score (TEDs) for

order to compute the TED score. Inference timing results for all experiments

5.1

We have chosen the PubTabNet data set to perform HPO, since it includes a

Table

5.2

We picked the model parameter configuration that produced the best prediction

Additionally,

Table 2.

5.3

To illustrate

Fig. 5.

Fig. 6.

6

We demonstrated that representing tables in HTML for the task of table struc-

First and foremost, given the same network configuration, inference time for

Secondly, OTSL has more inherent structure and a significantly restricted vo-


References
