Mirror of https://github.com/DS4SD/docling.git, synced 2025-12-08 20:58:11 +00:00
feat: Expose equation exports (#869)
* pin new docling-core and exploit it via assembler changes
* update test results
* update with docling-core release

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
@@ -52,11 +52,11 @@ To meet the design criteria listed above, we developed a new model called TableF
The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe
its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
## 2. Previous work and State of the Art
Identifying the structure of a table has been an outstanding problem in the document-parsing community, one that has motivated many organised public challenges [6, 4, 14]. The difficulty of the problem can be attributed to a number of factors. First, there is a large variety in the shapes and sizes of tables; such variety requires a flexible method. This is especially true for complex column and row headers, which can be extremely intricate and demanding. A second factor of complexity is the lack of data with regard to table structure. Until the publication of PubTabNet [37], there were no large datasets (i.e. > 100 K tables) that provided structure information, primarily because tables are notoriously time-consuming to annotate by hand. However, this has definitely changed in recent years with the release of PubTabNet [37], FinTabNet [36], TableBank [17], etc.
Before the rise in popularity of deep neural networks, the community relied heavily on heuristic and/or statistical methods for table-structure identification [3, 7, 11, 5, 13, 28]. Although such methods work well on constrained tables [12], the advent of convolutional neural networks (CNNs) and the availability of large datasets have enabled more data-driven approaches. To the best of our knowledge, there are currently two different types of network architecture being pursued for state-of-the-art table-structure identification.
@@ -115,7 +115,7 @@ Given the image of a table, TableFormer is able to predict: 1) a sequence of tok
## 4.1. Model architecture.
We now describe in detail the proposed method, which is composed of three main components, see Fig. 4. Our CNN Backbone Network encodes the input as a feature vector of predefined length. The input feature vector of the encoded image is passed to the Structure Decoder to produce a sequence of HTML tags that represent the structure of the table. With each prediction of an HTML standard data cell ('<td>') the hidden state of that cell is passed to the Cell BBox Decoder. As for spanning cells, such as row or column span, the tag is broken down to '<', 'rowspan=' or 'colspan=', with the number of spanning cells (attribute), and '>'. The hidden state attached to '<' is passed to the Cell BBox Decoder. A shared feed forward network (FFN) receives the hidden states from the Structure Decoder, to provide the final detection predictions of the bounding box coordinates and their classification.
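To make this data flow concrete, here is a minimal PyTorch sketch of the three components as described above; the module names, dimensions, tag vocabulary, and the choice to apply the box head at every tag position are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision


class TableFormerSketch(nn.Module):
    """Illustrative three-component pipeline: backbone -> structure -> boxes."""

    def __init__(self, vocab_size=64, d_model=256):
        super().__init__()
        # CNN Backbone Network: ResNet-18 with its pooling and linear head removed.
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)
        # Structure Decoder: transformer encoder/decoder over tokenized HTML tags.
        self.tag_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.tag_head = nn.Linear(d_model, vocab_size)
        # Cell BBox Decoder: shared FFN producing (cx, cy, w, h) per tag position;
        # in the paper only the '<td>' / '<' positions serve as cell queries.
        self.bbox_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4)
        )

    def forward(self, image, tag_tokens):
        feats = self.proj(self.backbone(image))        # B x d_model x H x W
        memory = feats.flatten(2).transpose(1, 2)      # B x (H*W) x d_model
        hidden = self.transformer(memory, self.tag_embed(tag_tokens))
        tag_logits = self.tag_head(hidden)             # structure tag predictions
        cell_boxes = self.bbox_head(hidden).sigmoid()  # normalized box per position
        return tag_logits, cell_boxes


model = TableFormerSketch()
logits, boxes = model(torch.randn(1, 3, 448, 448), torch.randint(0, 64, (1, 32)))
```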
CNN Backbone Network. A ResNet-18 CNN is the backbone that receives the table image and encodes it as a vector of predefined length. The network has been modified by removing the linear and pooling layer, as we are not per-
@@ -123,7 +123,7 @@ Figure 3: TableFormer takes in an image of the PDF and creates bounding box and
<!-- image -->
Figure 4: Given an input image of a table, the Encoder produces fixed-length features that represent the input image. The features are then passed to both the Structure Decoder and Cell BBox Decoder. During training, the Structure Decoder receives 'tokenized tags' of the HTML code that represent the table structure. Afterwards, a transformer encoder and decoder architecture is employed to produce features that are received by a linear layer, and the Cell BBox Decoder. The linear layer is applied to the features to predict the tags. Simultaneously, the Cell BBox Decoder selects features referring to the data cells ('<td>', '<') and passes them through an attention network, an MLP, and a linear layer to predict the bounding boxes.
<!-- image -->
@@ -133,7 +133,7 @@ Structure Decoder. The transformer architecture of this component is based on th
The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, together with the tokenized HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence can be inferred. This is achieved by each attention head of a layer operating in a different subspace, and then combining their attention scores.
Cell BBox Decoder. Our architecture allows us to simultaneously predict HTML tags and bounding boxes for each table cell end to end, without the need for a separate object detector. This approach is inspired by DETR [1], which employs a Transformer Encoder and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden states of the '<td>' and '<' HTML structure tags become the object queries.
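As a small illustration of this object-query idea, the decoder hidden states at the '<td>' and '<' positions can be gathered and fed to a box head. The token ids and the head below are assumptions made for the sake of the example, not the paper's exact design.

```python
import torch
import torch.nn as nn

TD_TOKEN, OPEN_TOKEN = 3, 4  # assumed vocabulary ids of '<td>' and '<'


def predict_cell_boxes(hidden, tags, bbox_head):
    """Use hidden states at cell-tag positions as object queries (DETR-style).

    hidden: (T, d_model) decoder states, tags: (T,) predicted tag ids.
    """
    is_cell = (tags == TD_TOKEN) | (tags == OPEN_TOKEN)
    queries = hidden[is_cell]              # one query per data cell
    return bbox_head(queries).sigmoid()    # (num_cells, 4) normalized boxes


bbox_head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 4))
boxes = predict_cell_boxes(torch.randn(12, 256), torch.randint(0, 8, (12,)), bbox_head)
print(boxes.shape)  # torch.Size([num_cells, 4])
```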
The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-
@@ -145,9 +145,9 @@ Loss Functions. We formulate a multi-task loss Eq. 2 to train our network. The C
The loss used to train TableFormer can be defined as follows:
$$l_{box} = \lambda_{iou} \, l_{iou} + \lambda_{l_{1}} \, l_{1}, \qquad l = \lambda \, l_{s} + (1 - \lambda) \, l_{box} \quad (1)$$
where λ ∈ [0, 1], and λ$_{iou}$, λ$_{l_{1}}$ ∈ ℝ are hyper-parameters.
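A small sketch of how the multi-task loss in Eq. (1) could be assembled follows; the use of torchvision's generalized-IoU loss and an L1 box loss mirrors DETR-style training and is an assumption about the concrete ingredients, as are the default weights.

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss


def tableformer_loss(tag_logits, tag_targets, pred_boxes, target_boxes,
                     lam=0.5, lam_iou=0.5, lam_l1=0.5):
    """l = lambda * l_s + (1 - lambda) * l_box, as in Eq. (1)."""
    # l_s: cross-entropy over the predicted structure tags.
    l_s = F.cross_entropy(tag_logits, tag_targets)
    # l_box = lambda_iou * l_iou + lambda_l1 * l_1 (boxes in (x0, y0, x1, y1)).
    l_iou = generalized_box_iou_loss(pred_boxes, target_boxes, reduction="mean")
    l_l1 = F.l1_loss(pred_boxes, target_boxes)
    l_box = lam_iou * l_iou + lam_l1 * l_l1
    return lam * l_s + (1 - lam) * l_box
```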
## 5. Experimental Results
@@ -155,7 +155,7 @@ where λ ∈ [0, 1], and λ$\_{iou}$, λ$\_{l}$$_{1}$ ∈$_{R}$ are hyper-parame
TableFormer uses ResNet-18 as the CNN Backbone Network. The input images are resized to 448×448 pixels and the feature map has a dimension of 28×28. Additionally, we enforce the following input constraints:
$$\text{Image width and height} \leq 1024 \text{ pixels}, \qquad \text{Structural tags length} \leq 512 \text{ tokens} \quad (2)$$
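A minimal sketch of checking the constraints in Eq. (2) before a sample is used; the function name and the tokenized-tag input are illustrative assumptions.

```python
from typing import Sequence


def satisfies_input_constraints(image_width: int, image_height: int,
                                structural_tags: Sequence[str]) -> bool:
    """Eq. (2): image sides <= 1024 pixels and tag sequence <= 512 tokens."""
    return (
        image_width <= 1024
        and image_height <= 1024
        and len(structural_tags) <= 512
    )


print(satisfies_input_constraints(
    900, 600, ["<table>", "<tr>", "<td>", "</td>", "</tr>", "</table>"]))
```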
Although input constraints are used also by other methods, such as EDD, ours are less restrictive due to the improved
@@ -177,7 +177,7 @@ We also share our baseline results on the challenging SynthTabNet dataset. Throu
The Tree-Edit-Distance-Based Similarity (TEDS) metric was introduced in [37]. It represents the prediction and the ground truth as tree structures of HTML tags. This similarity is calculated as:
$$\mathrm{TEDS}(T_{a}, T_{b}) = 1 - \frac{\mathrm{EditDist}(T_{a}, T_{b})}{\max(|T_{a}|, |T_{b}|)} \quad (3)$$
where T$_{a}$ and T$_{b}$ represent tables in tree-structure HTML format, EditDist denotes the tree-edit distance, and |T| represents the number of nodes in T.
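A sketch of how TEDS in Eq. (3) could be computed from two table-HTML strings, assuming the third-party `zss` package for the Zhang-Shasha tree edit distance; it compares tag structure only, whereas the full metric in [37] also accounts for cell content.

```python
from html.parser import HTMLParser

from zss import Node, simple_distance  # pip install zss (assumed dependency)


class _TagTreeBuilder(HTMLParser):
    """Builds a tree of HTML tag nodes, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].addkid(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()


def _to_tree(html: str) -> Node:
    builder = _TagTreeBuilder()
    builder.feed(html)
    return builder.root


def _num_nodes(node: Node) -> int:
    return 1 + sum(_num_nodes(child) for child in node.children)


def teds(html_a: str, html_b: str) -> float:
    """TEDS(T_a, T_b) = 1 - EditDist(T_a, T_b) / max(|T_a|, |T_b|)."""
    tree_a, tree_b = _to_tree(html_a), _to_tree(html_b)
    dist = simple_distance(tree_a, tree_b)
    return 1.0 - dist / max(_num_nodes(tree_a), _num_nodes(tree_b))


print(teds("<table><tr><td></td></tr></table>",
           "<table><tr><td></td><td></td></tr></table>"))
```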
@@ -277,7 +277,7 @@ Figure 6: An example of TableFormer predictions (bounding boxes and structure) f
We showcase several visualizations for the different components of our network on various "complex" tables within the datasets presented in this work in Fig. 5 and Fig. 6. As shown, our model is able to predict bounding boxes for all table cells, even the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 also demonstrates the adaptability of our method to any language, as it can successfully extract Japanese text although the training set contains only English content. We provide more visualizations, including the intermediate steps, in the supplementary material. Overall, these illustrations demonstrate the versatility of our method across a diverse range of table appearances and content types.
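As a sketch of the matching step mentioned above, predicted boxes can be assigned to PDF text cells by intersection-over-union; the box format, threshold, and greedy strategy are assumptions for illustration, not the exact post-processing.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_cell_content(pred_boxes, pdf_cells, threshold=0.5):
    """Greedily give each predicted cell box the text of the best-overlapping PDF cell.

    pred_boxes: list of (x0, y0, x1, y1); pdf_cells: list of (box, text) pairs.
    """
    matches = {}
    for i, pred in enumerate(pred_boxes):
        scores = [iou(pred, box) for box, _ in pdf_cells]
        if scores and max(scores) >= threshold:
            matches[i] = pdf_cells[scores.index(max(scores))][1]
    return matches


print(match_cell_content([(0, 0, 10, 10)], [((1, 1, 9, 9), "42")]))  # {0: '42'}
```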
## 6. Future Work & Conclusion
In this paper, we presented TableFormer, an end-to-end transformer-based approach to predict table structures and bounding boxes of cells from an image. This approach enables us to recreate the table structure and to extract the cell content from PDF or OCR output by using the bounding boxes. Additionally, it provides the versatility required in real-world scenarios when dealing with various types of PDF documents and languages. Furthermore, our method outperforms all state-of-the-art methods by a wide margin. Finally, we introduce "SynthTabNet", a challenging synthetically generated dataset that provides characteristics missing from the other datasets.
@@ -377,7 +377,7 @@ Here is a step-by-step description of the prediction postprocessing:
- 3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.
- 4. Find the best-fitting content alignment for the predicted cells with good IOU for each column. The alignment of the column can be identified by the following formula:
$$\mathrm{alignment} = \underset{c}{\arg\min} \, \{ D_{c} \}, \qquad D_{c} = \max\{x_{c}\} - \min\{x_{c}\} \quad (4)$$
where c is one of {left, centroid, right} and x$_{c}$ is the x-coordinate of the corresponding point.
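A minimal sketch of the column-alignment selection in Eq. (4); the (x0, x1) cell-box representation is an illustrative assumption.

```python
from typing import Iterable, Tuple


def column_alignment(cell_boxes: Iterable[Tuple[float, float]]) -> str:
    """Pick 'left', 'centroid', or 'right' for one column of (x0, x1) cell spans."""
    boxes = list(cell_boxes)
    x_coords = {
        "left": [x0 for x0, _ in boxes],
        "centroid": [(x0 + x1) / 2 for x0, x1 in boxes],
        "right": [x1 for _, x1 in boxes],
    }
    # D_c = max{x_c} - min{x_c}; the best-fitting alignment minimizes D_c.
    spreads = {c: max(xs) - min(xs) for c, xs in x_coords.items()}
    return min(spreads, key=spreads.get)


# A column whose cells share a left edge comes out as left-aligned.
print(column_alignment([(10.0, 60.0), (10.2, 45.0), (9.8, 80.0)]))  # 'left'
```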