feat: Use new TableFormer model weights and default to accurate model version (#1100)

* feat: New tableformer model weights [WIP]

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Updated TF version

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated tests after merging with main; switched to the accurate TableFormer model by default

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Christoph Auer authored on 2025-03-11 10:53:49 +01:00 · committed by GitHub
parent 5e30381c0d · commit eb97357b05
43 changed files with 213 additions and 229 deletions
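
With this change the accurate TableFormer variant is used unless a mode is set explicitly. Below is a minimal sketch of selecting the mode through docling's pipeline options (`PdfPipelineOptions` / `TableFormerMode`); the input file name is a placeholder.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable table-structure recovery; ACCURATE is the new default after this
# commit, FAST remains available for lower-latency runs.
pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# "report_with_tables.pdf" is a placeholder input document.
result = converter.convert("report_with_tables.pdf")
print(result.document.export_to_markdown())
```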


@@ -126,14 +126,13 @@ We have chosen the PubTabNet data set to perform HPO, since it includes a highly
Table 1. HPO performed in OTSL and HTML representation on the same transformer-based TableFormer [9] architecture, trained only on PubTabNet [22]. Effects of reducing the # of layers in encoder and decoder stages of the model show that smaller models trained on OTSL perform better, especially in recognizing complex table structures, and maintain a much higher mAP score than the HTML counterpart.
-| # | # | Language | TEDs | TEDs | TEDs | mAP | Inference |
-|------------|------------|------------|-------------|-------------|-------------|-------------|-------------|
-| enc-layers | dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
-| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
-| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 | 0.927 | 0.853 | 1.97 |
-| 2 | 4 | OTSL | 0.923 0.945 | 0.909 0.897 | 0.938 | 0.843 | 3.77 |
-| | | HTML | | 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
-| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
+| # enc-layers | # dec-layers | Language | TEDs | TEDs | TEDs | mAP | Inference |
+|----------------|----------------|------------|-------------|-------------|-------------|-------------|-------------|
+| # enc-layers | # dec-layers | Language | simple | complex | all | (0.75) | time (secs) |
+| 6 | 6 | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
+| 4 | 4 | OTSL HTML | 0.938 0.952 | 0.904 0.909 | 0.927 0.938 | 0.853 0.843 | 1.97 3.77 |
+| 2 | 4 | OTSL HTML | 0.923 0.945 | 0.897 0.901 | 0.915 0.931 | 0.859 0.834 | 1.91 3.81 |
+| 4 | 2 | OTSL HTML | 0.952 0.944 | 0.92 0.903 | 0.942 0.931 | 0.857 0.824 | 1.22 2 |
## 5.2 Quantitative Results
@@ -143,15 +142,12 @@ Additionally, the results show that OTSL has an advantage over HTML when applied
Table 2. TSR and cell detection results compared between OTSL and HTML on the PubTabNet [22], FinTabNet [21] and PubTables-1M [14] data sets using TableFormer [9] (with enc=6, dec=6, heads=8).
-| | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
-|--------------|------------|--------|---------|--------|-------------|-------------------------|
-| | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
-| PubTabNet | OTSL | 0.965 | 0.934 | 0.955 | 0.88 | 2.73 |
-| PubTabNet | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 5.39 |
-| FinTabNet | OTSL | 0.955 | 0.961 | 0.959 | 0.862 | 1.85 |
-| FinTabNet | HTML | 0.917 | 0.922 | 0.92 | 0.722 | 3.26 |
-| PubTables-1M | OTSL | 0.987 | 0.964 | 0.977 | 0.896 | 1.79 |
-| PubTables-1M | HTML | 0.983 | 0.944 | 0.966 | 0.889 | 3.26 |
+| Data set | Language | TEDs | TEDs | TEDs | mAP(0.75) | Inference time (secs) |
+|--------------|------------|-------------|-------------|-------------|-------------|-------------------------|
+| Data set | Language | simple | complex | all | mAP(0.75) | Inference time (secs) |
+| PubTabNet | OTSL HTML | 0.965 0.969 | 0.934 0.927 | 0.955 0.955 | 0.88 0.857 | 2.73 5.39 |
+| FinTabNet | OTSL HTML | 0.955 0.917 | 0.961 0.922 | 0.959 0.92 | 0.862 0.722 | 1.85 3.26 |
+| PubTables-1M | OTSL HTML | 0.987 0.983 | 0.964 0.944 | 0.977 0.966 | 0.896 0.889 | 1.79 3.26 |
## 5.3 Qualitative Results