Updated tests

Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
2025-07-26 20:14:47 +00:00 · 2024-10-03 16:36:12 +02:00
commit a614710aa3
parent aba833ab56
21 changed files with 53 additions and 28 deletions
--- a/poetry.lock
+++ b/poetry.lock
@@ -991,7 +991,17 @@ torchvision = [
    {version = ">=0,<1", markers = "sys_platform != \"darwin\" or platform_machine != \"x86_64\""},
    {version = ">=0.17.2,<0.18.0", markers = "sys_platform == \"darwin\" and platform_machine == \"x86_64\""},
 ]
+<<<<<<< HEAD
 tqdm = ">=4.64.0,<5.0.0"
+=======
+tqdm = "^4.64.0"
+
+[package.source]
+type = "git"
+url = "https://github.com/DS4SD/docling-ibm-models.git"
+reference = "e92c3cef733d138da4d9e57f55750143b68c0f02"
+resolved_reference = "e92c3cef733d138da4d9e57f55750143b68c0f02"
+>>>>>>> 44dcf83 (Updated tests)

 [[package]]
 name = "docling-parse"
--- a/tests/data/2203.01017v2.doctags.txt
+++ b/tests/data/2203.01017v2.doctags.txt
@@ -1,4 +1,5 @@
 <document>
+<subtitle-level-1><location><page_1><loc_16><loc_85><loc_82><loc_87></location>TableFormer: Table Structure Understanding with Transformers.</subtitle-level-1>
 <paragraph><location><page_1><loc_23><loc_78><loc_74><loc_82></location>Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research</paragraph>
 <paragraph><location><page_1><loc_34><loc_77><loc_62><loc_78></location>{ ahn,nli,mly,taa } @zurich.ibm.com</paragraph>
 <subtitle-level-1><location><page_1><loc_24><loc_71><loc_31><loc_73></location>Abstract</subtitle-level-1>
--- a/tests/data/2203.01017v2.json
+++ b/tests/data/2203.01017v2.json
--- a/tests/data/2203.01017v2.md
+++ b/tests/data/2203.01017v2.md
@@ -1,3 +1,5 @@
+## TableFormer: Table Structure Understanding with Transformers.
+
 Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research

 { ahn,nli,mly,taa } @zurich.ibm.com
--- a/tests/data/2203.01017v2.pages.json
+++ b/tests/data/2203.01017v2.pages.json
--- a/tests/data/2206.01062.doctags.txt
+++ b/tests/data/2206.01062.doctags.txt
@@ -1,4 +1,5 @@
 <document>
+<subtitle-level-1><location><page_1><loc_17><loc_85><loc_83><loc_89></location>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</subtitle-level-1>
 <paragraph><location><page_1><loc_15><loc_77><loc_32><loc_83></location>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com</paragraph>
 <paragraph><location><page_1><loc_42><loc_77><loc_58><loc_83></location>Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com</paragraph>
 <paragraph><location><page_1><loc_68><loc_77><loc_85><loc_83></location>Michele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com</paragraph>
--- a/tests/data/2206.01062.json
+++ b/tests/data/2206.01062.json
--- a/tests/data/2206.01062.md
+++ b/tests/data/2206.01062.md
@@ -1,3 +1,5 @@
+## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
+
 Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com

 Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com
--- a/tests/data/2206.01062.pages.json
+++ b/tests/data/2206.01062.pages.json
--- a/tests/data/2305.03393v1.doctags.txt
+++ b/tests/data/2305.03393v1.doctags.txt
@@ -1,4 +1,5 @@
 <document>
+<subtitle-level-1><location><page_1><loc_22><loc_81><loc_79><loc_85></location>Optimized Table Tokenization for Table Structure Recognition</subtitle-level-1>
 <paragraph><location><page_1><loc_23><loc_74><loc_78><loc_79></location>Maksym Lysak [0000 - 0002 - 3723 - $^{6960]}$, Ahmed Nassar[0000 - 0002 - 9468 - $^{0822]}$, Nikolaos Livathinos [0000 - 0001 - 8513 - $^{3491]}$, Christoph Auer[0000 - 0001 - 5761 - $^{0422]}$, and Peter Staar [0000 - 0002 - 8088 - 0823]</paragraph>
 <paragraph><location><page_1><loc_36><loc_70><loc_64><loc_73></location>IBM Research {mly,ahn,nli,cau,taa}@zurich.ibm.com</paragraph>
 <paragraph><location><page_1><loc_27><loc_41><loc_74><loc_66></location>Abstract. Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs. Popular table structure data-sets will be published in OTSL format to the community.</paragraph>
--- a/tests/data/2305.03393v1.json
+++ b/tests/data/2305.03393v1.json
--- a/tests/data/2305.03393v1.md
+++ b/tests/data/2305.03393v1.md
@@ -1,3 +1,5 @@
+## Optimized Table Tokenization for Table Structure Recognition
+
 Maksym Lysak [0000 - 0002 - 3723 - $^{6960]}$, Ahmed Nassar[0000 - 0002 - 9468 - $^{0822]}$, Nikolaos Livathinos [0000 - 0001 - 8513 - $^{3491]}$, Christoph Auer[0000 - 0001 - 5761 - $^{0422]}$, and Peter Staar [0000 - 0002 - 8088 - 0823]

 IBM Research {mly,ahn,nli,cau,taa}@zurich.ibm.com
--- a/tests/data/2305.03393v1.pages.json
+++ b/tests/data/2305.03393v1.pages.json
--- a/tests/data/redp5110.doctags.txt
+++ b/tests/data/redp5110.doctags.txt
@@ -1,12 +1,13 @@
 <document>
-<paragraph><location><page_1><loc_6><loc_59><loc_35><loc_63></location>Implement roles and separation of duties</paragraph>
-<paragraph><location><page_1><loc_6><loc_52><loc_33><loc_56></location>Leverage row permissions on the database</paragraph>
-<paragraph><location><page_1><loc_6><loc_45><loc_32><loc_49></location>Protect columns by defining column masks</paragraph>
-<paragraph><location><page_1><loc_6><loc_3><loc_27><loc_5></location>ibm.com /redbooks</paragraph>
 <paragraph><location><page_1><loc_47><loc_94><loc_68><loc_96></location>Front cover</paragraph>
 <figure>
 <location><page_1><loc_84><loc_93><loc_96><loc_97></location>
 </figure>
+<subtitle-level-1><location><page_1><loc_6><loc_79><loc_96><loc_89></location>Row and Column Access Control Support in IBM DB2 for i</subtitle-level-1>
+<paragraph><location><page_1><loc_6><loc_59><loc_35><loc_63></location>Implement roles and separation of duties</paragraph>
+<paragraph><location><page_1><loc_6><loc_52><loc_33><loc_56></location>Leverage row permissions on the database</paragraph>
+<paragraph><location><page_1><loc_6><loc_45><loc_32><loc_49></location>Protect columns by defining column masks</paragraph>
+<paragraph><location><page_1><loc_6><loc_3><loc_27><loc_5></location>ibm.com /redbooks</paragraph>
 <paragraph><location><page_1><loc_81><loc_12><loc_95><loc_27></location>Jim Bainbridge Hernando Bedoya Rob Bestgen Mike Cain Dan Cruikshank Jim Denton Doug Mack Tom McKinley Kent Milligan</paragraph>
 <figure>
 <location><page_1><loc_51><loc_2><loc_95><loc_10></location>
--- a/tests/data/redp5110.json
+++ b/tests/data/redp5110.json
--- a/tests/data/redp5110.md
+++ b/tests/data/redp5110.md
@@ -1,3 +1,10 @@
+Front cover
+
+
+<!-- image -->
+
+## Row and Column Access Control Support in IBM DB2 for i
+
 Implement roles and separation of duties

 Leverage row permissions on the database
@@ -6,11 +13,6 @@ Protect columns by defining column masks

 ibm.com /redbooks

-Front cover
-
-
-<!-- image -->
-
 Jim Bainbridge Hernando Bedoya Rob Bestgen Mike Cain Dan Cruikshank Jim Denton Doug Mack Tom McKinley Kent Milligan


--- a/tests/data/redp5110.pages.json
+++ b/tests/data/redp5110.pages.json
--- a/tests/data/redp5695.doctags.txt
+++ b/tests/data/redp5695.doctags.txt
@@ -1,4 +1,9 @@
 <document>
+<paragraph><location><page_1><loc_47><loc_96><loc_68><loc_99></location>Front cover</paragraph>
+<figure>
+<location><page_1><loc_67><loc_90><loc_93><loc_96></location>
+</figure>
+<subtitle-level-1><location><page_1><loc_7><loc_75><loc_88><loc_86></location>IBM Cloud Pak for Data on IBM Z</subtitle-level-1>
 <paragraph><location><page_1><loc_7><loc_60><loc_20><loc_62></location>Jasmeet Bhatia</paragraph>
 <paragraph><location><page_1><loc_7><loc_57><loc_20><loc_59></location>Ravi Gummadi</paragraph>
 <paragraph><location><page_1><loc_7><loc_51><loc_21><loc_52></location>Srirama Sharma</paragraph>
@@ -14,10 +19,6 @@
 <figure>
 <location><page_1><loc_7><loc_3><loc_21><loc_8></location>
 </figure>
-<paragraph><location><page_1><loc_47><loc_96><loc_68><loc_99></location>Front cover</paragraph>
-<figure>
-<location><page_1><loc_67><loc_90><loc_93><loc_96></location>
-</figure>
 <figure>
 <location><page_1><loc_24><loc_13><loc_99><loc_62></location>
 </figure>
--- a/tests/data/redp5695.json
+++ b/tests/data/redp5695.json
--- a/tests/data/redp5695.md
+++ b/tests/data/redp5695.md
@@ -1,3 +1,10 @@
+Front cover
+
+
+<!-- image -->
+
+## IBM Cloud Pak for Data on IBM Z
+
 Jasmeet Bhatia

 Ravi Gummadi
@@ -14,11 +21,6 @@ Srirama Sharma
 <!-- image -->


-<!-- image -->
-
-Front cover
-
-
 <!-- image -->


--- a/tests/data/redp5695.pages.json
+++ b/tests/data/redp5695.pages.json