Mirror of https://github.com/DS4SD/docling.git (synced 2025-07-27 04:24:45 +00:00)
fix(tests): Refactor the data_scanned test data with a very simple document that allows all OCR engines to produce the same result.
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
This commit is contained in:
parent 7532ede7f4
commit be6489bde0
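For orientation, a minimal sketch of the check this commit enables: every OCR engine is expected to reproduce the same ground truth for the new, very simple fixture, so the comparison can be an exact string match instead of a fuzzy one. The sketch only uses API names visible in the diff below (DocumentConverter.convert_single, render_as_markdown); the converter construction and OCR-engine selection are simplified assumptions, not the test's actual setup.

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

# Sketch only: convert the new fixture and compare against the single,
# engine-agnostic ground truth added in this commit.
pdf_path = Path("tests/data_scanned/ocr_test.pdf")
converter = DocumentConverter()  # assumption: default pipeline with OCR enabled
doc_result = converter.convert_single(pdf_path)

# Exact match, no fuzzy threshold: all OCR engines should agree on this document.
expected_md = pdf_path.with_suffix(".md").read_text()
assert doc_result.render_as_markdown() == expected_md
```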
tests/data_scanned/ocr_test.doctags.txt (new file, +5 lines)
@ -0,0 +1,5 @@
<document>
<subtitle-level-1><location><page_1><loc_12><loc_89><loc_21><loc_91></location>Docling</subtitle-level-1>
<paragraph><location><page_1><loc_12><loc_84><loc_84><loc_87></location>Docling bundles PDF document conversion to JSON and Markdown in an easy, selfcontained package.</paragraph>
<paragraph><location><page_1><loc_12><loc_58><loc_87><loc_80></location>Features Converts any PDF document to JSON or Markdown format, stable and lightning fast. Understands detailed page layout, reading order and recovers table structures. Extracts metadata from the document, such as title, authors, references and language. Includes OCR support for scanned PDFs. Integrates easily with LLM app / RAG frameworks like LlamaIndex and LangChain Provides a simple and convenient CLI.</paragraph>
</document>
tests/data_scanned/ocr_test.json (new file, +1 line)
@ -0,0 +1 @@
{"_name": "", "type": "pdf-document", "description": {"logs": []}, "file-info": {"filename": "ocr_test.pdf", "document-hash": "1e6966b64695f3e77f2931dfd42c79050f4a47cd9c53eb32dc061c98a3129b05", "#-pages": 1, "page-hashes": [{"hash": "5b246e5b7c627e174ffcbbe2a41131c2f19e4c2b02314f6bc9ca65c11f9b8d76", "model": "default", "page": 1}]}, "main-text": [{"prov": [{"bbox": [71.608642578125, 750.5054931640625, 127.90485382080078, 770.1392211914062], "page": 1, "span": [0, 7]}], "text": "Docling", "type": "subtitle-level-1", "name": "Section-header"}, {"prov": [{"bbox": [71.54174041748047, 703.8960571289062, 498.7333068847656, 733.1880493164062], "page": 1, "span": [0, 95]}], "text": "Docling bundles PDF document conversion to JSON and Markdown in an easy, selfcontained package.", "type": "paragraph", "name": "Text"}, {"prov": [{"bbox": [71.21173858642578, 484.2960510253906, 519.8010864257812, 674.6280517578125], "page": 1, "span": [0, 409]}], "text": "Features Converts any PDF document to JSON or Markdown format, stable and lightning fast. Understands detailed page layout, reading order and recovers table structures. Extracts metadata from the document, such as title, authors, references and language. Includes OCR support for scanned PDFs. Integrates easily with LLM app / RAG frameworks like LlamaIndex and LangChain Provides a simple and convenient CLI.", "type": "paragraph", "name": "Text"}], "figures": [], "tables": [], "equations": [], "footnotes": [], "page-dimensions": [{"height": 841.9200439453125, "page": 1, "width": 595.2000122070312}], "page-footers": [], "page-headers": []}
tests/data_scanned/ocr_test.md (new file, +5 lines)
@ -0,0 +1,5 @@
## Docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, selfcontained package.

Features Converts any PDF document to JSON or Markdown format, stable and lightning fast. Understands detailed page layout, reading order and recovers table structures. Extracts metadata from the document, such as title, authors, references and language. Includes OCR support for scanned PDFs. Integrates easily with LLM app / RAG frameworks like LlamaIndex and LangChain Provides a simple and convenient CLI.
tests/data_scanned/ocr_test.pages.json (new file, +1 line)
File diff suppressed because one or more lines are too long
tests/data_scanned/ocr_test.pdf (new binary file, not shown)
tests/data_scanned/ocr_test.png (new binary file, not shown; after: 82 KiB)
@ -1,20 +0,0 @@
|
||||
<document>
|
||||
<subtitle-level-1><location><page_1><loc_21><loc_83><loc_76><loc_87></location>TableFormer: Table Structure Understanding with Transformers</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_78><loc_29><loc_80></location>1. Details on the datasets</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_76><loc_25><loc_78></location>1.1. Data preparation</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_8><loc_51><loc_47><loc_75></location>As a first step of our data preparation process; we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes) A table is considered to be simple if it does not contain row spans or column spans. Addition ally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row Or column spans. Therefore a strict HTML structure looks always rectangular: However; HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity; which we want lo avoid. As such, we prefer to have strict" tables, i.e. tables where every row has exactly the same length.</paragraph>
|
||||
<paragraph><location><page_1><loc_8><loc_20><loc_47><loc_51></location>We have developed technique that tries to derive missing bounding box out of its neighbors. As a first step; we use the annotation data to generate the most fine'grained that covers the table structure. In case of strict HTML tables. all squares are associated with some table cell and in the presence of table spans a cell extends across mul tiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally; the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML ta bles is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 489 of the simple and 699 of the complex tables. RegardFinTabNet, 689 of the simple and 98% of the complex tables require the generation of bounding boxes grid grid ing</paragraph>
|
||||
<paragraph><location><page_1><loc_8><loc_18><loc_47><loc_21></location>Figure 7 illustrates the distribution of the tables across different dimensions per dataset.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_15><loc_25><loc_17></location>1.2. Synthetic datasets</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_8><loc_10><loc_47><loc_14></location>Aiming t0 train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets_ Each one contains tables with different appear -</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_36><loc_82><loc_62><loc_85></location>Supplementary Material</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_50><loc_74><loc_89><loc_80></location>ances in regard to their size; structure, and content. synthetic dataset contains 150k examples, summing up to 60Ok synthetic examples. All datasets are divided into Train; Test and Val splits (8O%, 1O%o , 109) . style Every</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_70><loc_89><loc_74></location>The process of generating a synthetic dataset can be decomposed into the following steps:</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_60><loc_89><loc_71></location>1 Prepare styling and content templates: The styling templates have been manually designed and organized into groups of scope specific appearances financial data, marketing data; etc.) Additionally; we have prepared curated collections of content templates by extracting the most frequently used terms out of non-synthetic datasets PubTabNet, FinTabNet, etc.). (e.g (e.g</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_43><loc_89><loc_60></location>2 Generate table structures: The structure of each synthetic dataset assumes a horizontal table header which potentially spans ovCr multiple rows and table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header body boundary. The table structure is described by the parameters: Total number of table rows and columns, number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans) maximum span size and the ratio of the table area covered by spans</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_37><loc_89><loc_43></location>3 Generate content: Based on the dataset theme. a set of suitable content templates is chosen first. Then; this content can be combined with purely random text to produce the synthetic content.</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_31><loc_89><loc_37></location>4 Apply styling templates: Depending on the domain of the synthetic dataset; a set of styling templates is first manually selected Then, style is randomly selected to format the appearance of the synthesized table.</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_23><loc_89><loc_31></location>5 Render the complete tables: The synthetic table is finally rendered by a web browser engine to generate the bounding boxes for each table cell. A batching technique is utilized to optimize the runtime overhead of the rendering process.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_50><loc_18><loc_89><loc_22></location>2. Prediction post-processing for PDF documents</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_50><loc_9><loc_89><loc_17></location>Although TableFormer can predict the table structure and the bounding boxes for tables recognized inside PDF docu ments, this is not enough when a full reconstruction of the original table is required. This happens mainly due the folrcasons: lowing7</paragraph>
|
||||
</document>
|
File diff suppressed because one or more lines are too long
@ -1,35 +0,0 @@
|
||||
## TableFormer: Table Structure Understanding with Transformers
|
||||
|
||||
## 1. Details on the datasets
|
||||
|
||||
## 1.1. Data preparation
|
||||
|
||||
As a first step of our data preparation process; we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes) A table is considered to be simple if it does not contain row spans or column spans. Addition ally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row Or column spans. Therefore a strict HTML structure looks always rectangular: However; HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity; which we want lo avoid. As such, we prefer to have strict" tables, i.e. tables where every row has exactly the same length.
|
||||
|
||||
We have developed technique that tries to derive missing bounding box out of its neighbors. As a first step; we use the annotation data to generate the most fine'grained that covers the table structure. In case of strict HTML tables. all squares are associated with some table cell and in the presence of table spans a cell extends across mul tiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally; the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML ta bles is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 489 of the simple and 699 of the complex tables. RegardFinTabNet, 689 of the simple and 98% of the complex tables require the generation of bounding boxes grid grid ing
|
||||
|
||||
Figure 7 illustrates the distribution of the tables across different dimensions per dataset.
|
||||
|
||||
## 1.2. Synthetic datasets
|
||||
|
||||
Aiming t0 train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets_ Each one contains tables with different appear -
|
||||
|
||||
## Supplementary Material
|
||||
|
||||
ances in regard to their size; structure, and content. synthetic dataset contains 150k examples, summing up to 60Ok synthetic examples. All datasets are divided into Train; Test and Val splits (8O%, 1O%o , 109) . style Every
|
||||
|
||||
The process of generating a synthetic dataset can be decomposed into the following steps:
|
||||
|
||||
1 Prepare styling and content templates: The styling templates have been manually designed and organized into groups of scope specific appearances financial data, marketing data; etc.) Additionally; we have prepared curated collections of content templates by extracting the most frequently used terms out of non-synthetic datasets PubTabNet, FinTabNet, etc.). (e.g (e.g
|
||||
|
||||
2 Generate table structures: The structure of each synthetic dataset assumes a horizontal table header which potentially spans ovCr multiple rows and table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header body boundary. The table structure is described by the parameters: Total number of table rows and columns, number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans) maximum span size and the ratio of the table area covered by spans
|
||||
|
||||
3 Generate content: Based on the dataset theme. a set of suitable content templates is chosen first. Then; this content can be combined with purely random text to produce the synthetic content.
|
||||
|
||||
4 Apply styling templates: Depending on the domain of the synthetic dataset; a set of styling templates is first manually selected Then, style is randomly selected to format the appearance of the synthesized table.
|
||||
|
||||
5 Render the complete tables: The synthetic table is finally rendered by a web browser engine to generate the bounding boxes for each table cell. A batching technique is utilized to optimize the runtime overhead of the rendering process.
|
||||
|
||||
## 2. Prediction post-processing for PDF documents
|
||||
|
||||
Although TableFormer can predict the table structure and the bounding boxes for tables recognized inside PDF docu ments, this is not enough when a full reconstruction of the original table is required. This happens mainly due the folrcasons: lowing7
|
File diff suppressed because one or more lines are too long
Binary file not shown.
Binary file not shown. (Before: 496 KiB)
@ -1,19 +0,0 @@
|
||||
<document>
|
||||
<subtitle-level-1><location><page_1><loc_22><loc_83><loc_76><loc_86></location>TableFormer: Table Structure Understanding with Transformers Supplementary Material</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_78><loc_29><loc_80></location>1. Details on the datasets</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_76><loc_25><loc_77></location>1.1. Data preparation</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_8><loc_51><loc_47><loc_75></location>As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have "strict" tables, i.e. tables where every row has exactly the same length.</paragraph>
|
||||
<paragraph><location><page_1><loc_8><loc_21><loc_47><loc_51></location>We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</paragraph>
|
||||
<paragraph><location><page_1><loc_8><loc_18><loc_47><loc_20></location>Figure 7 illustrates the distribution of the tables across different dimensions per dataset.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_15><loc_25><loc_16></location>1.2. Synthetic datasets</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_8><loc_10><loc_47><loc_14></location>Aiming to train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets. Each one contains tables with different appear-</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_74><loc_89><loc_80></location>ances in regard to their size, structure, style and content. Every synthetic dataset contains 150k examples, summing up to 600k synthetic examples. All datasets are divided into Train, Test and Val splits (80%, 10%, 10%).</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_71><loc_89><loc_73></location>The process of generating a synthetic dataset can be decomposed into the following steps:</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_60><loc_89><loc_70></location>1. Prepare styling and content templates: The styling templates have been manually designed and organized into groups of scope specific appearances (e.g. financial data, marketing data, etc.) Additionally, we have prepared curated collections of content templates by extracting the most frequently used terms out of non-synthetic datasets (e.g. PubTabNet, FinTabNet, etc.).</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_43><loc_89><loc_60></location>2. Generate table structures: The structure of each synthetic dataset assumes a horizontal table header which potentially spans over multiple rows and a table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header -body boundary. The table structure is described by the parameters: Total number of table rows and columns, number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans), maximum span size and the ratio of the table area covered by spans.</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_37><loc_89><loc_43></location>3. Generate content: Based on the dataset theme, a set of suitable content templates is chosen first. Then, this content can be combined with purely random text to produce the synthetic content.</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_31><loc_89><loc_37></location>4. Apply styling templates: Depending on the domain of the synthetic dataset, a set of styling templates is first manually selected. Then, a style is randomly selected to format the appearance of the synthesized table.</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_23><loc_89><loc_31></location>5. Render the complete tables: The synthetic table is finally rendered by a web browser engine to generate the bounding boxes for each table cell. A batching technique is utilized to optimize the runtime overhead of the rendering process.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_50><loc_18><loc_89><loc_22></location>2. Prediction post-processing for PDF documents</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_50><loc_10><loc_89><loc_17></location>Although TableFormer can predict the table structure and the bounding boxes for tables recognized inside PDF documents, this is not enough when a full reconstruction of the original table is required. This happens mainly due the following reasons:</paragraph>
|
||||
</document>
|
File diff suppressed because one or more lines are too long
@ -1,33 +0,0 @@
|
||||
## TableFormer: Table Structure Understanding with Transformers Supplementary Material
|
||||
|
||||
## 1. Details on the datasets
|
||||
|
||||
## 1.1. Data preparation
|
||||
|
||||
As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured in the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HTML structure if every row has the same number of columns after taking into account any row or column spans. Therefore a strict HTML structure looks always rectangular. However, HTML is a lenient encoding format, i.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. These implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have "strict" tables, i.e. tables where every row has exactly the same length.
|
||||
|
||||
We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.
|
||||
|
||||
Figure 7 illustrates the distribution of the tables across different dimensions per dataset.
|
||||
|
||||
## 1.2. Synthetic datasets
|
||||
|
||||
Aiming to train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets. Each one contains tables with different appear-
|
||||
|
||||
ances in regard to their size, structure, style and content. Every synthetic dataset contains 150k examples, summing up to 600k synthetic examples. All datasets are divided into Train, Test and Val splits (80%, 10%, 10%).
|
||||
|
||||
The process of generating a synthetic dataset can be decomposed into the following steps:
|
||||
|
||||
1. Prepare styling and content templates: The styling templates have been manually designed and organized into groups of scope specific appearances (e.g. financial data, marketing data, etc.) Additionally, we have prepared curated collections of content templates by extracting the most frequently used terms out of non-synthetic datasets (e.g. PubTabNet, FinTabNet, etc.).
|
||||
|
||||
2. Generate table structures: The structure of each synthetic dataset assumes a horizontal table header which potentially spans over multiple rows and a table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header -body boundary. The table structure is described by the parameters: Total number of table rows and columns, number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans), maximum span size and the ratio of the table area covered by spans.
|
||||
|
||||
3. Generate content: Based on the dataset theme, a set of suitable content templates is chosen first. Then, this content can be combined with purely random text to produce the synthetic content.
|
||||
|
||||
4. Apply styling templates: Depending on the domain of the synthetic dataset, a set of styling templates is first manually selected. Then, a style is randomly selected to format the appearance of the synthesized table.
|
||||
|
||||
5. Render the complete tables: The synthetic table is finally rendered by a web browser engine to generate the bounding boxes for each table cell. A batching technique is utilized to optimize the runtime overhead of the rendering process.
|
||||
|
||||
## 2. Prediction post-processing for PDF documents
|
||||
|
||||
Although TableFormer can predict the table structure and the bounding boxes for tables recognized inside PDF documents, this is not enough when a full reconstruction of the original table is required. This happens mainly due the following reasons:
|
File diff suppressed because one or more lines are too long
@ -1,25 +0,0 @@
|
||||
<document>
|
||||
<paragraph><location><page_1><loc_8><loc_86><loc_47><loc_90></location>Aiming to train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets. Each one contains tables with different appear-</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_83><loc_89><loc_86></location>1.2. Synthetic datasets the bounding boxes for tables recognized inside PDF docu-</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_8><loc_80><loc_47><loc_82></location>Figure / illustrates the distribution of the tables across different dimensions per dataset.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_78><loc_39><loc_80></location>tables require the generation of bounding boxes.</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_76><loc_47><loc_78></location>ing FinlabNet, 68% of the simple and 98% of the complex</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_8><loc_75><loc_47><loc_76></location>48% of the simple and 69% of the complex tables. Regard-</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_8><loc_76><loc_47><loc_78></location>ing FinlabNet, 68% of the simple and 98% of the complex</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_8><loc_51><loc_47><loc_75></location>missing bounding box out of its neighbors. As a first step. we use the annotation data to generate the most fine-grained erid that covers the table structure. In case of strict HIML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it 1s possible to compute the geometrical border lines between the grid rows and columns. Eventually this information 1s used to generate the missing bounding boxes. Additionally, the existence of unused grid Squares indicates that the table rows have unequal number of columns and the overall structure 1s non-strict. [he generation of missing bounding boxes for non-strict HI ML tables 1s ambiguous and therefore quite challenging. lhus, we have decided to simply discard those tables. In case of Pub labNet we have computed missing bounding boxes for</paragraph>
|
||||
<paragraph><location><page_1><loc_8><loc_21><loc_47><loc_51></location>1.1. Data preparation As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured 1n the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HI ML structure 1f every row has the same number of columns after taking into account any row or column spans. [Therefore a strict HI ML structure looks always rectangular. However, HI ML 1s a lenient encoding format, 1.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. [hese implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, 1.e. tables where every row has exactly the same length. We have developed a technique that tries to derive a</paragraph>
|
||||
<paragraph><location><page_1><loc_8><loc_20><loc_29><loc_21></location>1. Details on the datasets</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_86><loc_89><loc_90></location>ments, this 1s not enough when a full reconstruction of the original table 1s required. [his happens mainly due the following reasons</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_52><loc_83><loc_89><loc_84></location>Although lableFormer can predict the table structure and</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_53><loc_80><loc_58><loc_81></location>ments</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_74><loc_89><loc_80></location>utilized to optimize the runtime overhead of the rendering DIOCESS. 2. Prediction post-processing for PDF docu-</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_71><loc_89><loc_74></location>finally rendered by a web browser engine to generate the bounding boxes for each table cell. A batching technique 1s</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_60><loc_89><loc_71></location>can be combined with purely random text to produce the synthetic content. 4. Apply styling templates: Depending on the domain of the synthetic dataset, a set of styling templates 1s first manually selected. Ihen, a style is randomly selected to format the appearance of the synthesized table. 5. Render the complete tables: The synthetic table 1s</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_43><loc_89><loc_60></location>tentially spans over multiple rows and a table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header - body boundary. Ihe table structure 1s described by the parameters: Total number of table rows and columns. number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans), maximum span size and the ratio of the table area covered by spans. Generate content: Based on the dataset theme. a set of suitable content templates 1s chosen first. Then, this content</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_37><loc_89><loc_43></location>frequently used terms out of non-synthetic datasets (e.g. Pub labNet, Fin LabNet, etc.). 2. Generate table structures: [he structure of each synthetic dataset assumes a horizontal table header which po-</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_31><loc_89><loc_37></location>templates have been manually designed and organized into groups of scope specific appearances (e.g. financial data. marketing data, etc.) Additionally, we have prepared curated collections of content templates by extracting the most</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_23><loc_89><loc_31></location>up to 600K synthetic examples. All datasets are divided into Train, lest and Val splits (8O%, 10%, 10%). The process of generating a synthetic dataset can be decomposed into the following steps: |. Prepare styling and content templates: The styling</paragraph>
|
||||
<paragraph><location><page_1><loc_50><loc_22><loc_89><loc_23></location>Every synthetic dataset contains 150k examples, summing</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_50><loc_18><loc_89><loc_22></location>ances in regard to their size, structure, style and content.</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_22><loc_10><loc_89><loc_17></location>TableFormer: Table Structure Understanding with Transformers Supplementary Material</paragraph>
|
||||
</document>
|
File diff suppressed because one or more lines are too long
@ -1,45 +0,0 @@
|
||||
Aiming to train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets. Each one contains tables with different appear-
|
||||
|
||||
## 1.2. Synthetic datasets the bounding boxes for tables recognized inside PDF docu-
|
||||
|
||||
Figure / illustrates the distribution of the tables across different dimensions per dataset.
|
||||
|
||||
## tables require the generation of bounding boxes.
|
||||
|
||||
## ing FinlabNet, 68% of the simple and 98% of the complex
|
||||
|
||||
48% of the simple and 69% of the complex tables. Regard-
|
||||
|
||||
## ing FinlabNet, 68% of the simple and 98% of the complex
|
||||
|
||||
missing bounding box out of its neighbors. As a first step. we use the annotation data to generate the most fine-grained erid that covers the table structure. In case of strict HIML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it 1s possible to compute the geometrical border lines between the grid rows and columns. Eventually this information 1s used to generate the missing bounding boxes. Additionally, the existence of unused grid Squares indicates that the table rows have unequal number of columns and the overall structure 1s non-strict. [he generation of missing bounding boxes for non-strict HI ML tables 1s ambiguous and therefore quite challenging. lhus, we have decided to simply discard those tables. In case of Pub labNet we have computed missing bounding boxes for
|
||||
|
||||
1.1. Data preparation As a first step of our data preparation process, we have calculated statistics over the datasets across the following dimensions: (1) table size measured 1n the number of rows and columns, (2) complexity of the table, (3) strictness of the provided HTML structure and (4) completeness (i.e. no omitted bounding boxes). A table is considered to be simple if it does not contain row spans or column spans. Additionally, a table has a strict HI ML structure 1f every row has the same number of columns after taking into account any row or column spans. [Therefore a strict HI ML structure looks always rectangular. However, HI ML 1s a lenient encoding format, 1.e. tables with rows of different sizes might still be regarded as correct due to implicit display rules. [hese implicit rules leave room for ambiguity, which we want to avoid. As such, we prefer to have 'strict' tables, 1.e. tables where every row has exactly the same length. We have developed a technique that tries to derive a
|
||||
|
||||
1. Details on the datasets
|
||||
|
||||
ments, this 1s not enough when a full reconstruction of the original table 1s required. [his happens mainly due the following reasons
|
||||
|
||||
## Although lableFormer can predict the table structure and
|
||||
|
||||
ments
|
||||
|
||||
utilized to optimize the runtime overhead of the rendering DIOCESS. 2. Prediction post-processing for PDF docu-
|
||||
|
||||
finally rendered by a web browser engine to generate the bounding boxes for each table cell. A batching technique 1s
|
||||
|
||||
can be combined with purely random text to produce the synthetic content. 4. Apply styling templates: Depending on the domain of the synthetic dataset, a set of styling templates 1s first manually selected. Ihen, a style is randomly selected to format the appearance of the synthesized table. 5. Render the complete tables: The synthetic table 1s
|
||||
|
||||
tentially spans over multiple rows and a table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header - body boundary. Ihe table structure 1s described by the parameters: Total number of table rows and columns. number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans), maximum span size and the ratio of the table area covered by spans. Generate content: Based on the dataset theme. a set of suitable content templates 1s chosen first. Then, this content
|
||||
|
||||
frequently used terms out of non-synthetic datasets (e.g. Pub labNet, Fin LabNet, etc.). 2. Generate table structures: [he structure of each synthetic dataset assumes a horizontal table header which po-
|
||||
|
||||
templates have been manually designed and organized into groups of scope specific appearances (e.g. financial data. marketing data, etc.) Additionally, we have prepared curated collections of content templates by extracting the most
|
||||
|
||||
up to 600K synthetic examples. All datasets are divided into Train, lest and Val splits (8O%, 10%, 10%). The process of generating a synthetic dataset can be decomposed into the following steps: |. Prepare styling and content templates: The styling
|
||||
|
||||
Every synthetic dataset contains 150k examples, summing
|
||||
|
||||
## ances in regard to their size, structure, style and content.
|
||||
|
||||
TableFormer: Table Structure Understanding with Transformers Supplementary Material
|
File diff suppressed because one or more lines are too long
@ -1,21 +0,0 @@
|
||||
<document>
|
||||
<subtitle-level-1><location><page_1><loc_16><loc_87><loc_82><loc_91></location>UNIVERSITYof HOUSTON | CLASS</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_30><loc_83><loc_70><loc_86></location>Professional Development Award for Staff</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_11><loc_80><loc_20><loc_82></location>Purpose</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_11><loc_69><loc_88><loc_80></location>The Dean's Professional Development Award for Staff is to allow CLASS staff the opportunity to attend conferences and workshops in their field for the sole purpose of professional development. The intent is to defray costs associated with attendance. The maximum amount of the award is $2,000 per staff member. Up to four awards will be made per year, contingent upon the availability of funding. Staff members that are awarded must wait three years from the date of award notification before reapplying again.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_66><loc_21><loc_68></location>Eligibility</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_64><loc_51><loc_66></location>All staff currently employed in CLASS are eligible.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_61><loc_37><loc_63></location>What the Award Will Fund</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_58><loc_56><loc_61></location>Costs associated with conference/workshop including:</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_57><loc_23><loc_58></location>Airfare</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_55><loc_24><loc_57></location>Lodging</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_53><loc_23><loc_55></location>Meals</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_51><loc_32><loc_53></location>Registration fees</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_49><loc_37><loc_51></location>Ground Transportation</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_46><loc_41><loc_48></location>What the Award Will Not Fund</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_43><loc_78><loc_46></location>expenses incurred outside of the scope of the proposed development activity. Any</paragraph>
|
||||
<paragraph><location><page_1><loc_11><loc_40><loc_29><loc_43></location>Granting Schedule</paragraph>
|
||||
<paragraph><location><page_1><loc_11><loc_34><loc_42><loc_42></location>Earliest Submission Date: August 1st Applications Due: October 1s Notification of Awards: November 1st</paragraph>
|
||||
<paragraph><location><page_1><loc_11><loc_28><loc_85><loc_32></location>Please submit applications to CLASSGrt@uh edu by the deadline. Please write "Professional Development-Staff" in the subject line.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_19><loc_86><loc_27></location>PLEASE NOTE: Please include a supporting letter from your Department Chair or Immediate Supervisor. Incomplete applications not be reviewed. Applications will be considered incomplete until all information has been received, at which time an email confirming receipt will be sent to you. will</paragraph>
|
||||
</document>
|
File diff suppressed because one or more lines are too long
@ -1,37 +0,0 @@
|
||||
## UNIVERSITYof HOUSTON | CLASS
|
||||
|
||||
## Professional Development Award for Staff
|
||||
|
||||
## Purpose
|
||||
|
||||
The Dean's Professional Development Award for Staff is to allow CLASS staff the opportunity to attend conferences and workshops in their field for the sole purpose of professional development. The intent is to defray costs associated with attendance. The maximum amount of the award is $2,000 per staff member. Up to four awards will be made per year, contingent upon the availability of funding. Staff members that are awarded must wait three years from the date of award notification before reapplying again.
|
||||
|
||||
## Eligibility
|
||||
|
||||
All staff currently employed in CLASS are eligible.
|
||||
|
||||
## What the Award Will Fund
|
||||
|
||||
Costs associated with conference/workshop including:
|
||||
|
||||
Airfare
|
||||
|
||||
Lodging
|
||||
|
||||
Meals
|
||||
|
||||
Registration fees
|
||||
|
||||
Ground Transportation
|
||||
|
||||
## What the Award Will Not Fund
|
||||
|
||||
expenses incurred outside of the scope of the proposed development activity. Any
|
||||
|
||||
Granting Schedule
|
||||
|
||||
Earliest Submission Date: August 1st Applications Due: October 1s Notification of Awards: November 1st
|
||||
|
||||
Please submit applications to CLASSGrt@uh edu by the deadline. Please write "Professional Development-Staff" in the subject line.
|
||||
|
||||
PLEASE NOTE: Please include a supporting letter from your Department Chair or Immediate Supervisor. Incomplete applications not be reviewed. Applications will be considered incomplete until all information has been received, at which time an email confirming receipt will be sent to you. will
|
File diff suppressed because one or more lines are too long
Binary file not shown.
Binary file not shown. (Before: 216 KiB)
@ -1,21 +0,0 @@
|
||||
<document>
|
||||
<subtitle-level-1><location><page_1><loc_16><loc_87><loc_82><loc_91></location>UNIVERSITYof HOUSTON CLASS</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_30><loc_84><loc_70><loc_86></location>Professional Development Award for Staff</subtitle-level-1>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_80><loc_20><loc_82></location>Purpose</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_69><loc_88><loc_80></location>The Dean's Professional Development Award for Staff is to allow CLASS staff the opportunity to attend conferences and workshops in their field for the sole purpose of professional development. The intent is to defray costs associated with attendance. The maximum amount of the award is $2,000 per staff member. Up to four awards will be made per year, contingent upon the availability of funding. Staff members that are awarded must wait three years from the date of award notification before reapplying again.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_66><loc_20><loc_68></location>Eligibility</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_64><loc_51><loc_66></location>All staff currently employed in CLASS are eligible.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_61><loc_37><loc_63></location>What the Award Will Fund</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_59><loc_56><loc_60></location>Costs associated with conference/workshop including:</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_57><loc_23><loc_58></location>e Airfare</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_55><loc_24><loc_57></location>e Lodging</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_53><loc_23><loc_55></location>e Meals</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_51><loc_31><loc_53></location>e Registration fees</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_49><loc_36><loc_51></location>e Ground Transportation</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_46><loc_41><loc_48></location>What the Award Will Not Fund</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_44><loc_78><loc_45></location>Any expenses incurred outside of the scope of the proposed development activity.</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_40><loc_29><loc_42></location>Granting Schedule</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_34><loc_42><loc_42></location>Granting Schedule Earliest Submission Date: August 1° Applications Due: October 1° Notification of Awards: November 1°</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_28><loc_85><loc_32></location>Please submit applications to CLASSGrt@uh.edu by the deadline. Please write "Professional DevelopmentStaff" in the subject line.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_19><loc_86><loc_27></location>PLEASE NOTE: Please include a supporting letter from your Department Chair or Immediate Supervisor. Incomplete applications will not be reviewed. Applications will be considered incomplete until all information has been received, at which time an email confirming receipt will be sent to you.</paragraph>
|
||||
</document>
|
File diff suppressed because one or more lines are too long
@ -1,37 +0,0 @@
|
||||
## UNIVERSITYof HOUSTON CLASS
|
||||
|
||||
## Professional Development Award for Staff
|
||||
|
||||
## Purpose
|
||||
|
||||
The Dean's Professional Development Award for Staff is to allow CLASS staff the opportunity to attend conferences and workshops in their field for the sole purpose of professional development. The intent is to defray costs associated with attendance. The maximum amount of the award is $2,000 per staff member. Up to four awards will be made per year, contingent upon the availability of funding. Staff members that are awarded must wait three years from the date of award notification before reapplying again.
|
||||
|
||||
## Eligibility
|
||||
|
||||
All staff currently employed in CLASS are eligible.
|
||||
|
||||
## What the Award Will Fund
|
||||
|
||||
Costs associated with conference/workshop including:
|
||||
|
||||
e Airfare
|
||||
|
||||
e Lodging
|
||||
|
||||
e Meals
|
||||
|
||||
e Registration fees
|
||||
|
||||
e Ground Transportation
|
||||
|
||||
## What the Award Will Not Fund
|
||||
|
||||
Any expenses incurred outside of the scope of the proposed development activity.
|
||||
|
||||
## Granting Schedule
|
||||
|
||||
Granting Schedule Earliest Submission Date: August 1° Applications Due: October 1° Notification of Awards: November 1°
|
||||
|
||||
Please submit applications to CLASSGrt@uh.edu by the deadline. Please write "Professional DevelopmentStaff" in the subject line.
|
||||
|
||||
PLEASE NOTE: Please include a supporting letter from your Department Chair or Immediate Supervisor. Incomplete applications will not be reviewed. Applications will be considered incomplete until all information has been received, at which time an email confirming receipt will be sent to you.
|
File diff suppressed because one or more lines are too long
@ -1,22 +0,0 @@
|
||||
<document>
|
||||
<paragraph><location><page_1><loc_12><loc_68><loc_88><loc_81></location>Please submit applications to CLASSGrt@uh.edu by the deadline. Please write "Professional Development- Staff in the subject line. PLEASE NOTE: Please include a supporting letter from your Department Chair or Immediate Supervisor. Incomplete applications will not be reviewed. Applications will be considered incomplete until all information has been received, at which time an email confirming receipt will be sent to you.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_64><loc_51><loc_66></location>Notification of Awards: November 1°"</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_61><loc_37><loc_64></location>Applications Due: October 1°</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_58><loc_56><loc_60></location>Granting Schedule</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_60><loc_41><loc_62></location>Earliest Submission Date: August 1°</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_55><loc_77><loc_56></location>Any expenses incurred outside of the scope of the proposed development activity.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_53><loc_41><loc_55></location>What the Awara Will Not Fund</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_49><loc_36><loc_51></location>e Ground Transportation</paragraph>
|
||||
<paragraph><location><page_1><loc_15><loc_48><loc_31><loc_49></location>e Registration fees</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_46><loc_41><loc_48></location>Meals</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_44><loc_78><loc_45></location>e Lodging</paragraph>
|
||||
<subtitle-level-1><location><page_1><loc_12><loc_40><loc_29><loc_43></location>e Aijirtare</subtitle-level-1>
|
||||
<paragraph><location><page_1><loc_12><loc_34><loc_51><loc_43></location>All staff currently employed in CLASS are eligible. What the Awara Will Fund e Aijirtare</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_40><loc_56><loc_41></location>Costs associated with conference/workshop including:</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_32><loc_20><loc_34></location>Eligibility</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_28><loc_85><loc_32></location>members that are awarded must wait three years from the date of award notification before reapplying again.</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_19><loc_88><loc_27></location>The Dean's Professional Development Award for Staff is to allow CLASS staff the opportunity to attend conferences and workshops in their field for the sole purpose of professional development. The intent is to defray costs associated with attendance. The maximum amount of the award is $2,000 per staff member. Up to four awards will be made per year, contingent upon the availability of funding. Staff</paragraph>
|
||||
<paragraph><location><page_1><loc_12><loc_18><loc_19><loc_20></location>Purpose</paragraph>
|
||||
<paragraph><location><page_1><loc_30><loc_15><loc_70><loc_16></location>Professional Development Award for Staff</paragraph>
|
||||
<paragraph><location><page_1><loc_17><loc_9><loc_81><loc_13></location>UNIVERSITYof 'CLASS</paragraph>
|
||||
</document>
|
File diff suppressed because one or more lines are too long
@ -1,39 +0,0 @@
|
||||
Please submit applications to CLASSGrt@uh.edu by the deadline. Please write "Professional Development- Staff in the subject line. PLEASE NOTE: Please include a supporting letter from your Department Chair or Immediate Supervisor. Incomplete applications will not be reviewed. Applications will be considered incomplete until all information has been received, at which time an email confirming receipt will be sent to you.
|
||||
|
||||
Notification of Awards: November 1°"
|
||||
|
||||
## Applications Due: October 1°
|
||||
|
||||
Granting Schedule
|
||||
|
||||
Earliest Submission Date: August 1°
|
||||
|
||||
Any expenses incurred outside of the scope of the proposed development activity.
|
||||
|
||||
What the Awara Will Not Fund
|
||||
|
||||
e Ground Transportation
|
||||
|
||||
e Registration fees
|
||||
|
||||
## Meals
|
||||
|
||||
e Lodging
|
||||
|
||||
## e Aijirtare
|
||||
|
||||
All staff currently employed in CLASS are eligible. What the Awara Will Fund e Aijirtare
|
||||
|
||||
Costs associated with conference/workshop including:
|
||||
|
||||
Eligibility
|
||||
|
||||
members that are awarded must wait three years from the date of award notification before reapplying again.
|
||||
|
||||
The Dean's Professional Development Award for Staff is to allow CLASS staff the opportunity to attend conferences and workshops in their field for the sole purpose of professional development. The intent is to defray costs associated with attendance. The maximum amount of the award is $2,000 per staff member. Up to four awards will be made per year, contingent upon the availability of funding. Staff
|
||||
|
||||
Purpose
|
||||
|
||||
Professional Development Award for Staff
|
||||
|
||||
UNIVERSITYof 'CLASS
|
File diff suppressed because one or more lines are too long
@ -14,9 +14,6 @@ from docling.document_converter import DocumentConverter

from .verify_utils import verify_conversion_result

# from tests.verify_utils import verify_conversion_result


GENERATE = False

@ -27,21 +24,22 @@ def save_output(pdf_path: Path, doc_result: ConversionResult, engine: str):
    import os

    parent = pdf_path.parent
+   eng = "" if engine is None else ".{engine}"

-   dict_fn = os.path.join(parent, f"{pdf_path.stem}.{engine}.json")
+   dict_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.json")
    with open(dict_fn, "w") as fd:
        json.dump(doc_result.render_as_dict(), fd)

-   pages_fn = os.path.join(parent, f"{pdf_path.stem}.{engine}.pages.json")
+   pages_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.pages.json")
    pages = [p.model_dump() for p in doc_result.pages]
    with open(pages_fn, "w") as fd:
        json.dump(pages, fd)

-   doctags_fn = os.path.join(parent, f"{pdf_path.stem}.{engine}.doctags.txt")
+   doctags_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.doctags.txt")
    with open(doctags_fn, "w") as fd:
        fd.write(doc_result.render_as_doctags())

-   md_fn = os.path.join(parent, f"{pdf_path.stem}.{engine}.md")
+   md_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.md")
    with open(md_fn, "w") as fd:
        fd.write(doc_result.render_as_markdown())
@ -88,14 +86,12 @@ def test_e2e_conversions():
        doc_result: ConversionResult = converter.convert_single(pdf_path)

-       # # Save conversions
-       # save_output(pdf_path, doc_result, engine)
+       # Save conversions
+       # save_output(pdf_path, doc_result, None)

        # Debug
        verify_conversion_result(
            input_path=pdf_path,
            doc_result=doc_result,
            generate=GENERATE,
-           ocr_engine=ocr_options.kind,
-           fuzzy=True,
        )
@ -11,43 +11,7 @@ from docling.datamodel.base_models import ConversionStatus, Page
from docling.datamodel.document import ConversionResult


-def levenshtein(str1: str, str2: str) -> int:
-
-   # Ensure str1 is the shorter string to optimize memory usage
-   if len(str1) > len(str2):
-       str1, str2 = str2, str1
-
-   # Previous and current row buffers
-   previous_row = list(range(len(str2) + 1))
-   current_row = [0] * (len(str2) + 1)
-
-   # Compute the Levenshtein distance row by row
-   for i, c1 in enumerate(str1, start=1):
-       current_row[0] = i
-       for j, c2 in enumerate(str2, start=1):
-           insertions = previous_row[j] + 1
-           deletions = current_row[j - 1] + 1
-           substitutions = previous_row[j - 1] + (c1 != c2)
-           current_row[j] = min(insertions, deletions, substitutions)
-       # Swap rows for the next iteration
-       previous_row, current_row = current_row, previous_row
-
-   # The result is in the last element of the previous row
-   return previous_row[-1]
-
-
-def verify_text(gt: str, pred: str, fuzzy: bool, fuzzy_threshold: float = 0.4):
-
-   if len(gt) == 0 or not fuzzy:
-       assert gt == pred, f"{gt}!={pred}"
-   else:
-       dist = levenshtein(gt, pred)
-       diff = dist / len(gt)
-       assert diff < fuzzy_threshold, f"{gt}!~{pred}"
-   return True
-
-
-def verify_cells(doc_pred_pages: List[Page], doc_true_pages: List[Page], fuzzy: bool):
+def verify_cells(doc_pred_pages: List[Page], doc_true_pages: List[Page]):

    assert len(doc_pred_pages) == len(
        doc_true_pages
@ -68,7 +32,8 @@ def verify_cells(doc_pred_pages: List[Page], doc_true_pages: List[Page], fuzzy:
        true_text = cell_true_item.text
        pred_text = cell_pred_item.text
-       verify_text(true_text, pred_text, fuzzy)
+
+       assert true_text == pred_text, f"{true_text}!={pred_text}"

        true_bbox = cell_true_item.bbox.as_tuple()
        pred_bbox = cell_pred_item.bbox.as_tuple()
@ -104,7 +69,7 @@ def verify_maintext(doc_pred: DsDocument, doc_true: DsDocument):
    return True


-def verify_tables(doc_pred: DsDocument, doc_true: DsDocument, fuzzy: bool):
+def verify_tables(doc_pred: DsDocument, doc_true: DsDocument):
    if doc_true.tables is None:
        # No tables to check
        assert doc_pred.tables is None, "not expecting any table on this document"
@ -137,7 +102,9 @@ def verify_tables(doc_pred: DsDocument, doc_true: DsDocument, fuzzy: bool):
            # print("pred: ", pred_item.data[i][j].text)
            # print("")

-           verify_text(true_item.data[i][j].text, pred_item.data[i][j].text, fuzzy)
+           assert (
+               true_item.data[i][j].text == pred_item.data[i][j].text
+           ), "table-cell does not have the same text"

            assert (
                true_item.data[i][j].obj_type == pred_item.data[i][j].obj_type
@ -154,20 +121,16 @@ def verify_output(doc_pred: DsDocument, doc_true: DsDocument):
    return True


-def verify_md(doc_pred_md: str, doc_true_md: str, fuzzy: bool):
-   return verify_text(doc_true_md, doc_pred_md, fuzzy)
+def verify_md(doc_pred_md, doc_true_md):
+   return doc_pred_md == doc_true_md


-def verify_dt(doc_pred_dt: str, doc_true_dt: str, fuzzy: bool):
-   return verify_text(doc_true_dt, doc_pred_dt, fuzzy)
+def verify_dt(doc_pred_dt, doc_true_dt):
+   return doc_pred_dt == doc_true_dt


def verify_conversion_result(
-   input_path: Path,
-   doc_result: ConversionResult,
-   generate=False,
-   ocr_engine=None,
-   fuzzy: bool = False,
+   input_path: Path, doc_result: ConversionResult, generate=False
):
    PageList = TypeAdapter(List[Page])
@ -180,11 +143,10 @@ def verify_conversion_result(
    doc_pred_md = doc_result.render_as_markdown()
    doc_pred_dt = doc_result.render_as_doctags()

-   engine_suffix = "" if ocr_engine is None else f".{ocr_engine}"
-   pages_path = input_path.with_suffix(f"{engine_suffix}.pages.json")
-   json_path = input_path.with_suffix(f"{engine_suffix}.json")
-   md_path = input_path.with_suffix(f"{engine_suffix}.md")
-   dt_path = input_path.with_suffix(f"{engine_suffix}.doctags.txt")
+   pages_path = input_path.with_suffix(".pages.json")
+   json_path = input_path.with_suffix(".json")
+   md_path = input_path.with_suffix(".md")
+   dt_path = input_path.with_suffix(".doctags.txt")

    if generate:  # only used when re-generating truth
        with open(pages_path, "w") as fw:
@ -212,7 +174,7 @@ def verify_conversion_result(
        doc_true_dt = fr.read()

    assert verify_cells(
-       doc_pred_pages, doc_true_pages, fuzzy
+       doc_pred_pages, doc_true_pages
    ), f"Mismatch in PDF cell prediction for {input_path}"

    # assert verify_output(
@ -220,13 +182,13 @@ def verify_conversion_result(
    # ), f"Mismatch in JSON prediction for {input_path}"

    assert verify_tables(
-       doc_pred, doc_true, fuzzy
+       doc_pred, doc_true
    ), f"verify_tables(doc_pred, doc_true) mismatch for {input_path}"

    assert verify_md(
-       doc_pred_md, doc_true_md, fuzzy
+       doc_pred_md, doc_true_md
    ), f"Mismatch in Markdown prediction for {input_path}"

    assert verify_dt(
-       doc_pred_dt, doc_true_dt, fuzzy
+       doc_pred_dt, doc_true_dt
    ), f"Mismatch in DocTags prediction for {input_path}"
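The `generate` flag retained above is what writes the ground-truth files in the first place. Below is a minimal sketch of how the new `tests/data_scanned/ocr_test.*` fixtures could be regenerated, assuming the behaviour of the `if generate:` branch shown above; the import path and the converter setup are assumptions, not the exact workflow used for this commit.

```python
from pathlib import Path

from docling.document_converter import DocumentConverter
from tests.verify_utils import verify_conversion_result  # assumed import path

# Sketch only: with generate=True, verify_conversion_result() rewrites the
# .pages.json / .json / .md / .doctags.txt ground truth next to the input PDF
# instead of asserting against it.
pdf_path = Path("tests/data_scanned/ocr_test.pdf")
doc_result = DocumentConverter().convert_single(pdf_path)  # assumption: defaults
verify_conversion_result(input_path=pdf_path, doc_result=doc_result, generate=True)
```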