docling/tests/data/groundtruth/docling_v2
Cesar Berrospi Ramis 7450050ace
refactor: upgrade BeautifulSoup4 with type hints (#999)
* refactor: upgrade BeautifulSoup4 with type hints

Upgrade dependency library BeautifulSoup4 to 4.13.3 (with type hints).
Refactor backends using BeautifulSoup4 to comply with type hints.
Apply style simplifications and improvements for consistency.
Remove variables and functions that are never used.
Remove code duplication between backends for parsing HTML tables.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* build: allow beautifulsoup4 version 4.12.3

Allow older version of beautifulsoup4 and ensure compatibility.
Update library dependencies.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-02-18 11:30:47 +01:00
2203.01017v2.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
2203.01017v2.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
2203.01017v2.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
2203.01017v2.pages.json test: update results with new docling-core (#839) 2025-01-30 14:07:52 +01:00
2206.01062.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
2206.01062.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
2206.01062.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
2206.01062.pages.json test: update results with new docling-core (#839) 2025-01-30 14:07:52 +01:00
2305.03393v1-pg9.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
2305.03393v1-pg9.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
2305.03393v1-pg9.md feat: Support AsciiDoc and Markdown input format (#168) 2024-10-23 16:14:26 +02:00
2305.03393v1-pg9.pages.json test: update results with new docling-core (#839) 2025-01-30 14:07:52 +01:00
2305.03393v1.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
2305.03393v1.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
2305.03393v1.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
2305.03393v1.pages.json test: update results with new docling-core (#839) 2025-01-30 14:07:52 +01:00
amt_handbook_sample.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
amt_handbook_sample.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
amt_handbook_sample.md docs: Add example for inspection of picture content (#624) 2025-01-29 10:39:00 +01:00
amt_handbook_sample.pages.json test: update results with new docling-core (#839) 2025-01-30 14:07:52 +01:00
blocks.md.md fix(markdown): fix empty block handling (#843) 2025-01-30 16:22:29 +01:00
bmj_sample.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
bmj_sample.xml.json feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
bmj_sample.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
code_and_formula.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
code_and_formula.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
code_and_formula.md fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
code_and_formula.pages.json fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
csv-comma-in-cell.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-comma-in-cell.csv.json feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-comma-in-cell.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-comma.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-comma.csv.json feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-comma.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-inconsistent-header.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-inconsistent-header.csv.json feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-inconsistent-header.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-pipe.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-pipe.csv.json feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-pipe.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-semicolon.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-semicolon.csv.json feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-semicolon.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-tab.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-tab.csv.json feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-tab.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-few-columns.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-few-columns.csv.json feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-few-columns.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-many-columns.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-many-columns.csv.json feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-many-columns.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
duck.md.md fix: fix single newline handling in MD backend (#824) 2025-01-28 19:05:55 +01:00
elife-56337.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
elife-56337.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
ending_with_table.md.md fix(markdown): fix parsing if doc ending with table (#873) 2025-02-03 14:38:38 +01:00
example_01.html.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_01.html.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
example_01.html.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_02.html.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_02.html.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
example_02.html.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_03.html.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_03.html.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
example_03.html.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_04.html.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_04.html.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
example_04.html.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
example_05.html.itxt fix: parse html with omitted body tag (#818) 2025-01-27 16:59:00 +01:00
example_05.html.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
example_05.html.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
ipa20180000016.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
ipa20180000016.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
ipa20180000016.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
ipa20200022300.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
ipa20200022300.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
ipa20200022300.md feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
lorem_ipsum.docx.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
lorem_ipsum.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
lorem_ipsum.docx.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
mixed.md.md fix(markdown): add support for HTML content (#855) 2025-02-03 12:21:05 +01:00
nested.md.md fix(markdown): handle nested lists (#910) 2025-02-07 12:55:12 +01:00
pa20010031492.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pa20010031492.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
pa20010031492.md feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pftaps057006474.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pftaps057006474.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
pftaps057006474.md feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pg06442728.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pg06442728.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
pg06442728.md feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
picture_classification.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
picture_classification.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
picture_classification.md feat: New document picture classifier (#805) 2025-01-24 18:05:51 +01:00
picture_classification.pages.json test: update results with new docling-core (#839) 2025-01-30 14:07:52 +01:00
pnas_sample.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pnas_sample.xml.json feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pnas_sample.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pntd.0008301.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pntd.0008301.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pone.0234687.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pone.0234687.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
powerpoint_sample.pptx.itxt feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
powerpoint_sample.pptx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
powerpoint_sample.pptx.md feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
powerpoint_with_image.pptx.itxt feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
powerpoint_with_image.pptx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
powerpoint_with_image.pptx.md feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
redp5110_sampled.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
redp5110_sampled.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
redp5110_sampled.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
redp5110_sampled.pages.json test: update results with new docling-core (#839) 2025-01-30 14:07:52 +01:00
right_to_left_01.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_01.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_01.md fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_01.pages.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_02.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_02.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_02.md fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_02.pages.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_03.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_03.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_03.md fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_03.pages.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
tablecell.docx.itxt fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
tablecell.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
tablecell.docx.md fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
test_01.asciidoc.md feat: Support AsciiDoc and Markdown input format (#168) 2024-10-23 16:14:26 +02:00
test_02.asciidoc.md feat: Support AsciiDoc and Markdown input format (#168) 2024-10-23 16:14:26 +02:00
test_emf_docx.docx.itxt fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
test_emf_docx.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
test_emf_docx.docx.md fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
test-01.xlsx.itxt fix: added extraction of byte-images in excel (#804) 2025-01-24 18:48:02 +01:00
test-01.xlsx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
test-01.xlsx.md fix: added extraction of byte-images in excel (#804) 2025-01-24 18:48:02 +01:00
unit_test_01.html.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_01.html.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
unit_test_01.html.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_headers_numbered.docx.itxt fix: Fixed docx import with headers that are also lists (#842) 2025-01-31 10:51:21 +01:00
unit_test_headers_numbered.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
unit_test_headers_numbered.docx.md fix: Fixed docx import with headers that are also lists (#842) 2025-01-31 10:51:21 +01:00
unit_test_headers.docx.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_headers.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
unit_test_headers.docx.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_lists.docx.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_lists.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
unit_test_lists.docx.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
wiki_duck.html.itxt refactor: upgrade BeautifulSoup4 with type hints (#999) 2025-02-18 11:30:47 +01:00
wiki_duck.html.json refactor: upgrade BeautifulSoup4 with type hints (#999) 2025-02-18 11:30:47 +01:00
wiki_duck.html.md refactor: upgrade BeautifulSoup4 with type hints (#999) 2025-02-18 11:30:47 +01:00
wiki.md.md fix: fix single newline handling in MD backend (#824) 2025-01-28 19:05:55 +01:00
word_sample.docx.itxt fix: Fixing images in the input Word files (#330) 2024-11-14 13:33:34 +01:00
word_sample.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
word_sample.docx.md fix: Fixing images in the input Word files (#330) 2024-11-14 13:33:34 +01:00
word_sample.json fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
word_sample.md fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
word_sample.yaml fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
word_tables.docx.html fix(markdown): add support for HTML content (#855) 2025-02-03 12:21:05 +01:00
word_tables.docx.itxt fix(docx): merged table cells not properly converted (#857) 2025-02-03 10:20:03 +01:00
word_tables.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
word_tables.docx.md fix(docx): merged table cells not properly converted (#857) 2025-02-03 10:20:03 +01:00