docling/tests/data/groundtruth/docling_v2
Peter W. J. Staar e25d557c06
refactor: add the contentlayer to html-backend (#1040)
* added the contentlayer to html-backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the handle_image function

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code of html backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* test(html): add more info if a test case fails

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* refactor(html): put parsed item in body if doc has no header

If an HTML document does not have any header tag, all parsed items are placed in
the DoclingDocument's body content layer.
HTML paragraphs ('p' tags) are parsed as text items with paragraph label.
Update test ground truth according to the changes above.

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: set TextItem label to 'text' instead of 'paragraph'

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-03-02 10:37:53 -05:00
2203.01017v2.doctags.txt feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2203.01017v2.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2203.01017v2.md feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2203.01017v2.pages.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2206.01062.doctags.txt feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2206.01062.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2206.01062.md feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2206.01062.pages.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2305.03393v1-pg9.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
2305.03393v1-pg9.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2305.03393v1-pg9.md feat: Support AsciiDoc and Markdown input format (#168) 2024-10-23 16:14:26 +02:00
2305.03393v1-pg9.pages.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2305.03393v1.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
2305.03393v1.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
2305.03393v1.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
2305.03393v1.pages.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
amt_handbook_sample.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
amt_handbook_sample.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
amt_handbook_sample.md docs: Add example for inspection of picture content (#624) 2025-01-29 10:39:00 +01:00
amt_handbook_sample.pages.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
blocks.md.md fix(markdown): fix empty block handling (#843) 2025-01-30 16:22:29 +01:00
bmj_sample.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
bmj_sample.xml.json feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
bmj_sample.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
code_and_formula.doctags.txt feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
code_and_formula.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
code_and_formula.md feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
code_and_formula.pages.json fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
csv-comma-in-cell.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-comma-in-cell.csv.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
csv-comma-in-cell.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-comma.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-comma.csv.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
csv-comma.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-inconsistent-header.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-inconsistent-header.csv.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
csv-inconsistent-header.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-pipe.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-pipe.csv.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
csv-pipe.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-semicolon.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-semicolon.csv.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
csv-semicolon.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-tab.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-tab.csv.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
csv-tab.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-few-columns.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-few-columns.csv.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
csv-too-few-columns.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-many-columns.csv.itxt feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
csv-too-many-columns.csv.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
csv-too-many-columns.csv.md feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) 2025-02-14 08:55:09 +01:00
duck.md.md fix: fix single newline handling in MD backend (#824) 2025-01-28 19:05:55 +01:00
elife-56337.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
elife-56337.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
ending_with_table.md.md fix(markdown): fix parsing if doc ending with table (#873) 2025-02-03 14:38:38 +01:00
example_01.html.itxt refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
example_01.html.json refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
example_01.html.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_02.html.itxt refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
example_02.html.json refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
example_02.html.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_03.html.itxt refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
example_03.html.json refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
example_03.html.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_04.html.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
example_04.html.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
example_04.html.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
example_05.html.itxt fix: parse html with omitted body tag (#818) 2025-01-27 16:59:00 +01:00
example_05.html.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
example_05.html.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
example_06.html.itxt refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
example_06.html.json refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
example_06.html.md fix(html): Parse text in div elements as TextItem (#1041) 2025-02-24 12:38:29 +01:00
ipa20180000016.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
ipa20180000016.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
ipa20180000016.md feat: Expose equation exports (#869) 2025-02-03 10:31:19 +01:00
ipa20200022300.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
ipa20200022300.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
ipa20200022300.md feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
lorem_ipsum.docx.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
lorem_ipsum.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
lorem_ipsum.docx.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
mixed.md.md fix(markdown): add support for HTML content (#855) 2025-02-03 12:21:05 +01:00
nested.md.md fix(markdown): handle nested lists (#910) 2025-02-07 12:55:12 +01:00
pa20010031492.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pa20010031492.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
pa20010031492.md feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pftaps057006474.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pftaps057006474.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
pftaps057006474.md feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pg06442728.itxt feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
pg06442728.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
pg06442728.md feat: create a backend to parse USPTO patents into DoclingDocument (#606) 2024-12-17 16:35:23 +01:00
picture_classification.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
picture_classification.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
picture_classification.md feat: New document picture classifier (#805) 2025-01-24 18:05:51 +01:00
picture_classification.pages.json test: update results with new docling-core (#839) 2025-01-30 14:07:52 +01:00
pnas_sample.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pnas_sample.xml.json feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pnas_sample.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pntd.0008301.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pntd.0008301.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pone.0234687.xml.itxt feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
pone.0234687.xml.md feat(xml-jats): parse XML JATS documents (#967) 2025-02-17 10:43:31 +01:00
powerpoint_sample.pptx.itxt feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
powerpoint_sample.pptx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
powerpoint_sample.pptx.md feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
powerpoint_with_image.pptx.itxt feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
powerpoint_with_image.pptx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
powerpoint_with_image.pptx.md feat: Extracting picture data for raster images found in PPTX (#349) 2024-11-18 15:22:28 +01:00
redp5110_sampled.doctags.txt feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
redp5110_sampled.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
redp5110_sampled.md feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
redp5110_sampled.pages.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
right_to_left_01.doctags.txt feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
right_to_left_01.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
right_to_left_01.md feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
right_to_left_01.pages.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_02.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_02.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
right_to_left_02.md fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_02.pages.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_03.doctags.txt fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
right_to_left_03.json feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
right_to_left_03.md fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_03.pages.json fix: Revise DocTags, fix iterate_items to output content_layer in items (#965) 2025-02-17 14:11:55 +01:00
tablecell.docx.itxt fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
tablecell.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
tablecell.docx.md fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
test_01.asciidoc.md feat: Support AsciiDoc and Markdown input format (#168) 2024-10-23 16:14:26 +02:00
test_02.asciidoc.md feat: Support AsciiDoc and Markdown input format (#168) 2024-10-23 16:14:26 +02:00
test_emf_docx.docx.itxt fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
test_emf_docx.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
test_emf_docx.docx.md fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
test-01.xlsx.itxt fix: added extraction of byte-images in excel (#804) 2025-01-24 18:48:02 +01:00
test-01.xlsx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
test-01.xlsx.md fix: added extraction of byte-images in excel (#804) 2025-01-24 18:48:02 +01:00
unit_test_01.html.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_01.html.json chore: Update tests and lockfile (#1021) 2025-02-19 16:51:53 +01:00
unit_test_01.html.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_headers_numbered.docx.itxt fix: Fixed docx import with headers that are also lists (#842) 2025-01-31 10:51:21 +01:00
unit_test_headers_numbered.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
unit_test_headers_numbered.docx.md fix: Fixed docx import with headers that are also lists (#842) 2025-01-31 10:51:21 +01:00
unit_test_headers.docx.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_headers.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
unit_test_headers.docx.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_lists.docx.itxt fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
unit_test_lists.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
unit_test_lists.docx.md fix: fix duplicate title and heading + add e2e tests for html and docx (#186) 2024-10-30 13:14:56 +01:00
wiki_duck.html.itxt refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
wiki_duck.html.json refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
wiki_duck.html.md refactor: add the contentlayer to html-backend (#1040) 2025-03-02 10:37:53 -05:00
wiki.md.md fix: fix single newline handling in MD backend (#824) 2025-01-28 19:05:55 +01:00
word_sample.docx.itxt fix: Fixing images in the input Word files (#330) 2024-11-14 13:33:34 +01:00
word_sample.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
word_sample.docx.md fix: Fixing images in the input Word files (#330) 2024-11-14 13:33:34 +01:00
word_sample.json fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
word_sample.md fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
word_sample.yaml fix: Fixes for wordx (#432) 2024-11-26 14:44:43 +01:00
word_tables.docx.html fix(markdown): add support for HTML content (#855) 2025-02-03 12:21:05 +01:00
word_tables.docx.itxt fix(docx): merged table cells not properly converted (#857) 2025-02-03 10:20:03 +01:00
word_tables.docx.json feat: Add content_layer property to items to address body, furniture and other roles (#735) 2025-02-10 12:07:49 +01:00
word_tables.docx.md fix(docx): merged table cells not properly converted (#857) 2025-02-03 10:20:03 +01:00