2203.01017v2.doctags.txt
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
2203.01017v2.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
2203.01017v2.md
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
2203.01017v2.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
2206.01062.doctags.txt
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
2206.01062.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
2206.01062.md
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
2206.01062.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
2305.03393v1-pg9.doctags.txt
fix: Revise DocTags, fix iterate_items to output content_layer in items ( #965 )
2025-02-17 14:11:55 +01:00
2305.03393v1-pg9.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
2305.03393v1-pg9.md
feat: Support AsciiDoc and Markdown input format ( #168 )
2024-10-23 16:14:26 +02:00
2305.03393v1-pg9.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
2305.03393v1.doctags.txt
fix: Revise DocTags, fix iterate_items to output content_layer in items ( #965 )
2025-02-17 14:11:55 +01:00
2305.03393v1.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
2305.03393v1.md
feat: Expose equation exports ( #869 )
2025-02-03 10:31:19 +01:00
2305.03393v1.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
amt_handbook_sample.doctags.txt
fix: Revise DocTags, fix iterate_items to output content_layer in items ( #965 )
2025-02-17 14:11:55 +01:00
amt_handbook_sample.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
amt_handbook_sample.md
docs: Add example for inspection of picture content ( #624 )
2025-01-29 10:39:00 +01:00
amt_handbook_sample.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
blocks.md.md
fix(markdown): fix empty block handling ( #843 )
2025-01-30 16:22:29 +01:00
bmj_sample.xml.itxt
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
bmj_sample.xml.json
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
bmj_sample.xml.md
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
code_and_formula.doctags.txt
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
code_and_formula.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
code_and_formula.md
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
code_and_formula.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
csv-comma-in-cell.csv.itxt
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-comma-in-cell.csv.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
csv-comma-in-cell.csv.md
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-comma.csv.itxt
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-comma.csv.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
csv-comma.csv.md
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-inconsistent-header.csv.itxt
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-inconsistent-header.csv.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
csv-inconsistent-header.csv.md
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-pipe.csv.itxt
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-pipe.csv.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
csv-pipe.csv.md
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-semicolon.csv.itxt
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-semicolon.csv.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
csv-semicolon.csv.md
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-tab.csv.itxt
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-tab.csv.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
csv-tab.csv.md
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-too-few-columns.csv.itxt
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-too-few-columns.csv.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
csv-too-few-columns.csv.md
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-too-many-columns.csv.itxt
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
csv-too-many-columns.csv.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
csv-too-many-columns.csv.md
feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument ( #945 )
2025-02-14 08:55:09 +01:00
duck.md.md
fix: fix single newline handling in MD backend ( #824 )
2025-01-28 19:05:55 +01:00
elife-56337.xml.itxt
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
elife-56337.xml.md
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
ending_with_table.md.md
fix(markdown): fix parsing if doc ending with table ( #873 )
2025-02-03 14:38:38 +01:00
example_01.html.itxt
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
example_01.html.json
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
example_01.html.md
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
example_02.html.itxt
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
example_02.html.json
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
example_02.html.md
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
example_03.html.itxt
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
example_03.html.json
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
example_03.html.md
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
example_04.html.itxt
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
example_04.html.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
example_04.html.md
feat: Expose equation exports ( #869 )
2025-02-03 10:31:19 +01:00
example_05.html.itxt
fix: parse html with omitted body tag ( #818 )
2025-01-27 16:59:00 +01:00
example_05.html.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
example_05.html.md
feat: Expose equation exports ( #869 )
2025-02-03 10:31:19 +01:00
example_06.html.itxt
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
example_06.html.json
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
example_06.html.md
fix(html): Parse text in div elements as TextItem ( #1041 )
2025-02-24 12:38:29 +01:00
ipa20180000016.itxt
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
ipa20180000016.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
ipa20180000016.md
feat: Expose equation exports ( #869 )
2025-02-03 10:31:19 +01:00
ipa20200022300.itxt
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
ipa20200022300.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
ipa20200022300.md
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
lorem_ipsum.docx.itxt
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
lorem_ipsum.docx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
lorem_ipsum.docx.md
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
mixed.md.md
fix(markdown): add support for HTML content ( #855 )
2025-02-03 12:21:05 +01:00
nested.md.md
fix(markdown): handle nested lists ( #910 )
2025-02-07 12:55:12 +01:00
pa20010031492.itxt
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
pa20010031492.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
pa20010031492.md
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
pftaps057006474.itxt
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
pftaps057006474.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
pftaps057006474.md
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
pg06442728.itxt
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
pg06442728.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
pg06442728.md
feat: create a backend to parse USPTO patents into DoclingDocument ( #606 )
2024-12-17 16:35:23 +01:00
picture_classification.doctags.txt
fix: Revise DocTags, fix iterate_items to output content_layer in items ( #965 )
2025-02-17 14:11:55 +01:00
picture_classification.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
picture_classification.md
feat: New document picture classifier ( #805 )
2025-01-24 18:05:51 +01:00
picture_classification.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
pnas_sample.xml.itxt
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
pnas_sample.xml.json
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
pnas_sample.xml.md
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
pntd.0008301.xml.itxt
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
pntd.0008301.xml.md
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
pone.0234687.xml.itxt
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
pone.0234687.xml.md
feat(xml-jats): parse XML JATS documents ( #967 )
2025-02-17 10:43:31 +01:00
powerpoint_sample.pptx.itxt
feat: Extracting picture data for raster images found in PPTX ( #349 )
2024-11-18 15:22:28 +01:00
powerpoint_sample.pptx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
powerpoint_sample.pptx.md
feat: Extracting picture data for raster images found in PPTX ( #349 )
2024-11-18 15:22:28 +01:00
powerpoint_with_image.pptx.itxt
feat: Extracting picture data for raster images found in PPTX ( #349 )
2024-11-18 15:22:28 +01:00
powerpoint_with_image.pptx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
powerpoint_with_image.pptx.md
feat: Extracting picture data for raster images found in PPTX ( #349 )
2024-11-18 15:22:28 +01:00
redp5110_sampled.doctags.txt
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
redp5110_sampled.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
redp5110_sampled.md
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
redp5110_sampled.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
right_to_left_01.doctags.txt
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
right_to_left_01.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
right_to_left_01.md
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
right_to_left_01.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
right_to_left_02.doctags.txt
fix: Revise DocTags, fix iterate_items to output content_layer in items ( #965 )
2025-02-17 14:11:55 +01:00
right_to_left_02.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
right_to_left_02.md
fix: Test cases for RTL programmatic PDFs and fixes for the formula model ( #903 )
2025-02-07 08:43:31 +01:00
right_to_left_02.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
right_to_left_03.doctags.txt
fix: Revise DocTags, fix iterate_items to output content_layer in items ( #965 )
2025-02-17 14:11:55 +01:00
right_to_left_03.json
feat: Implement new reading-order model ( #916 )
2025-02-20 17:51:17 +01:00
right_to_left_03.md
fix: Test cases for RTL programmatic PDFs and fixes for the formula model ( #903 )
2025-02-07 08:43:31 +01:00
right_to_left_03.pages.json
fix: Proper handling of orphan IDs in layout postprocessing ( #1118 )
2025-03-05 14:30:59 +01:00
tablecell.docx.itxt
fix: Fixes for wordx ( #432 )
2024-11-26 14:44:43 +01:00
tablecell.docx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
tablecell.docx.md
fix: Fixes for wordx ( #432 )
2024-11-26 14:44:43 +01:00
test_01.asciidoc.md
feat: Support AsciiDoc and Markdown input format ( #168 )
2024-10-23 16:14:26 +02:00
test_02.asciidoc.md
feat: Support AsciiDoc and Markdown input format ( #168 )
2024-10-23 16:14:26 +02:00
test_emf_docx.docx.itxt
fix: Fixes for wordx ( #432 )
2024-11-26 14:44:43 +01:00
test_emf_docx.docx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
test_emf_docx.docx.md
fix: Fixes for wordx ( #432 )
2024-11-26 14:44:43 +01:00
test-01.xlsx.itxt
fix: added extraction of byte-images in excel ( #804 )
2025-01-24 18:48:02 +01:00
test-01.xlsx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
test-01.xlsx.md
fix: added extraction of byte-images in excel ( #804 )
2025-01-24 18:48:02 +01:00
unit_test_01.html.itxt
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
unit_test_01.html.json
chore: Update tests and lockfile ( #1021 )
2025-02-19 16:51:53 +01:00
unit_test_01.html.md
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
unit_test_headers_numbered.docx.itxt
fix: Fixed docx import with headers that are also lists ( #842 )
2025-01-31 10:51:21 +01:00
unit_test_headers_numbered.docx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
unit_test_headers_numbered.docx.md
fix: Fixed docx import with headers that are also lists ( #842 )
2025-01-31 10:51:21 +01:00
unit_test_headers.docx.itxt
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
unit_test_headers.docx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
unit_test_headers.docx.md
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
unit_test_lists.docx.itxt
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
unit_test_lists.docx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
unit_test_lists.docx.md
fix: fix duplicate title and heading + add e2e tests for html and docx ( #186 )
2024-10-30 13:14:56 +01:00
wiki_duck.html.itxt
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
wiki_duck.html.json
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
wiki_duck.html.md
refactor: add the contentlayer to html-backend ( #1040 )
2025-03-02 10:37:53 -05:00
wiki.md.md
fix: fix single newline handling in MD backend ( #824 )
2025-01-28 19:05:55 +01:00
word_sample.docx.itxt
fix: Fixing images in the input Word files ( #330 )
2024-11-14 13:33:34 +01:00
word_sample.docx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
word_sample.docx.md
fix: Fixing images in the input Word files ( #330 )
2024-11-14 13:33:34 +01:00
word_sample.json
fix: Fixes for wordx ( #432 )
2024-11-26 14:44:43 +01:00
word_sample.md
fix: Fixes for wordx ( #432 )
2024-11-26 14:44:43 +01:00
word_sample.yaml
fix: Fixes for wordx ( #432 )
2024-11-26 14:44:43 +01:00
word_tables.docx.html
fix(markdown): add support for HTML content ( #855 )
2025-02-03 12:21:05 +01:00
word_tables.docx.itxt
fix(docx): merged table cells not properly converted ( #857 )
2025-02-03 10:20:03 +01:00
word_tables.docx.json
feat: Add content_layer property to items to address body, furniture and other roles ( #735 )
2025-02-10 12:07:49 +01:00
word_tables.docx.md
fix(docx): merged table cells not properly converted ( #857 )
2025-02-03 10:20:03 +01:00