docling

2203.01017v2.doctags.txt

fix: prov for merged-elems (#1728)

2025-06-10 11:22:42 +02:00

2203.01017v2.json

fix: prov for merged-elems (#1728)

2025-06-10 11:22:42 +02:00

2203.01017v2.md

chore: update locked deps (#1239)

2025-03-25 15:48:02 +01:00

2203.01017v2.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

2206.01062.doctags.txt

fix: prov for merged-elems (#1728)

2025-06-10 11:22:42 +02:00

2206.01062.json

fix: prov for merged-elems (#1728)

2025-06-10 11:22:42 +02:00

2206.01062.md

chore: update locked deps (#1239)

2025-03-25 15:48:02 +01:00

2206.01062.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

2305.03393v1-pg9.doctags.txt

feat: Use new TableFormer model weights and default to accurate model version (#1100)

2025-03-11 10:53:49 +01:00

2305.03393v1-pg9.json

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

2305.03393v1-pg9.md

feat: Use new TableFormer model weights and default to accurate model version (#1100)

2025-03-11 10:53:49 +01:00

2305.03393v1-pg9.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

2305.03393v1.doctags.txt

feat: Use new TableFormer model weights and default to accurate model version (#1100)

2025-03-11 10:53:49 +01:00

2305.03393v1.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

2305.03393v1.md

feat: Use new TableFormer model weights and default to accurate model version (#1100)

2025-03-11 10:53:49 +01:00

2305.03393v1.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

amt_handbook_sample.doctags.txt

fix: Revise DocTags, fix iterate_items to output content_layer in items (#965)

2025-02-17 14:11:55 +01:00

amt_handbook_sample.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

amt_handbook_sample.md

docs: Add example for inspection of picture content (#624)

2025-01-29 10:39:00 +01:00

amt_handbook_sample.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

blocks.md.md

fix: Pass tests, update docling-core to 2.22.0 (#1150)

2025-03-13 09:45:55 +01:00

bmj_sample.xml.itxt

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

bmj_sample.xml.json

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

bmj_sample.xml.md

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

code_and_formula.doctags.txt

chore: update locked deps (#1239)

2025-03-25 15:48:02 +01:00

code_and_formula.json

chore: format JSON test files to enable comparison (#1511)

2025-05-02 10:52:18 +02:00

code_and_formula.md

chore: update locked deps (#1239)

2025-03-25 15:48:02 +01:00

code_and_formula.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

csv-comma-in-cell.csv.itxt

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-comma-in-cell.csv.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

csv-comma-in-cell.csv.md

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-comma.csv.itxt

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-comma.csv.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

csv-comma.csv.md

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-inconsistent-header.csv.itxt

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-inconsistent-header.csv.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

csv-inconsistent-header.csv.md

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-pipe.csv.itxt

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-pipe.csv.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

csv-pipe.csv.md

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-semicolon.csv.itxt

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-semicolon.csv.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

csv-semicolon.csv.md

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-tab.csv.itxt

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-tab.csv.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

csv-tab.csv.md

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-too-few-columns.csv.itxt

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-too-few-columns.csv.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

csv-too-few-columns.csv.md

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-too-many-columns.csv.itxt

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

csv-too-many-columns.csv.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

csv-too-many-columns.csv.md

feat: Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945)

2025-02-14 08:55:09 +01:00

duck.md.md

fix: fix single newline handling in MD backend (#824)

2025-01-28 19:05:55 +01:00

elife-56337.xml.itxt

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

elife-56337.xml.md

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

ending_with_table.md.md

fix(markdown): fix parsing if doc ending with table (#873)

2025-02-03 14:38:38 +01:00

equations.docx.itxt

fix(docx): Adding new latex symbols, simplifying how equations are added to text (#1295)

2025-04-08 17:11:37 +02:00

equations.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

equations.docx.md

fix(docx): Adding new latex symbols, simplifying how equations are added to text (#1295)

2025-04-08 17:11:37 +02:00

example_8.html.itxt

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

example_8.html.json

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

example_8.html.md

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

example_01.html.itxt

refactor: add the contentlayer to html-backend (#1040)

2025-03-02 10:37:53 -05:00

example_01.html.json

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

example_01.html.md

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

example_02.html.itxt

refactor: add the contentlayer to html-backend (#1040)

2025-03-02 10:37:53 -05:00

example_02.html.json

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

example_02.html.md

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

example_03.html.itxt

refactor: add the contentlayer to html-backend (#1040)

2025-03-02 10:37:53 -05:00

example_03.html.json

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

example_03.html.md

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

example_04.html.itxt

fix: fix duplicate title and heading + add e2e tests for html and docx (#186)

2024-10-30 13:14:56 +01:00

example_04.html.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

example_04.html.md

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

example_05.html.itxt

fix: parse html with omitted body tag (#818)

2025-01-27 16:59:00 +01:00

example_05.html.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

example_05.html.md

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

example_06.html.itxt

fix(html): handle address, details, and summary tags (#1436)

2025-04-23 09:30:59 +02:00

example_06.html.json

fix(html): handle address, details, and summary tags (#1436)

2025-04-23 09:30:59 +02:00

example_06.html.md

fix(html): handle address, details, and summary tags (#1436)

2025-04-23 09:30:59 +02:00

example_07.html.itxt

fix(html): handle nested empty lists (#1154)

2025-03-13 16:56:58 +01:00

example_07.html.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

example_07.html.md

fix(html): handle nested empty lists (#1154)

2025-03-13 16:56:58 +01:00

example_08.html.itxt

test: add missing ground truth files (#1667)

2025-05-28 13:26:49 +02:00

example_08.html.json

test: add missing ground truth files (#1667)

2025-05-28 13:26:49 +02:00

example_08.html.md

test: add missing ground truth files (#1667)

2025-05-28 13:26:49 +02:00

ipa20180000016.itxt

feat: create a backend to parse USPTO patents into DoclingDocument (#606)

2024-12-17 16:35:23 +01:00

ipa20180000016.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

ipa20180000016.md

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

ipa20200022300.itxt

feat: create a backend to parse USPTO patents into DoclingDocument (#606)

2024-12-17 16:35:23 +01:00

ipa20200022300.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

ipa20200022300.md

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

lorem_ipsum.docx.itxt

fix: fix duplicate title and heading + add e2e tests for html and docx (#186)

2024-10-30 13:14:56 +01:00

lorem_ipsum.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

lorem_ipsum.docx.md

fix: fix duplicate title and heading + add e2e tests for html and docx (#186)

2024-10-30 13:14:56 +01:00

mixed_without_h1.md.md

fix: improve HTML layer detection, various MD fixes (#1241)

2025-03-26 16:07:14 +01:00

mixed.md.md

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

multi_page.doctags.txt

fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549)

2025-05-19 15:26:00 +02:00

multi_page.json

fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549)

2025-05-19 15:26:00 +02:00

multi_page.md

fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549)

2025-05-19 15:26:00 +02:00

multi_page.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

nested.md.md

fix(markdown): handle nested lists (#910)

2025-02-07 12:55:12 +01:00

pa20010031492.itxt

feat: create a backend to parse USPTO patents into DoclingDocument (#606)

2024-12-17 16:35:23 +01:00

pa20010031492.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

pa20010031492.md

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

pftaps057006474.itxt

fix: Pass tests, update docling-core to 2.22.0 (#1150)

2025-03-13 09:45:55 +01:00

pftaps057006474.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

pftaps057006474.md

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

pg06442728.itxt

feat: create a backend to parse USPTO patents into DoclingDocument (#606)

2024-12-17 16:35:23 +01:00

pg06442728.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

pg06442728.md

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

picture_classification.doctags.txt

fix: Revise DocTags, fix iterate_items to output content_layer in items (#965)

2025-02-17 14:11:55 +01:00

picture_classification.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

picture_classification.md

feat: New document picture classifier (#805)

2025-01-24 18:05:51 +01:00

picture_classification.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

pnas_sample.xml.itxt

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

pnas_sample.xml.json

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

pnas_sample.xml.md

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

pntd.0008301.xml.itxt

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

pntd.0008301.xml.md

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

pone.0234687.xml.itxt

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

pone.0234687.xml.md

feat(xml-jats): parse XML JATS documents (#967)

2025-02-17 10:43:31 +01:00

powerpoint_sample.pptx.itxt

feat: Extracting picture data for raster images found in PPTX (#349)

2024-11-18 15:22:28 +01:00

powerpoint_sample.pptx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

powerpoint_sample.pptx.md

feat: Extracting picture data for raster images found in PPTX (#349)

2024-11-18 15:22:28 +01:00

powerpoint_with_image.pptx.itxt

feat: Extracting picture data for raster images found in PPTX (#349)

2024-11-18 15:22:28 +01:00

powerpoint_with_image.pptx.json

feat: Add DoclingParseV4 backend, using high-level docling-parse API (#905)

2025-03-18 10:38:19 +01:00

powerpoint_with_image.pptx.md

feat: Extracting picture data for raster images found in PPTX (#349)

2024-11-18 15:22:28 +01:00

redp5110_sampled.doctags.txt

chore: propagate docling-core fix (#1389)

2025-04-15 10:51:47 +02:00

redp5110_sampled.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

redp5110_sampled.md

chore: update locked deps (#1239)

2025-03-25 15:48:02 +01:00

redp5110_sampled.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

right_to_left_01.doctags.txt

feat: Implement new reading-order model (#916)

2025-02-20 17:51:17 +01:00

right_to_left_01.json

chore: format JSON test files to enable comparison (#1511)

2025-05-02 10:52:18 +02:00

right_to_left_01.md

feat: Implement new reading-order model (#916)

2025-02-20 17:51:17 +01:00

right_to_left_01.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

right_to_left_02.doctags.txt

chore: update locked deps (#1239)

2025-03-25 15:48:02 +01:00

right_to_left_02.json

chore: format JSON test files to enable comparison (#1511)

2025-05-02 10:52:18 +02:00

right_to_left_02.md

fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)

2025-02-07 08:43:31 +01:00

right_to_left_02.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

right_to_left_03.doctags.txt

fix: Revise DocTags, fix iterate_items to output content_layer in items (#965)

2025-02-17 14:11:55 +01:00

right_to_left_03.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

right_to_left_03.md

fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903)

2025-02-07 08:43:31 +01:00

right_to_left_03.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

sample_sales_data.xlsm.itxt

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

sample_sales_data.xlsm.json

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

sample_sales_data.xlsm.md

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

tablecell.docx.itxt

fix: Fixes for wordx (#432)

2024-11-26 14:44:43 +01:00

tablecell.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

tablecell.docx.md

fix: Fixes for wordx (#432)

2024-11-26 14:44:43 +01:00

test_01.asciidoc.md

feat: Support AsciiDoc and Markdown input format (#168)

2024-10-23 16:14:26 +02:00

test_02.asciidoc.md

feat: Support AsciiDoc and Markdown input format (#168)

2024-10-23 16:14:26 +02:00

test_03.asciidoc.md

fix(asciidoc): set default size when missing in image directive (#1769)

2025-06-16 10:38:46 +02:00

test_emf_docx.docx.itxt

fix: Fixes for wordx (#432)

2024-11-26 14:44:43 +01:00

test_emf_docx.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

test_emf_docx.docx.md

fix: Fixes for wordx (#432)

2024-11-26 14:44:43 +01:00

test-01.xlsx.itxt

fix: added extraction of byte-images in excel (#804)

2025-01-24 18:48:02 +01:00

test-01.xlsx.json

feat(xlsx): create a page for each worksheet in XLSX backend (#1332)

2025-04-11 10:29:53 +02:00

test-01.xlsx.md

fix: added extraction of byte-images in excel (#804)

2025-01-24 18:48:02 +01:00

textbox.docx.itxt

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

textbox.docx.json

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

textbox.docx.md

feat: support xlsm files (#1520)

2025-06-10 16:55:59 +02:00

unit_test_01.html.itxt

fix: fix duplicate title and heading + add e2e tests for html and docx (#186)

2024-10-30 13:14:56 +01:00

unit_test_01.html.json

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

unit_test_01.html.md

fix(html): fix HTML parsed heading level (#1244)

2025-03-26 10:30:23 +01:00

unit_test_formatting.docx.itxt

feat(docx): add text formatting and hyperlink support (#630)

2025-04-03 15:11:50 +02:00

unit_test_formatting.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

unit_test_formatting.docx.md

feat(docx): add text formatting and hyperlink support (#630)

2025-04-03 15:11:50 +02:00

unit_test_headers_numbered.docx.itxt

fix(docx): identifying numbered headers (#1231)

2025-03-25 11:41:02 +01:00

unit_test_headers_numbered.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

unit_test_headers_numbered.docx.md

fix(docx): identifying numbered headers (#1231)

2025-03-25 11:41:02 +01:00

unit_test_headers.docx.itxt

fix: fix duplicate title and heading + add e2e tests for html and docx (#186)

2024-10-30 13:14:56 +01:00

unit_test_headers.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

unit_test_headers.docx.md

fix: fix duplicate title and heading + add e2e tests for html and docx (#186)

2024-10-30 13:14:56 +01:00

unit_test_lists.docx.itxt

fix: fix duplicate title and heading + add e2e tests for html and docx (#186)

2024-10-30 13:14:56 +01:00

unit_test_lists.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

unit_test_lists.docx.md

fix: fix duplicate title and heading + add e2e tests for html and docx (#186)

2024-10-30 13:14:56 +01:00

wiki_duck.html.itxt

fix: improve HTML layer detection, various MD fixes (#1241)

2025-03-26 16:07:14 +01:00

wiki_duck.html.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

wiki_duck.html.md

fix: improve HTML layer detection, various MD fixes (#1241)

2025-03-26 16:07:14 +01:00

wiki.md.md

fix: fix single newline handling in MD backend (#824)

2025-01-28 19:05:55 +01:00

word_sample.docx.itxt

fix: Fixing images in the input Word files (#330)

2024-11-14 13:33:34 +01:00

word_sample.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

word_sample.docx.md

fix: Fixing images in the input Word files (#330)

2024-11-14 13:33:34 +01:00

word_sample.json

fix: Fixes for wordx (#432)

2024-11-26 14:44:43 +01:00

word_sample.md

fix: Fixes for wordx (#432)

2024-11-26 14:44:43 +01:00

word_sample.yaml

fix: Fixes for wordx (#432)

2024-11-26 14:44:43 +01:00

word_tables.docx.html

feat(cli): add option for html with split-page mode (#1355)

2025-04-14 08:41:50 +02:00

word_tables.docx.itxt

fix(docx): merged table cells not properly converted (#857)

2025-02-03 10:20:03 +01:00

word_tables.docx.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

word_tables.docx.md

fix(docx): merged table cells not properly converted (#857)

2025-02-03 10:20:03 +01:00