docling/tests/data_scanned/groundtruth/docling_v1 at c5fb353f109dfe79b51c201ebb1ff33fceeae34a - docling - Zorio's Git

mirrors/docling

mirror of https://github.com/DS4SD/docling.git synced 2025-12-13 15:18:30 +00:00

Christoph Auer 7d3302cb48 feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make page.parsed_page the only source of truth for text cells

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Correctly compute PDF boxes from pymupdf

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use different OCR engine order

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add type hints and fix mypy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* One more test fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove with pypdfium2_lock from caller sites

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix typing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

2025-06-13 19:01:55 +02:00

ocr_test_rotated_90.doctags.txt

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated_90.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated_90.md

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated_90.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

ocr_test_rotated_180.doctags.txt

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated_180.json

fix(tesseract): initialize df_osd to avoid uninitialized variable error (#1718)

2025-06-10 10:57:45 +02:00

ocr_test_rotated_180.md

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated_180.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

ocr_test_rotated_270.doctags.txt

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated_270.json

fix(tesseract): initialize df_osd to avoid uninitialized variable error (#1718)

2025-06-10 10:57:45 +02:00

ocr_test_rotated_270.md

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated_270.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00

ocr_test_rotated.doctags.txt

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated.md

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test_rotated.pages.json

feat(ocr): auto-detect rotated pages in Tesseract (#1167)

2025-05-21 18:12:33 +02:00

ocr_test.doctags.txt

feat!: Docling v2 (#117)

2024-10-16 21:02:03 +02:00

ocr_test.json

fix(tesseract): initialize df_osd to avoid uninitialized variable error (#1718)

2025-06-10 10:57:45 +02:00

ocr_test.md

feat!: Docling v2 (#117)

2024-10-16 21:02:03 +02:00

ocr_test.pages.json

feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

2025-06-13 19:01:55 +02:00