docling/tests/data/groundtruth/docling_v1
Clément Doumouro 45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract (#1167)
* fix(ocr): support mis-oriented documents in Tesseract

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid swallowing tesseract errors that cause orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
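
The commit notes above mention a bounding-box rotation utility and a fallback to OCR without rotation when orientation detection (OSD) fails. The following is only a minimal sketch of those two ideas, not the code from this commit: the names `rotate_bbox` and `detect_angle_or_zero` are hypothetical, angles are assumed to be multiples of 90 degrees counter-clockwise, and boxes are assumed to be `(left, top, right, bottom)` with a top-left origin.

```python
import logging
from typing import Callable, Tuple

# Assumed box layout: (left, top, right, bottom), origin at the top-left corner.
BBox = Tuple[float, float, float, float]

_log = logging.getLogger(__name__)


def rotate_bbox(bbox: BBox, angle: int, width: float, height: float) -> BBox:
    """Map a box from an image of size (width, height) into the coordinates
    of that image rotated counter-clockwise by `angle` degrees (hypothetical helper)."""
    angle %= 360

    def rotate_point(x: float, y: float) -> Tuple[float, float]:
        if angle == 0:
            return x, y
        if angle == 90:
            return y, width - x
        if angle == 180:
            return width - x, height - y
        if angle == 270:
            return height - y, x
        raise ValueError(f"unsupported angle: {angle}")

    left, top, right, bottom = bbox
    x0, y0 = rotate_point(left, top)
    x1, y1 = rotate_point(right, bottom)
    # Corners swap roles after rotation, so re-derive the extremes.
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)


def detect_angle_or_zero(detect: Callable[[], int]) -> int:
    """Run an orientation detector; on failure, log it and assume no rotation."""
    try:
        return detect()
    except Exception as exc:  # illustrative catch-all for this sketch
        _log.warning("Orientation detection failed, continuing without rotation: %s", exc)
        return 0
```

To map OCR boxes found on a rotated image back to the original page, the same helper can be applied with the inverse angle and the rotated image's dimensions, for example `rotate_bbox(box, 270, rotated_width, rotated_height)` to undo a 90 degree rotation.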
2203.01017v2.doctags.txt fix: restrict click version and update lock file (#1582) 2025-05-13 10:40:08 +02:00
2203.01017v2.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
2203.01017v2.md feat: Use new TableFormer model weights and default to accurate model version (#1100) 2025-03-11 10:53:49 +01:00
2203.01017v2.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
2206.01062.doctags.txt fix: restrict click version and update lock file (#1582) 2025-05-13 10:40:08 +02:00
2206.01062.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
2206.01062.md feat: Use new TableFormer model weights and default to accurate model version (#1100) 2025-03-11 10:53:49 +01:00
2206.01062.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
2305.03393v1-pg9.doctags.txt feat: Use new TableFormer model weights and default to accurate model version (#1100) 2025-03-11 10:53:49 +01:00
2305.03393v1-pg9.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
2305.03393v1-pg9.md feat: Use new TableFormer model weights and default to accurate model version (#1100) 2025-03-11 10:53:49 +01:00
2305.03393v1-pg9.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
2305.03393v1.doctags.txt fix: restrict click version and update lock file (#1582) 2025-05-13 10:40:08 +02:00
2305.03393v1.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
2305.03393v1.md feat: Use new TableFormer model weights and default to accurate model version (#1100) 2025-03-11 10:53:49 +01:00
2305.03393v1.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
amt_handbook_sample.doctags.txt fix: restrict click version and update lock file (#1582) 2025-05-13 10:40:08 +02:00
amt_handbook_sample.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
amt_handbook_sample.md docs: Add example for inspection of picture content (#624) 2025-01-29 10:39:00 +01:00
amt_handbook_sample.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
code_and_formula.doctags.txt feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
code_and_formula.json chore: format JSON test files to enable comparison (#1511) 2025-05-02 10:52:18 +02:00
code_and_formula.md feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
code_and_formula.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
multi_page.doctags.txt fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549) 2025-05-19 15:26:00 +02:00
multi_page.json fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549) 2025-05-19 15:26:00 +02:00
multi_page.md fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549) 2025-05-19 15:26:00 +02:00
multi_page.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
picture_classification.doctags.txt fix: restrict click version and update lock file (#1582) 2025-05-13 10:40:08 +02:00
picture_classification.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
picture_classification.md feat: New document picture classifier (#805) 2025-01-24 18:05:51 +01:00
picture_classification.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
redp5110_sampled.doctags.txt fix: restrict click version and update lock file (#1582) 2025-05-13 10:40:08 +02:00
redp5110_sampled.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
redp5110_sampled.md fix: Proper handling of orphan IDs in layout postprocessing (#1118) 2025-03-05 14:30:59 +01:00
redp5110_sampled.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
right_to_left_01.doctags.txt feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
right_to_left_01.json chore: format JSON test files to enable comparison (#1511) 2025-05-02 10:52:18 +02:00
right_to_left_01.md feat: Implement new reading-order model (#916) 2025-02-20 17:51:17 +01:00
right_to_left_01.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
right_to_left_02.doctags.txt fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_02.json chore: format JSON test files to enable comparison (#1511) 2025-05-02 10:52:18 +02:00
right_to_left_02.md fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_02.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
right_to_left_03.doctags.txt fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_03.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00
right_to_left_03.md fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) 2025-02-07 08:43:31 +01:00
right_to_left_03.pages.json feat(ocr): auto-detect rotated pages in Tesseract (#1167) 2025-05-21 18:12:33 +02:00