feat(ocr): auto-detect rotated pages in Tesseract (#1167)

* fix(ocr): support mis-oriented documents in tesseract

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
Clément Doumouro
2025-05-21 18:12:33 +02:00
committed by GitHub
parent 90875247e5
commit 45265bf8b1
96 changed files with 9864 additions and 5258 deletions
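
The bounding box rotation utility mentioned in the commit message is not shown in this excerpt. As a rough illustration only, the sketch below maps a top-left-origin box into the coordinate frame of a page rotated counter-clockwise by a multiple of 90 degrees. The name `rotate_bbox` and its signature are hypothetical and are not taken from the repository's `orientation.py`.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (l, t, r, b), top-left origin


def rotate_bbox(bbox: BBox, angle: int, width: float, height: float) -> BBox:
    """Hypothetical helper: map a box into the frame of a page rotated
    counter-clockwise by `angle` degrees (a multiple of 90).

    `width` and `height` are the page size before rotation.
    """
    l, t, r, b = bbox
    angle %= 360
    if angle == 0:
        return (l, t, r, b)
    if angle == 90:
        # The right edge of the page becomes the top edge.
        return (t, width - r, b, width - l)
    if angle == 180:
        return (width - r, height - b, width - l, height - t)
    if angle == 270:
        # The left edge of the page becomes the top edge.
        return (height - b, l, height - t, r)
    raise ValueError(f"angle must be a multiple of 90, got {angle}")


# Example: a box on a 100 x 200 page, rotated by 90 degrees.
print(rotate_bbox((10.0, 20.0, 30.0, 60.0), 90, width=100.0, height=200.0))
# prints (20.0, 70.0, 60.0, 90.0)
```

After a 90 or 270 degree rotation the page dimensions swap, so a caller chaining rotations would pass the swapped `width` and `height` on the next call.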

@@ -1390,7 +1390,7 @@
"id": 2,
"label": "picture",
"bbox": {
-"l": 134.9199981689453,
+"l": 134.9200439453125,
"t": 304.890625,
"r": 475.6635437011719,
"b": 510.21826171875,
@@ -2174,7 +2174,7 @@
"id": 2,
"label": "picture",
"bbox": {
-"l": 134.9199981689453,
+"l": 134.9200439453125,
"t": 304.890625,
"r": 475.6635437011719,
"b": 510.21826171875,
@@ -2909,7 +2909,7 @@
"id": 2,
"label": "picture",
"bbox": {
-"l": 134.9199981689453,
+"l": 134.9200439453125,
"t": 304.890625,
"r": 475.6635437011719,
"b": 510.21826171875,
@@ -3623,7 +3623,7 @@
"b": 268.20489999999995,
"coord_origin": "TOPLEFT"
},
-"confidence": 0.987092912197113,
+"confidence": 0.9870928525924683,
"cells": [
{
"index": 0,
@@ -3938,7 +3938,7 @@
"b": 532.05774,
"coord_origin": "TOPLEFT"
},
-"confidence": 0.9494234323501587,
+"confidence": 0.9494236707687378,
"cells": [
{
"index": 12,
@@ -4302,7 +4302,7 @@
"id": 2,
"label": "picture",
"bbox": {
-"l": 218.81556701660156,
+"l": 218.8155517578125,
"t": 278.0153503417969,
"r": 391.96246337890625,
"b": 508.89410400390625,
@@ -4337,7 +4337,7 @@
"b": 268.20489999999995,
"coord_origin": "TOPLEFT"
},
-"confidence": 0.987092912197113,
+"confidence": 0.9870928525924683,
"cells": [
{
"index": 0,
@@ -4658,7 +4658,7 @@
"b": 532.05774,
"coord_origin": "TOPLEFT"
},
-"confidence": 0.9494234323501587,
+"confidence": 0.9494236707687378,
"cells": [
{
"index": 12,
@@ -5040,7 +5040,7 @@
"id": 2,
"label": "picture",
"bbox": {
-"l": 218.81556701660156,
+"l": 218.8155517578125,
"t": 278.0153503417969,
"r": 391.96246337890625,
"b": 508.89410400390625,
@@ -5072,7 +5072,7 @@
"b": 268.20489999999995,
"coord_origin": "TOPLEFT"
},
-"confidence": 0.987092912197113,
+"confidence": 0.9870928525924683,
"cells": [
{
"index": 0,
@@ -5393,7 +5393,7 @@
"b": 532.05774,
"coord_origin": "TOPLEFT"
},
-"confidence": 0.9494234323501587,
+"confidence": 0.9494236707687378,
"cells": [
{
"index": 12,
@@ -5729,7 +5729,7 @@
"id": 2,
"label": "picture",
"bbox": {
-"l": 218.81556701660156,
+"l": 218.8155517578125,
"t": 278.0153503417969,
"r": 391.96246337890625,
"b": 508.89410400390625,