feat(ocr): auto-detect rotated pages in Tesseract (#1167)

* fix(ocr): support mis-oriented documents in Tesseract

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid swallowing tesseract errors that cause orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
This commit is contained in:
Clément Doumouro
2025-05-21 18:12:33 +02:00
committed by GitHub
parent 90875247e5
commit 45265bf8b1
96 changed files with 9864 additions and 5258 deletions
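
Note on the rotation handling described in the message above. This is a minimal sketch, not the code from this commit. It assumes OSD yields the counter-clockwise angle (a multiple of 90) by which the page image is rotated before OCR, and that boxes are `(left, top, right, bottom)` with a top-left origin. `OsdError`, `detect_angle_or_zero`, and `map_bbox_to_original` are hypothetical names used only for illustration.

```python
import logging
from typing import Callable, Tuple

_log = logging.getLogger(__name__)

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom), top-left origin


class OsdError(RuntimeError):
    """Hypothetical error raised when orientation detection fails."""


def detect_angle_or_zero(detect: Callable[[], int]) -> int:
    """Return the detected angle, or 0 so that OCR proceeds without rotation."""
    try:
        angle = detect() % 360
    except OsdError as exc:
        # Log the failure instead of hiding it, then fall back to no rotation.
        _log.warning("OSD failed, running OCR without rotation: %s", exc)
        return 0
    if angle not in (0, 90, 180, 270):
        raise ValueError(f"unsupported rotation angle: {angle}")
    return angle


def map_bbox_to_original(
    bbox: BBox, angle: int, original_size: Tuple[float, float]
) -> BBox:
    """Map a box found on the rotated image back to the unrotated page.

    `angle` is the counter-clockwise rotation that was applied to the page;
    `original_size` is the (width, height) of the page before rotation.
    """
    width, height = original_size

    def to_original(x: float, y: float) -> Tuple[float, float]:
        if angle == 0:
            return x, y
        if angle == 90:
            return width - y, x
        if angle == 180:
            return width - x, height - y
        if angle == 270:
            return y, height - x
        raise ValueError(f"unsupported rotation angle: {angle}")

    x0, y0 = to_original(bbox[0], bbox[1])
    x1, y1 = to_original(bbox[2], bbox[3])
    # Rotation can swap which corner is top-left, so re-normalize.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))
```

With a 100x200 page rotated by 90 degrees, a box `(10, 20, 30, 60)` found on the rotated image maps back to `(40, 10, 80, 30)` on the original page. Its width and height swap, as expected.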


@@ -2498,9 +2498,9 @@
 {
 "bbox": [
 148.45364379882812,
-366.1538391113281,
+366.1537780761719,
 464.3608093261719,
-583.6257476806641
+583.6257629394531
 ],
 "page": 2,
 "span": [
@@ -2541,9 +2541,9 @@
 "prov": [
 {
 "bbox": [
-164.6503143310547,
+164.65028381347656,
 511.6590576171875,
-449.550537109375,
+449.5505676269531,
 628.2029113769531
 ],
 "page": 7,
@@ -2563,7 +2563,7 @@
 "prov": [
 {
 "bbox": [
-140.70960998535156,
+140.70968627929688,
 198.32281494140625,
 472.73382568359375,
 283.9361572265625
@@ -2585,10 +2585,10 @@
 "prov": [
 {
 "bbox": [
-162.67434692382812,
-128.786376953125,
-451.70068359375,
-347.3774719238281
+162.67430114746094,
+128.78643798828125,
+451.70062255859375,
+347.37744140625
 ],
 "page": 10,
 "span": [
@@ -2607,9 +2607,9 @@
 "prov": [
 {
 "bbox": [
-168.3928985595703,
+168.39285278320312,
 157.99432373046875,
-447.3513488769531,
+447.35137939453125,
 610.0334930419922
 ],
 "page": 11,
@@ -4065,7 +4065,7 @@
 143.6376495361328,
 528.7375183105469,
 470.8485412597656,
-635.6522827148438
+635.6522979736328
 ],
 "page": 10,
 "span": [