mirror of https://github.com/DS4SD/docling.git (synced 2025-12-09 05:08:14 +00:00)
feat(ocr): auto-detect rotated pages in Tesseract (#1167)
* fix(ocr): support mis-oriented documents in Tesseract
* fix(ocr): update missing test data
* fix(ocr): rotate image to the natural orientation before layout prediction
* fix(ocr): move bounding box rotation util to orientation.py
* fix(ocr): refactor rotation utilities
* chore(ocr): revert layout updates
* chore(ocr): update e2e OCR test data
* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures
* chore(ocr): revert layout updates
* chore(ocr): update e2e OCR test data
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`
* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`
* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`
* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation
* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
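The commit message describes two behaviours: falling back to plain OCR when orientation and script detection (OSD) fails, and mapping bounding boxes found on the rotated page back to the original page. The sketch below illustrates both ideas in isolation. It is not the docling implementation: the names `rotate_box_back` and `detect_angle_or_zero` are hypothetical, and it assumes that the upright image was produced by rotating the original page counter-clockwise by a multiple of 90 degrees, with boxes given as (left, top, right, bottom) from a top-left origin.

```python
import logging
from typing import Callable, Tuple

_log = logging.getLogger(__name__)

# (left, top, right, bottom) with the origin at the top-left corner.
Box = Tuple[float, float, float, float]


def rotate_box_back(box: Box, angle: int, width: float, height: float) -> Box:
    """Map a box from the rotated image back to the original page.

    Assumption: the rotated image was obtained by turning the original
    page (size width x height) counter-clockwise by `angle` degrees.
    """
    left, top, right, bottom = box
    angle %= 360
    if angle == 0:
        return (left, top, right, bottom)
    if angle == 90:
        # Rotated image is height x width; x = width - y', y = x'.
        return (width - bottom, left, width - top, right)
    if angle == 180:
        return (width - right, height - bottom, width - left, height - top)
    if angle == 270:
        # Rotated image is height x width; x = y', y = height - x'.
        return (top, height - right, bottom, height - left)
    raise ValueError(f"unsupported rotation angle: {angle}")


def detect_angle_or_zero(detect: Callable[[], int]) -> int:
    """Run a (hypothetical) OSD callable; on failure, log and return 0.

    Returning 0 means OCR proceeds on the page without any rotation.
    """
    try:
        return detect()
    except Exception as exc:  # broad on purpose: any OSD failure falls back
        _log.warning("OSD failed, running OCR without rotation: %s", exc)
        return 0
```

A caller would first obtain the angle with `detect_angle_or_zero`, run OCR on the rotated page, then pass every resulting box through `rotate_box_back` with the original page size. Logging the failure rather than silently ignoring it keeps the fallback visible, which matches the intent stated in the commit message.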
@@ -133,7 +133,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 1.1",
-  "text": "Paragraph 1.1"
+  "text": "Paragraph 1.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/5",
@@ -157,7 +163,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 1.2",
-  "text": "Paragraph 1.2"
+  "text": "Paragraph 1.2",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/7",
@@ -222,7 +234,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 1.1.1",
-  "text": "Paragraph 1.1.1"
+  "text": "Paragraph 1.1.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/11",
@@ -246,7 +264,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 1.1.2",
-  "text": "Paragraph 1.1.2"
+  "text": "Paragraph 1.1.2",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/13",
@@ -314,7 +338,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 1.1.1",
-  "text": "Paragraph 1.1.1"
+  "text": "Paragraph 1.1.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/17",
@@ -338,7 +368,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 1.1.2",
-  "text": "Paragraph 1.1.2"
+  "text": "Paragraph 1.1.2",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/19",
@@ -406,7 +442,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 1.2.3.1",
-  "text": "Paragraph 1.2.3.1"
+  "text": "Paragraph 1.2.3.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/23",
@@ -430,7 +472,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 1.2.3.1",
-  "text": "Paragraph 1.2.3.1"
+  "text": "Paragraph 1.2.3.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/25",
@@ -513,7 +561,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 2.1",
-  "text": "Paragraph 2.1"
+  "text": "Paragraph 2.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/30",
@@ -537,7 +591,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 2.2",
-  "text": "Paragraph 2.2"
+  "text": "Paragraph 2.2",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/32",
@@ -602,7 +662,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 2.1.1.1",
-  "text": "Paragraph 2.1.1.1"
+  "text": "Paragraph 2.1.1.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/36",
@@ -626,7 +692,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 2.1.1.1",
-  "text": "Paragraph 2.1.1.1"
+  "text": "Paragraph 2.1.1.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/38",
@@ -694,7 +766,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 2.1.1",
-  "text": "Paragraph 2.1.1"
+  "text": "Paragraph 2.1.1",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/42",
@@ -718,7 +796,13 @@
   "label": "paragraph",
   "prov": [],
   "orig": "Paragraph 2.1.2",
-  "text": "Paragraph 2.1.2"
+  "text": "Paragraph 2.1.2",
+  "formatting": {
+    "bold": false,
+    "italic": false,
+    "underline": false,
+    "strikethrough": false
+  }
 },
 {
   "self_ref": "#/texts/44",