feat: Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745)

mirror of https://github.com/DS4SD/docling.git synced 2025-12-08 20:58:11 +00:00

* Keep page.parsed_page.textline_cells and page.cells in sync, including OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make page.parsed_page the only source of truth for text cells

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Correctly compute PDF boxes from pymupdf

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Use different OCR engine order

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add type hints and fix mypy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* One more test fix

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove with pypdfium2_lock from caller sites

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix typing

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
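
For illustration only, a minimal sketch of what "only source of truth" could look like, assuming `page.cells` becomes a read-through view of `page.parsed_page.textline_cells` and OCR cells are appended to that same list. The classes and the `add_ocr_cells` helper below are hypothetical stand-ins, not docling's actual definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextCell:
    """Hypothetical stand-in for a text cell; not docling's actual class."""
    text: str
    from_ocr: bool = False


@dataclass
class ParsedPage:
    """Hypothetical stand-in for the parsed page that owns the cells."""
    textline_cells: List[TextCell] = field(default_factory=list)


@dataclass
class Page:
    """Hypothetical page whose `cells` is a view, never a second copy."""
    parsed_page: Optional[ParsedPage] = None

    @property
    def cells(self) -> List[TextCell]:
        # Always read through to parsed_page, so the two cannot drift apart.
        if self.parsed_page is None:
            return []
        return self.parsed_page.textline_cells


def add_ocr_cells(page: Page, ocr_cells: List[TextCell]) -> None:
    """Hypothetical helper: store OCR results in the single shared list."""
    if page.parsed_page is None:
        page.parsed_page = ParsedPage()
    page.parsed_page.textline_cells.extend(ocr_cells)


page = Page(parsed_page=ParsedPage([TextCell("programmatic text")]))
add_ocr_cells(page, [TextCell("scanned text", from_ocr=True)])
```

Because `cells` holds no state of its own in this sketch, there is nothing to keep in sync: any consumer that reads `page.cells` sees the OCR cells as soon as they are added to `parsed_page`.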

Author: Christoph Auer
Date: 2025-06-13 19:01:55 +02:00
Committed by: GitHub
Parent: 0432a31b2f
Commit: 7d3302cb48

50 changed files with 339091 additions and 330047 deletions

89985 tests/data/groundtruth/docling_v2/2206.01062.pages.json vendored

File diff suppressed because it is too large
