docs: Example on how to apply external OCR as post processing (#2517)

* Example of how to apply OCR to a Docling Document as a post-processing step with "nanonets-ocr2-3b" via LM Studio

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added support for elements with multiple provenances

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaning up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* improved prompt for nanonets-ocr2-3b

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaning up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* excluded example from CI

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* updated class name

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Improved usability of the example, added simple cli, and some helper functions

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix api_image_request usage

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix pydantic errors

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Improvements and corrections

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added string sanitization: line breaks are removed from remote OCR output, and the original text from the JSON is also preserved

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added quick and reliable detection of empty image crops (elements, table cells, form items); these are not sent to OCR

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Example respects ocr_documents.txt, tuned empty crop detection

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaning api_image_request

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
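
The "excluded example from CI" step above adds `post_process_ocr_with_vlm` to the `EXAMPLES_TO_SKIP` pattern. As a minimal sketch of how such a pattern filters example files, the snippet below assumes each example's base file name is matched against the regex. The workflow's actual matching command is not shown in this commit, and `should_skip`, the `docs/examples/` paths, and `some_other_example.py` are hypothetical names used only for illustration.

```python
import re

# Pattern copied from the updated EXAMPLES_TO_SKIP value in the workflow diff.
EXAMPLES_TO_SKIP = re.compile(
    r"^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline"
    r"|minimal_asr_pipeline|export_multimodal|custom_convert"
    r"|develop_picture_enrichment|rapidocr_with_custom_models"
    r"|suryaocr_with_custom_models|offline_convert|pictures_description"
    r"|pictures_description_api|vlm_pipeline_api_model"
    r"|granitedocling_repetition_stopping|mlx_whisper_example"
    r"|gpu_standard_pipeline|gpu_vlm_pipeline|demo_layout_vlm"
    r"|post_process_ocr_with_vlm)\.py$"
)


def should_skip(path: str) -> bool:
    """Return True if the file's base name matches the skip pattern.

    Hypothetical helper: it assumes the CI compares base names, not full paths.
    """
    name = path.rsplit("/", 1)[-1]
    return EXAMPLES_TO_SKIP.search(name) is not None


# Hypothetical candidate list; the directory layout is an assumption.
candidates = [
    "docs/examples/post_process_ocr_with_vlm.py",
    "docs/examples/minimal.py",
    "docs/examples/some_other_example.py",
]

# Only examples that do not match the pattern would be executed in CI.
to_run = [p for p in candidates if not should_skip(p)]
print(to_run)
```

Because the pattern is anchored with `^` and `$`, only exact file names are skipped; a file such as `post_process_ocr_with_vlm_v2.py` would still run.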
Maxim Lysak
2025-11-27 11:04:40 +01:00
committed by GitHub
parent 0049857c7d
commit fa21128138
3 changed files with 821 additions and 40 deletions
@@ -20,7 +20,7 @@ env:
tests/test_asr_pipeline.py
tests/test_threaded_pipeline.py
PYTEST_TO_SKIP: |-
- EXAMPLES_TO_SKIP: '^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|suryaocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model|granitedocling_repetition_stopping|mlx_whisper_example|gpu_standard_pipeline|gpu_vlm_pipeline|demo_layout_vlm)\.py$'
+ EXAMPLES_TO_SKIP: '^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|suryaocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model|granitedocling_repetition_stopping|mlx_whisper_example|gpu_standard_pipeline|gpu_vlm_pipeline|demo_layout_vlm|post_process_ocr_with_vlm)\.py$'
jobs:
lint: