docs: explain OCR options

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Michele Dolfi 2024-10-08 10:54:43 +02:00
parent 471daee277
commit 73108d597c


@@ -52,6 +52,71 @@ Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectu
```
</details>
<details>
<summary><b>Alternative OCR engines</b></summary>

Docling supports multiple OCR engines for processing scanned documents. The current version provides
the following engines.

| Engine | Installation | Usage |
| ------ | ------------ | ----- |
| [EasyOCR](https://github.com/JaidedAI/EasyOCR) | Default in Docling or via `pip install easyocr`. | `EasyOcrOptions` |
| Tesseract | System dependency. See description for Tesseract and Tesserocr below. | `TesseractOcrOptions` |
| Tesseract CLI | System dependency. See description below. | `TesseractCliOcrOptions` |

The Docling `DocumentConverter` lets you choose the OCR engine through the `ocr_options` setting. For example:
```python
# PipelineOptions and the OCR option classes are imported from pipeline_options
from docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter

pipeline_options = PipelineOptions()
pipeline_options.do_ocr = True  # enable OCR
pipeline_options.ocr_options = TesseractOcrOptions()  # Use Tesseract; swap in EasyOcrOptions() for EasyOCR

doc_converter = DocumentConverter(
    pipeline_options=pipeline_options,
)
```
#### Tesseract installation
[Tesseract](https://github.com/tesseract-ocr/tesseract) is a popular OCR engine that is available
on most operating systems. To use this engine with Docling, Tesseract must be installed on your
system with the package manager of your choice. Example commands are given below.
For macOS, we recommend using [Homebrew](https://brew.sh/):
```console
brew install tesseract leptonica pkg-config
```
For Debian-based systems:
```console
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
```
For RHEL systems:
```console
dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel
```
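After running one of the commands above, it can help to confirm that the `tesseract` binary is actually reachable. The following is a minimal sketch using only the Python standard library; the helper name `tesseract_version` is hypothetical and not part of Docling.

```python
import shutil
import subprocess


def tesseract_version(binary: str = "tesseract"):
    """Return the first line printed by `<binary> --version`.

    Returns None if the binary is not on PATH or prints nothing.
    `tesseract_version` is an illustrative helper, not a Docling API.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract not found on PATH")
```

If this prints "tesseract not found on PATH", the system package is probably missing or installed outside the current `PATH`.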
#### Linking to Tesseract
The most efficient way to use the Tesseract library is via linking. Docling uses
the [Tesserocr](https://github.com/sirfz/tesserocr) package for this.
If you run into issues installing Tesserocr, we suggest the following
installation options:
```console
pip uninstall tesserocr
pip install --no-binary :all: tesserocr
```
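To decide which of the option classes from the table above fits a given machine, one possible approach is to probe for the `tesserocr` module and the `tesseract` binary. The sketch below returns only the class name as a string so that it runs without Docling installed. The helper `pick_ocr_options_name` and its preference order (linked Tesseract, then the CLI, then EasyOCR) are assumptions made for illustration, not Docling behaviour.

```python
import importlib.util
import shutil


def pick_ocr_options_name(has_tesserocr=None, has_tesseract_cli=None):
    """Suggest an OCR options class name based on what is installed.

    Hypothetical helper: the preference order is an assumption.
    Passing explicit booleans skips the environment probes.
    """
    if has_tesserocr is None:
        # True when the tesserocr Python package can be imported.
        has_tesserocr = importlib.util.find_spec("tesserocr") is not None
    if has_tesseract_cli is None:
        # True when a `tesseract` executable is on PATH.
        has_tesseract_cli = shutil.which("tesseract") is not None

    if has_tesserocr:
        return "TesseractOcrOptions"
    if has_tesseract_cli:
        return "TesseractCliOcrOptions"
    return "EasyOcrOptions"


if __name__ == "__main__":
    print(pick_ocr_options_name())
```

The returned name could then be looked up in `docling.datamodel.pipeline_options` and an instance assigned to `pipeline_options.ocr_options`, as in the earlier example.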
</details>
<details>
<summary><b>Docling development setup</b></summary>