diff --git a/README.md b/README.md
index a975cdcd..882f3a8a 100644
--- a/README.md
+++ b/README.md
@@ -85,23 +85,31 @@ Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectu
 [Tesseract](https://github.com/tesseract-ocr/tesseract) is a popular OCR engine which is available
 on most operating systems. For using this engine with Docling, Tesseract must be installed on your
 system, using the packaging tool of your choice. Below we provide example commands.
+After installing Tesseract you are expected to provide the path to its language files using the
+`TESSDATA_PREFIX` environment variable (note that it must terminate with a slash `/`).
 
 For macOS, we reccomend using [Homebrew](https://brew.sh/).
 
 ```console
 brew install tesseract leptonica pkg-config
+TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
+echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
 ```
 
 For Debian-based systems.
 
 ```console
 apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
+TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
+echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
 ```
 
 For RHEL systems.
 
 ```console
-dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel
+dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel
+TESSDATA_PREFIX=/usr/share/tesseract/tessdata/
+echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
 ```
 
 #### Linking to Tesseract
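
Review note: the added text says `TESSDATA_PREFIX` must end with a slash, but the Debian lookup (`dpkg -L ... | grep tessdata$`) matches a line that ends in `tessdata`, so its result has no trailing slash. The assignments are also plain shell variables, which child processes only see once they are exported. A minimal sketch of one way to handle both, assuming a POSIX shell; `ensure_trailing_slash` is a hypothetical helper and `raw_prefix` is a placeholder path, neither is part of Docling or Tesseract:

```shell
# Hypothetical helper: print the given path, appending "/" if it does not already end with one.
ensure_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

# Placeholder standing in for the output of the Debian lookup; the real path
# depends on the installed Tesseract package.
raw_prefix="/usr/share/tesseract-ocr/tessdata"

TESSDATA_PREFIX=$(ensure_trailing_slash "$raw_prefix")
# Export so that programs started from this shell can read the variable.
export TESSDATA_PREFIX
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
```

On a Debian-based system the placeholder could be replaced by the command substitution from the diff; whether the README should do this inline or simply append `/` by hand is left to the author.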