docs: Describe examples (#2262)

* Update .py examples with clearer guidance, update out of date imports and calls

  Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

* Fix minimal.py string error, fix ruff format error

  Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

* fix more CI issues

  Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>

---------

Signed-off-by: Mingxuan Zhao <43148277+mingxzhao@users.noreply.github.com>
2025-12-08 20:58:11 +00:00 · 2025-09-16 10:00:38 -04:00
parent 0e95171dd6
commit ff351fd40c
21 changed files with 608 additions and 85 deletions
--- a/docs/examples/tesseract_lang_detection.py
+++ b/docs/examples/tesseract_lang_detection.py
@@ -1,3 +1,22 @@
+# %% [markdown]
+# Detect language automatically with Tesseract OCR and force full-page OCR.
+#
+# What this example does
+# - Configures Tesseract (CLI in this snippet) with `lang=["auto"]`.
+# - Forces full-page OCR and prints the recognized text as Markdown.
+#
+# How to run
+# - From the repo root: `python docs/examples/tesseract_lang_detection.py`.
+# - Ensure the Tesseract CLI is installed and on PATH (or the library is installed).
+#
+# Notes
+# - You can switch to `TesseractOcrOptions` instead of `TesseractCliOcrOptions`.
+# - Language packs must be installed; set `TESSDATA_PREFIX` if Tesseract
+#   cannot find language data. Using `lang=["auto"]` requires traineddata
+#   for script/language detection (e.g. Tesseract's `osd.traineddata`).
+
+# %%
+
 from pathlib import Path

 from docling.datamodel.base_models import InputFormat