merged with main

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-07-30 22:14:37 +00:00 · 2024-11-17 06:03:12 +01:00 · b0e5154d87
commit b0e5154d87
parent 7c494270ac 7dbdbdeaf3
47 changed files with 834 additions and 258 deletions
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@ -0,0 +1,26 @@
+---
+name: Bug report
+about: Report an issue to help improve Docling
+title: ''
+labels: bug
+assignees: ''
+
+---
+
+### Bug
+<!-- Describe the buggy behavior you have observed. -->
+...
+
+### Steps to reproduce
+<!-- Describe the sequence of steps for reproducing the bug. -->
+...
+
+### Docling version
+<!-- Copy the output of `docling --version`. -->
+...
+
+### Python version
+<!-- Copy the output of `python --version`. -->
+...
+
+<!-- ⚠️ ATTENTION: When sharing screenshots, attachments, or other data make sure not to include any sensitive information. -->
--- a/.github/ISSUE_TEMPLATE/config.yml
+++ b/.github/ISSUE_TEMPLATE/config.yml
@ -0,0 +1 @@
+blank_issues_enabled: false
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@ -0,0 +1,18 @@
+---
+name: Feature request
+about: Suggest an idea for enhancing Docling
+title: ''
+labels: enhancement
+assignees: ''
+
+---
+
+### Requested feature
+<!-- Describe the feature you have in mind and the user need it addresses. -->
+...
+
+### Alternatives
+<!-- Describe any alternatives you have considered. -->
+...
+
+<!-- ⚠️ ATTENTION: When sharing screenshots, attachments, or other data make sure not to include any sensitive information. -->
--- a/.github/ISSUE_TEMPLATE/question.md
+++ b/.github/ISSUE_TEMPLATE/question.md
@ -0,0 +1,14 @@
+---
+name: Question
+about: Ask a question
+title: ''
+labels: question
+assignees: ''
+
+---
+
+### Question
+<!-- Describe what you would like to achieve and which part you need help with. -->
+...
+
+<!-- ⚠️ ATTENTION: When sharing screenshots, attachments, or other data make sure not to include any sensitive information. -->
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@ -3,7 +3,8 @@
 <!-- STEPS TO FOLLOW:
  1. Add a description of the changes (frequently the same as the commit description)
  2. Enter the issue number next to "Resolves #" below (if there is no tracking issue resolved, **remove that section**)
-  3. Follow the steps in the checklist below, starting with the **Commit Message Formatting**.
+  3. Make sure the PR title follows the **Commit Message Formatting**: https://www.conventionalcommits.org/en/v1.0.0/#summary.
+  4. Follow the steps in the checklist below, starting with the **Commit Message Formatting**.
 -->

 <!-- Uncomment this section with the issue number if an issue is being resolved
@ -13,8 +14,6 @@ Resolves #

 **Checklist:**

- [ ] **Commit Message Formatting**: Commit titles and messages follow guidelines in the
-  [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
 - [ ] Documentation has been updated, if necessary.
 - [ ] Examples have been added, if necessary.
 - [ ] Tests have been added, if necessary.
--- a/.github/mergify.yml
+++ b/.github/mergify.yml
@ -0,0 +1,18 @@
+merge_protections:
+  - name: Enforce conventional commit
+    description: Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
+    if:
+      - base = main
+    success_conditions:
+      - "title ~=
+        ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\\(.+\
+        \\))?:"
+  - name: Require two reviewers for test updates
+    description: When test data is updated, we require two reviewers
+    if:
+      - base = main
+      - or:
+          - files ~= ^tests/data
+          - files ~= ^tests/data_scanned
+    success_conditions:
+      - "#approved-reviews-by >= 2"
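To see what the title condition above accepts, here is a minimal sketch in Python. It assumes the YAML double-quoted string unescapes to the pattern shown (`\\(` becomes `\(` and the trailing backslash joins the two lines) and that the match is case-sensitive; it says nothing about how Mergify evaluates the rule internally. The helper name `is_conventional` is hypothetical.

```python
import re

# Pattern as it would read after YAML unescaping (an assumption, see lead-in).
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


titles = [
    "fix: handle single-cell tables",      # plain type prefix
    "feat(OCR): add force_full_page_ocr",  # type with a scope
    "merged with main",                    # no prefix at all
    "Fix: wrong capitalisation",           # prefix is case-sensitive here
]
results = {title: is_conventional(title) for title in titles}
```

Under these assumptions only the first two titles pass; a scope is optional, but the colon after the type (or scope) is required.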
--- a/.github/workflows/docs.yml
+++ b/.github/workflows/docs.yml
@ -11,15 +11,7 @@ jobs:
        runs-on: ubuntu-latest
        steps:
        - uses: actions/checkout@v4
-        - name: Install poetry
-          run: pipx install poetry==1.8.3
-          shell: bash
-        - uses: actions/setup-python@v5
-          with:
-              cache: 'poetry'
-        - name: Install dependencies
-          run: poetry install --only docs
-          shell: bash
+        - uses: ./.github/actions/setup-poetry
        - name: Build docs
          run: poetry run mkdocs build --verbose --clean
        - name: Build and push docs
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,3 +1,55 @@
+## [v2.5.2](https://github.com/DS4SD/docling/releases/tag/v2.5.2) - 2024-11-13
+
+### Fix
+
+* Skip glm model downloads ([#322](https://github.com/DS4SD/docling/issues/322)) ([`c9341bf`](https://github.com/DS4SD/docling/commit/c9341bf22e08920284cbc14821c190eaf6abf8a6))
+
+## [v2.5.1](https://github.com/DS4SD/docling/releases/tag/v2.5.1) - 2024-11-12
+
+### Fix
+
+* Handling of single-cell tables in DOCX backend ([#314](https://github.com/DS4SD/docling/issues/314)) ([`fb8ba86`](https://github.com/DS4SD/docling/commit/fb8ba861e28eda0079daa44fb1ea3ed17745f1d2))
+
+### Documentation
+
+* Hybrid RAG with Qdrant ([#312](https://github.com/DS4SD/docling/issues/312)) ([`7f5d35e`](https://github.com/DS4SD/docling/commit/7f5d35ea3c225ce1ce7328825842f98755c0104f))
+* Add Data Prep Kit integration ([#316](https://github.com/DS4SD/docling/issues/316)) ([`93fc1be`](https://github.com/DS4SD/docling/commit/93fc1be61abfe0669daf26c0984a51ec8675bf62))
+
+## [v2.5.0](https://github.com/DS4SD/docling/releases/tag/v2.5.0) - 2024-11-12
+
+### Feature
+
+* **OCR:** Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning ([#290](https://github.com/DS4SD/docling/issues/290)) ([`c6b3763`](https://github.com/DS4SD/docling/commit/c6b3763ecb6ef862840a30978ee177b907f86505))
+
+### Fix
+
+* Configure env prefix for docling settings ([#315](https://github.com/DS4SD/docling/issues/315)) ([`5d4a10b`](https://github.com/DS4SD/docling/commit/5d4a10b121317fa481208dacbee47032b08ff928))
+* Added handling of grouped elements in pptx backend ([#307](https://github.com/DS4SD/docling/issues/307)) ([`81c8243`](https://github.com/DS4SD/docling/commit/81c8243a8bf177feed8f87ea283b5bb6836350cb))
+* Allow mps usage for easyocr ([#286](https://github.com/DS4SD/docling/issues/286)) ([`97f214e`](https://github.com/DS4SD/docling/commit/97f214efddcf66f0734a95c17c08936f6111d113))
+
+### Documentation
+
+* Add navigation indices ([#305](https://github.com/DS4SD/docling/issues/305)) ([`1239ade`](https://github.com/DS4SD/docling/commit/1239ade2750349d13d4e865d88449b232bbad944))
+
+## [v2.4.2](https://github.com/DS4SD/docling/releases/tag/v2.4.2) - 2024-11-08
+
+### Fix
+
+* **EasyOcrModel:** Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr ([#282](https://github.com/DS4SD/docling/issues/282)) ([`0eb065e`](https://github.com/DS4SD/docling/commit/0eb065e9b6e4619d4c412ed98bc7408915ca3f95))
+
+## [v2.4.1](https://github.com/DS4SD/docling/releases/tag/v2.4.1) - 2024-11-08
+
+### Fix
+
+* **tesserocr:** Raise Exception if tesserocr has not loaded any languages ([#279](https://github.com/DS4SD/docling/issues/279)) ([`704d792`](https://github.com/DS4SD/docling/commit/704d792a7997c4ca34f9f9045ed4ae02b4f5df47))
+* Dockerfile example copy command ([#234](https://github.com/DS4SD/docling/issues/234)) ([`90836db`](https://github.com/DS4SD/docling/commit/90836db90accf4a66c9c20544c98452840e3a308))
+
+### Documentation
+
+* Update badges & credits ([#248](https://github.com/DS4SD/docling/issues/248)) ([`a84ec27`](https://github.com/DS4SD/docling/commit/a84ec276b0997c4ba9b32e18e911a966124dc3bc))
+* Add coming-soon section ([#235](https://github.com/DS4SD/docling/issues/235)) ([`5ce02c5`](https://github.com/DS4SD/docling/commit/5ce02c5c598a2efa615ad15f0ead8d752d3ad2ea))
+* Add artifacts-path param to CLI ([#233](https://github.com/DS4SD/docling/issues/233)) ([`d5e65ae`](https://github.com/DS4SD/docling/commit/d5e65aedac23d6849c805a0e88dd06f2a285eb18))
+
 ## [v2.4.0](https://github.com/DS4SD/docling/releases/tag/v2.4.0) - 2024-11-04

 ### Feature
--- a/2
+++ b/2
@ -14,7 +14,7 @@ RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorc
 ENV HF_HOME=/tmp/
 ENV TORCH_HOME=/tmp/

-COPY examples/minimal.py /root/minimal.py
+COPY docs/examples/minimal.py /root/minimal.py

 RUN python -c 'from deepsearch_glm.utils.load_pretrained_models import load_pretrained_nlp_models; load_pretrained_nlp_models(verbose=True);'
 RUN python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf(force=True);'
--- a/README.md
+++ b/README.md
@ -6,6 +6,10 @@

 # Docling

+<p align="center">
+  <a href="https://trendshift.io/repositories/12132" target="_blank"><img src="https://trendshift.io/api/badge/repositories/12132" alt="DS4SD%2Fdocling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
+</p>
+
 [![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
 [![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/)
 [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
@ -19,19 +23,22 @@

 Docling parses documents and exports them to the desired format with ease and speed.

-
 ## Features

 * 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
 * 📑 Advanced PDF document understanding including page layout, reading order & table structures
 * 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
-* 📝 Metadata extraction, including title, authors, references & language
-* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
+* 🤖 Easy integration with LlamaIndex 🦙 & LangChain 🦜🔗 for powerful RAG / QA applications
 * 🔍 OCR support for scanned PDFs
 * 💻 Simple and convenient CLI

 Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty of examples and unlock the full power of Docling!

+### Coming soon
+
+* ♾️ Equation & code extraction
+* 📝 Metadata extraction, including title, authors, references & language
+* 🦜🔗 Native LangChain extension

 ## Installation

@ -57,16 +64,13 @@ result = converter.convert(source)
 print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
 ```

-
 Check out [Getting started](https://ds4sd.github.io/docling/).
 You will find lots of tuning options to leverage all the advanced capabilities.

-
 ## Get help and support

 Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).

-
 ## Technical report

 For more details on Docling's inner workings, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
@ -75,7 +79,6 @@ For more details on Docling's inner workings, check out the [Docling Technical R

 Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.

-
 ## References

 If you use Docling in your projects, please consider citing the following:
@ -95,5 +98,9 @@ If you use Docling in your projects, please consider citing the following:

 ## License

-The Docling codebase is under MIT license. 
+The Docling codebase is under MIT license.
 For individual model usage, please refer to the model licenses found in the original packages.
+
+## IBM ❤️ Open Source AI
+
+Docling has been brought to you by IBM.
--- a/docling/backend/docling_parse_backend.py
+++ b/docling/backend/docling_parse_backend.py
@ -29,7 +29,7 @@ class DoclingParsePageBackend(PdfPageBackend):
            self._dpage = parsed_page["pages"][0]
        else:
            _log.info(
-                f"An error occured when loading page {page_no} of document {document_hash}."
+                f"An error occurred when loading page {page_no} of document {document_hash}."
            )

    def is_valid(self) -> bool:
--- a/docling/backend/docling_parse_v2_backend.py
+++ b/docling/backend/docling_parse_v2_backend.py
@ -31,7 +31,7 @@ class DoclingParseV2PageBackend(PdfPageBackend):
            self._dpage = parsed_page["pages"][0]
        else:
            _log.info(
-                f"An error occured when loading page {page_no} of document {document_hash}."
+                f"An error occurred when loading page {page_no} of document {document_hash}."
            )

    def is_valid(self) -> bool:
--- a/docling/backend/html_backend.py
+++ b/docling/backend/html_backend.py
@ -120,6 +120,8 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
            self.handle_header(element, idx, doc)
        elif element.name in ["p"]:
            self.handle_paragraph(element, idx, doc)
+        elif element.name in ["pre"]:
+            self.handle_code(element, idx, doc)
        elif element.name in ["ul", "ol"]:
            self.handle_list(element, idx, doc)
        elif element.name in ["li"]:
@ -205,6 +207,16 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
                level=hlevel,
            )

+    def handle_code(self, element, idx, doc):
+        """Handles monospace code snippets (pre)."""
+        if element.text is None:
+            return
+        text = element.text.strip()
+        label = DocItemLabel.CODE
+        if len(text) == 0:
+            return
+        doc.add_text(parent=self.parents[self.level], label=label, text=text)
+
    def handle_paragraph(self, element, idx, doc):
        """Handles paragraph tags (p)."""
        if element.text is None:
--- a/docling/backend/mspowerpoint_backend.py
+++ b/docling/backend/mspowerpoint_backend.py
@ -358,41 +358,36 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB

            size = Size(width=slide_width, height=slide_height)
            parent_page = doc.add_page(page_no=slide_ind + 1, size=size)
-            # parent_page = doc.add_page(page_no=slide_ind, size=size, hash=hash)
-
-            # Loop through each shape in the slide
-            for shape in slide.shapes:

+            def handle_shapes(shape, parent_slide, slide_ind, doc):
+                handle_groups(shape, parent_slide, slide_ind, doc)
                if shape.has_table:
                    # Handle Tables
                    self.handle_tables(shape, parent_slide, slide_ind, doc)
-
                if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
-                    # Handle Tables
+                    # Handle Pictures
                    self.handle_pictures(shape, parent_slide, slide_ind, doc)
-
                # If shape doesn't have any text, move on to the next shape
                if not hasattr(shape, "text"):
-                    continue
+                    return
                if shape.text is None:
-                    continue
+                    return
                if len(shape.text.strip()) == 0:
-                    continue
+                    return
                if not shape.has_text_frame:
-                    _log.warn("Warning: shape has text but not text_frame")
-                    continue
-
-                # if shape.is_placeholder:
-                # Handle Titles (Headers) and Subtitles
-                # Check if the shape is a placeholder (titles are placeholders)
-                # self.handle_title(shape, parent_slide, slide_ind, doc)
-                # self.handle_text_elements(shape, parent_slide, slide_ind, doc)
-                # else:
-
+                    _log.warning("Warning: shape has text but not text_frame")
+                    return
                # Handle other text elements, including lists (bullet lists, numbered lists)
                self.handle_text_elements(shape, parent_slide, slide_ind, doc)
+                return

-                # figures...
-                # doc.add_figure(data=BaseFigureData(), parent=self.parents[self.level], caption=None)
+            def handle_groups(shape, parent_slide, slide_ind, doc):
+                if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
+                    for groupedshape in shape.shapes:
+                        handle_shapes(groupedshape, parent_slide, slide_ind, doc)
+
+            # Loop through each shape in the slide
+            for shape in slide.shapes:
+                handle_shapes(shape, parent_slide, slide_ind, doc)

        return doc
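The grouped-shape handling above is a recursive walk over nested shapes. The sketch below shows the same idea on its own, using a hypothetical `Shape` dataclass instead of python-pptx objects. Unlike the diff, which also runs the table, picture and text checks on the group shape itself, this sketch yields only the non-group leaves.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape; not the python-pptx class."""

    name: str
    children: List["Shape"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        # In this sketch a shape counts as a group when it has children.
        return bool(self.children)


def iter_leaf_shapes(shapes: List[Shape]) -> Iterator[Shape]:
    """Yield non-group shapes depth-first, descending into groups."""
    for shape in shapes:
        if shape.is_group:
            yield from iter_leaf_shapes(shape.children)
        else:
            yield shape


slide = [
    Shape("title"),
    Shape("group", [Shape("a"), Shape("inner", [Shape("b")])]),
    Shape("footer"),
]
names = [shape.name for shape in iter_leaf_shapes(slide)]
```

A generator keeps the traversal separate from the per-shape handling, so the handlers never need to know how deeply a shape was nested.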
--- a/docling/backend/msword_backend.py
+++ b/docling/backend/msword_backend.py
@ -9,10 +9,12 @@ from docling_core.types.doc import (
    DoclingDocument,
    DocumentOrigin,
    GroupLabel,
+    ImageRef,
    TableCell,
    TableData,
 )
 from lxml import etree
+from PIL import Image

 from docling.backend.abstract_backend import DeclarativeDocumentBackend
 from docling.datamodel.base_models import InputFormat
@ -130,14 +132,8 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
    def walk_linear(self, body, docx_obj, doc) -> DoclingDocument:
        for element in body:
            tag_name = etree.QName(element).localname
-
-            # Check for Inline Images (drawings or blip elements)
-            found_drawing = etree.ElementBase.xpath(
-                element, ".//w:drawing", namespaces=self.xml_namespaces
-            )
-            found_pict = etree.ElementBase.xpath(
-                element, ".//w:pict", namespaces=self.xml_namespaces
-            )
+            # Check for Inline Images (blip elements)
+            drawing_blip = element.xpath(".//a:blip")

            # Check for Tables
            if element.tag.endswith("tbl"):
@ -146,8 +142,8 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                except Exception:
                    _log.debug("could not parse a table, broken docx table")

-            elif found_drawing or found_pict:
-                self.handle_pictures(element, docx_obj, doc)
+            elif drawing_blip:
+                self.handle_pictures(element, docx_obj, drawing_blip, doc)
            # Check for Text
            elif tag_name in ["p"]:
                self.handle_text_elements(element, docx_obj, doc)
@ -201,7 +197,6 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
            label_str = ""
            label_level = 0
            if parts[0] == "Heading":
-                # print("{} - {}".format(parts[0], parts[1]))
                label_str = parts[0]
                label_level = self.str_to_int(parts[1], default=None)
            if parts[1] == "Heading":
@ -217,19 +212,16 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
        if paragraph.text is None:
            # _log.warn(f"paragraph has text==None")
            return
-
        text = paragraph.text.strip()
         # if len(text)==0 # keep empty paragraphs, they separate adjacent lists!

        # Common styles for bullet and numbered lists.
        # "List Bullet", "List Number", "List Paragraph"
-        # TODO: reliably identify wether list is a numbered list or not
+        # Identify whether the list is a numbered list or not
        # is_numbered = "List Bullet" not in paragraph.style.name
        is_numbered = False
-
        p_style_name, p_level = self.get_label_and_level(paragraph)
        numid, ilevel = self.get_numId_and_ilvl(paragraph)
-        # print("numid: {}, ilevel: {}, text: {}".format(numid, ilevel, text))

        if numid == 0:
            numid = None
@ -450,8 +442,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
        for row in table.rows:
            # Calculate the max number of columns
            num_cols = max(num_cols, sum(get_colspan(cell) for cell in row.cells))
-            # if row.cells:
-            #     num_cols = max(num_cols, len(row.cells))
+
+        if num_rows == 1 and num_cols == 1:
+            cell_element = table.rows[0].cells[0]
+            # In case we have a table of only 1 cell, we consider it furniture
+            # And proceed processing the content of the cell as though it's in the document body
+            self.walk_linear(cell_element._element, docx_obj, doc)
+            return

        # Initialize the table grid
        table_grid = [[None for _ in range(num_cols)] for _ in range(num_rows)]
@ -491,6 +488,24 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
        doc.add_table(data=data, parent=self.parents[level - 1])
        return

-    def handle_pictures(self, element, docx_obj, doc):
-        doc.add_picture(parent=self.parents[self.level], caption=None)
+    def handle_pictures(self, element, docx_obj, drawing_blip, doc):
+        def get_docx_image(element, drawing_blip):
+            rId = drawing_blip[0].get(
+                "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"
+            )
+            if rId in docx_obj.part.rels:
+                # Access the image part using the relationship ID
+                image_part = docx_obj.part.rels[rId].target_part
+                image_data = image_part.blob  # Get the binary image data
+            return image_data
+
+        image_data = get_docx_image(element, drawing_blip)
+        image_bytes = BytesIO(image_data)
+        # Open the BytesIO object with PIL to create an Image
+        pil_image = Image.open(image_bytes)
+        doc.add_picture(
+            parent=self.parents[self.level],
+            image=ImageRef.from_pil(image=pil_image, dpi=72),
+            caption=None,
+        )
        return
--- a/docling/backend/pypdfium2_backend.py
+++ b/docling/backend/pypdfium2_backend.py
@ -29,7 +29,7 @@ class PyPdfiumPageBackend(PdfPageBackend):
            self._ppage: pdfium.PdfPage = pdfium_doc[page_no]
        except PdfiumError as e:
            _log.info(
-                f"An exception occured when loading page {page_no} of document {document_hash}.",
+                f"An exception occurred when loading page {page_no} of document {document_hash}.",
                exc_info=True,
            )
            self.valid = False
--- a/docling/cli/main.py
+++ b/docling/cli/main.py
@ -153,6 +153,13 @@ def convert(
            ..., help="If enabled, the bitmap content will be processed using OCR."
        ),
    ] = True,
+    force_ocr: Annotated[
+        bool,
+        typer.Option(
+            ...,
+            help="Replace any existing text with OCR generated text over the full content.",
+        ),
+    ] = False,
    ocr_engine: Annotated[
        OcrEngine, typer.Option(..., help="The OCR engine to use.")
    ] = OcrEngine.EASYOCR,
@ -178,6 +185,15 @@ def convert(
    output: Annotated[
        Path, typer.Option(..., help="Output directory where results are saved.")
    ] = Path("."),
+    verbose: Annotated[
+        int,
+        typer.Option(
+            "--verbose",
+            "-v",
+            count=True,
+            help="Set the verbosity level. -v for info logging, -vv for debug logging.",
+        ),
+    ] = 0,
    version: Annotated[
        Optional[bool],
        typer.Option(
@ -188,7 +204,12 @@ def convert(
        ),
    ] = None,
 ):
-    logging.basicConfig(level=logging.INFO)
+    if verbose == 0:
+        logging.basicConfig(level=logging.WARNING)
+    elif verbose == 1:
+        logging.basicConfig(level=logging.INFO)
+    elif verbose == 2:
+        logging.basicConfig(level=logging.DEBUG)

    if from_formats is None:
        from_formats = [e for e in InputFormat]
@ -219,11 +240,11 @@ def convert(

    match ocr_engine:
        case OcrEngine.EASYOCR:
-            ocr_options: OcrOptions = EasyOcrOptions()
+            ocr_options: OcrOptions = EasyOcrOptions(force_full_page_ocr=force_ocr)
        case OcrEngine.TESSERACT_CLI:
-            ocr_options = TesseractCliOcrOptions()
+            ocr_options = TesseractCliOcrOptions(force_full_page_ocr=force_ocr)
        case OcrEngine.TESSERACT:
-            ocr_options = TesseractOcrOptions()
+            ocr_options = TesseractOcrOptions(force_full_page_ocr=force_ocr)
        case _:
            raise RuntimeError(f"Unexpected OCR engine type {ocr_engine}")

@ -280,5 +301,7 @@ def convert(
    _log.info(f"All documents were converted in {end_time:.2f} seconds.")


+click_app = typer.main.get_command(app)
+
 if __name__ == "__main__":
    app()
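The `if`/`elif` chain on `verbose` above has no branch for counts above 2, so `-vvv` would not reach any `logging.basicConfig` call in that chain. One possible alternative, sketched here with a hypothetical helper name `level_for_verbosity`, is a lookup that clamps the count:

```python
import logging


def level_for_verbosity(verbose: int) -> int:
    """Map a -v count to a logging level; counts above 2 are clamped to DEBUG."""
    levels = [logging.WARNING, logging.INFO, logging.DEBUG]
    index = max(0, min(verbose, len(levels) - 1))
    return levels[index]


# Usage: configure logging once, from the counted CLI flag.
logging.basicConfig(level=level_for_verbosity(3))
```

This is a sketch of the mapping only; wiring it into the Typer command is left as in the diff.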
--- a/docling/datamodel/pipeline_options.py
+++ b/docling/datamodel/pipeline_options.py
@ -22,6 +22,7 @@ class TableStructureOptions(BaseModel):

 class OcrOptions(BaseModel):
    kind: str
+    force_full_page_ocr: bool = False  # If enabled a full page OCR is always applied
    bitmap_area_threshold: float = (
         0.05  # percentage of the area for a bitmap to be processed with OCR
    )
--- a/docling/datamodel/settings.py
+++ b/docling/datamodel/settings.py
@ -2,7 +2,7 @@ import sys
 from pathlib import Path

 from pydantic import BaseModel
-from pydantic_settings import BaseSettings
+from pydantic_settings import BaseSettings, SettingsConfigDict


 class DocumentLimits(BaseModel):
@ -40,6 +40,8 @@ class DebugSettings(BaseModel):


 class AppSettings(BaseSettings):
+    model_config = SettingsConfigDict(env_prefix="DOCLING_", env_nested_delimiter="_")
+
    perf: BatchConcurrencySettings
    debug: DebugSettings

--- a/docling/models/base_ocr_model.py
+++ b/docling/models/base_ocr_model.py
@ -10,7 +10,7 @@ from PIL import Image, ImageDraw
 from rtree import index
 from scipy.ndimage import find_objects, label

-from docling.datamodel.base_models import OcrCell, Page
+from docling.datamodel.base_models import Cell, OcrCell, Page
 from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import OcrOptions
 from docling.datamodel.settings import settings
@ -73,7 +73,9 @@ class BaseOcrModel(BasePageModel):
        coverage, ocr_rects = find_ocr_rects(page.size, bitmap_rects)

        # return full-page rectangle if sufficiently covered with bitmaps
-        if coverage > max(BITMAP_COVERAGE_TRESHOLD, self.options.bitmap_area_threshold):
+        if self.options.force_full_page_ocr or coverage > max(
+            BITMAP_COVERAGE_TRESHOLD, self.options.bitmap_area_threshold
+        ):
            return [
                BoundingBox(
                    l=0,
@ -96,7 +98,7 @@ class BaseOcrModel(BasePageModel):
            return ocr_rects

    # Filters OCR cells by dropping any OCR cell that intersects with an existing programmatic cell.
-    def filter_ocr_cells(self, ocr_cells, programmatic_cells):
+    def _filter_ocr_cells(self, ocr_cells, programmatic_cells):
        # Create R-tree index for programmatic cells
        p = index.Property()
        p.dimension = 2
@ -117,6 +119,23 @@ class BaseOcrModel(BasePageModel):
        ]
        return filtered_ocr_cells

+    def post_process_cells(self, ocr_cells, programmatic_cells):
+        r"""
+        Post-process the OCR and programmatic cells and return the final list of cells
+        """
+        if self.options.force_full_page_ocr:
+            # If a full page OCR is forced, use only the OCR cells
+            cells = [
+                Cell(id=c_ocr.id, text=c_ocr.text, bbox=c_ocr.bbox)
+                for c_ocr in ocr_cells
+            ]
+            return cells
+
+        ## Remove OCR cells which overlap with programmatic cells.
+        filtered_ocr_cells = self._filter_ocr_cells(ocr_cells, programmatic_cells)
+        programmatic_cells.extend(filtered_ocr_cells)
+        return programmatic_cells
+
    def draw_ocr_rects_and_cells(self, conv_res, page, ocr_rects, show: bool = False):
        image = copy.deepcopy(page.image)
        draw = ImageDraw.Draw(image, "RGBA")
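The merge rule in `post_process_cells` can be illustrated without the R-tree index. The sketch below uses a hypothetical `Box` dataclass and a brute-force overlap test, and it returns a new list rather than extending `programmatic_cells` in place as the diff does, so read it as an illustration of the rule and not as the Docling implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical stand-in for a text cell with a bounding box."""

    text: str
    l: float
    t: float
    r: float
    b: float


def intersects(a: Box, b: Box) -> bool:
    """True if the two boxes share any area (top-left origin assumed)."""
    return a.l < b.r and b.l < a.r and a.t < b.b and b.t < a.b


def merge_cells(
    ocr: List[Box], programmatic: List[Box], force_full_page_ocr: bool = False
) -> List[Box]:
    """Combine OCR and programmatic cells following the rule in the diff."""
    if force_full_page_ocr:
        # Forced full-page OCR: keep only the OCR cells.
        return list(ocr)
    # Otherwise drop OCR cells that overlap any programmatic cell.
    kept = [o for o in ocr if not any(intersects(o, p) for p in programmatic)]
    return list(programmatic) + kept


programmatic = [Box("Title", 0, 0, 100, 20)]
ocr = [Box("Tit1e", 2, 1, 98, 19), Box("scanned caption", 0, 50, 100, 70)]

default_texts = [c.text for c in merge_cells(ocr, programmatic)]
forced_texts = [c.text for c in merge_cells(ocr, programmatic, True)]
```

By default the programmatic text wins wherever both sources cover the same area; with the flag set, the OCR output replaces it everywhere.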
--- a/docling/models/ds_glm_model.py
+++ b/docling/models/ds_glm_model.py
@ -43,7 +43,8 @@ class GlmModel:
    def __init__(self, options: GlmOptions):
        self.options = options

-        load_pretrained_nlp_models()
+        if self.options.model_names != "":
+            load_pretrained_nlp_models()
        self.model = init_nlp_model(model_names=self.options.model_names)

    def _to_legacy_document(self, conv_res) -> DsDocument:
--- a/docling/models/easyocr_model.py
+++ b/docling/models/easyocr_model.py
@ -2,9 +2,10 @@ import logging
 from typing import Iterable

 import numpy
+import torch
 from docling_core.types.doc import BoundingBox, CoordOrigin

-from docling.datamodel.base_models import OcrCell, Page
+from docling.datamodel.base_models import Cell, OcrCell, Page
 from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import EasyOcrOptions
 from docling.datamodel.settings import settings
@ -32,6 +33,7 @@ class EasyOcrModel(BaseOcrModel):

            self.reader = easyocr.Reader(
                lang_list=self.options.lang,
+                gpu=self.options.use_gpu,
                model_storage_directory=self.options.model_storage_directory,
                download_enabled=self.options.download_enabled,
            )
@ -86,12 +88,8 @@ class EasyOcrModel(BaseOcrModel):
                        ]
                        all_ocr_cells.extend(cells)

-                    ## Remove OCR cells which overlap with programmatic cells.
-                    filtered_ocr_cells = self.filter_ocr_cells(
-                        all_ocr_cells, page.cells
-                    )
-
-                    page.cells.extend(filtered_ocr_cells)
+                    # Post-process the cells
+                    page.cells = self.post_process_cells(all_ocr_cells, page.cells)

                # DEBUG code:
                if settings.debug.visualize_ocr:
--- a/docling/models/tesseract_ocr_cli_model.py
+++ b/docling/models/tesseract_ocr_cli_model.py
@ -7,7 +7,7 @@ from typing import Iterable, Optional, Tuple
 import pandas as pd
 from docling_core.types.doc import BoundingBox, CoordOrigin

-from docling.datamodel.base_models import OcrCell, Page
+from docling.datamodel.base_models import Cell, OcrCell, Page
 from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import TesseractCliOcrOptions
 from docling.datamodel.settings import settings
@ -170,12 +170,8 @@ class TesseractOcrCliModel(BaseOcrModel):
                            )
                            all_ocr_cells.append(cell)

-                    ## Remove OCR cells which overlap with programmatic cells.
-                    filtered_ocr_cells = self.filter_ocr_cells(
-                        all_ocr_cells, page.cells
-                    )
-
-                    page.cells.extend(filtered_ocr_cells)
+                    # Post-process the cells
+                    page.cells = self.post_process_cells(all_ocr_cells, page.cells)

                # DEBUG code:
                if settings.debug.visualize_ocr:
--- a/docling/models/tesseract_ocr_model.py
+++ b/docling/models/tesseract_ocr_model.py
@ -3,7 +3,7 @@ from typing import Iterable

 from docling_core.types.doc import BoundingBox, CoordOrigin

-from docling.datamodel.base_models import OcrCell, Page
+from docling.datamodel.base_models import Cell, OcrCell, Page
 from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import TesseractOcrOptions
 from docling.datamodel.settings import settings
@ -22,25 +22,37 @@ class TesseractOcrModel(BaseOcrModel):
        self.reader = None

        if self.enabled:
-            setup_errmsg = (
+            install_errmsg = (
                "tesserocr is not correctly installed. "
                "Please install it via `pip install tesserocr` to use this OCR engine. "
-                "Note that tesserocr might have to be manually compiled for working with"
+                "Note that tesserocr might have to be manually compiled for working with "
                "your Tesseract installation. The Docling documentation provides examples for it. "
-                "Alternatively, Docling has support for other OCR engines. See the documentation."
+                "Alternatively, Docling has support for other OCR engines. See the documentation: "
+                "https://ds4sd.github.io/docling/installation/"
            )
+            missing_langs_errmsg = (
+                "tesserocr is not correctly configured. No language models have been detected. "
+                "Please ensure that the TESSDATA_PREFIX envvar points to the Tesseract languages dir. "
+                "You can find more information on how to set up other OCR engines in the Docling "
+                "documentation: "
+                "https://ds4sd.github.io/docling/installation/"
+            )
+
            try:
                import tesserocr
            except ImportError:
-                raise ImportError(setup_errmsg)
-
+                raise ImportError(install_errmsg)
            try:
                tesseract_version = tesserocr.tesseract_version()
-                _log.debug("Initializing TesserOCR: %s", tesseract_version)
            except:
-                raise ImportError(setup_errmsg)
+                raise ImportError(install_errmsg)
+
+            _, tesserocr_languages = tesserocr.get_languages()
+            if not tesserocr_languages:
+                raise ImportError(missing_langs_errmsg)

            # Initialize the tesseractAPI
+            _log.debug("Initializing TesserOCR: %s", tesseract_version)
            lang = "+".join(self.options.lang)
            if self.options.path is not None:
                self.reader = tesserocr.PyTessBaseAPI(
@ -128,12 +140,8 @@ class TesseractOcrModel(BaseOcrModel):
                        # del high_res_image
                        all_ocr_cells.extend(cells)

-                    ## Remove OCR cells which overlap with programmatic cells.
-                    filtered_ocr_cells = self.filter_ocr_cells(
-                        all_ocr_cells, page.cells
-                    )
-
-                    page.cells.extend(filtered_ocr_cells)
+                    # Post-process the cells
+                    page.cells = self.post_process_cells(all_ocr_cells, page.cells)

                # DEBUG code:
                if settings.debug.visualize_ocr:
--- a/docs/assets/docling_arch.png
+++ b/docs/assets/docling_arch.png
--- a/docs/assets/docling_arch.pptx
+++ b/docs/assets/docling_arch.pptx
--- a/docs/cli.md
+++ b/docs/cli.md
@ -0,0 +1,9 @@
+# CLI Reference
+
+This page provides documentation for our command line tools.
+
+::: mkdocs-click
+    :module: docling.cli.main
+    :command: click_app
+    :prog_name: docling
+    :style: table
--- a/docs/concepts/architecture.md
+++ b/docs/concepts/architecture.md
@ -0,0 +1,19 @@
+![docling_architecture](../assets/docling_arch.png)
+
+In a nutshell, Docling's architecture is outlined in the diagram above.
+
+For each document format, the *document converter* knows which format-specific *backend* to employ for parsing the document and which *pipeline* to use for orchestrating the execution, along with any relevant *options*.
+
+!!! tip
+
+    While the document converter holds a default mapping, this configuration is parametrizable. For the PDF format, for example, different backends and different pipeline options can be used; see [Usage](../usage.md#adjust-pipeline-features).
+
+The *conversion result* contains the [*Docling document*](./docling_document.md), Docling's fundamental document representation.
+
+Typical scenarios for using a Docling document include directly calling its *export methods*, e.g. to Markdown or a dictionary, or having it chunked by a *chunker*.
+
+For more details on Docling's architecture, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
+
+!!! note
+
+    Components drawn with a dashed outline are base classes that can be subclassed for specialized implementations.
--- a/docs/concepts/index.md
+++ b/docs/concepts/index.md
@ -0,0 +1 @@
+Use the navigation on the left to browse some core Docling concepts.
--- a/docs/examples/custom_convert.py
+++ b/docs/examples/custom_convert.py
@ -80,6 +80,20 @@ def main():
        }
    )

+    # Docling Parse with EasyOCR (CPU only)
+    # ----------------------
+    # pipeline_options = PdfPipelineOptions()
+    # pipeline_options.do_ocr = True
+    # pipeline_options.ocr_options.use_gpu = False  # <-- set this.
+    # pipeline_options.do_table_structure = True
+    # pipeline_options.table_structure_options.do_cell_matching = True
+
+    # doc_converter = DocumentConverter(
+    #     format_options={
+    #         InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+    #     }
+    # )
+
    # Docling Parse with Tesseract
    # ----------------------
    # pipeline_options = PdfPipelineOptions()
--- a/docs/examples/full_page_ocr.py
+++ b/docs/examples/full_page_ocr.py
@ -0,0 +1,42 @@
+from pathlib import Path
+
+from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import (
+    EasyOcrOptions,
+    PdfPipelineOptions,
+    TesseractCliOcrOptions,
+    TesseractOcrOptions,
+)
+from docling.document_converter import DocumentConverter, PdfFormatOption
+
+
+def main():
+    input_doc = Path("./tests/data/2206.01062.pdf")
+
+    pipeline_options = PdfPipelineOptions()
+    pipeline_options.do_ocr = True
+    pipeline_options.do_table_structure = True
+    pipeline_options.table_structure_options.do_cell_matching = True
+
+    # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions
+    # ocr_options = EasyOcrOptions(force_full_page_ocr=True)
+    # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
+    ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
+    pipeline_options.ocr_options = ocr_options
+
+    converter = DocumentConverter(
+        format_options={
+            InputFormat.PDF: PdfFormatOption(
+                pipeline_options=pipeline_options,
+            )
+        }
+    )
+
+    doc = converter.convert(input_doc).document
+    md = doc.export_to_markdown()
+    print(md)
+
+
+if __name__ == "__main__":
+    main()
--- a/docs/examples/hybrid_rag_qdrant.ipynb
+++ b/docs/examples/hybrid_rag_qdrant.ipynb
@ -0,0 +1,288 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/hybrid_rag_qdrant.ipynb\" target=\"_parent\">",
+    "<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Hybrid RAG with Qdrant"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Overview"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This example demonstrates using Docling with [Qdrant](https://qdrant.tech/) to perform a hybrid search across your documents using dense and sparse vectors.\n",
+    "\n",
+    "We'll chunk the documents using Docling before adding them to a Qdrant collection. By limiting the length of the chunks, we can preserve the meaning in each vector embedding."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- 👉 Qdrant client uses [FastEmbed](https://github.com/qdrant/fastembed) to generate vector embeddings. You can install the `fastembed-gpu` package if you've got the hardware to support it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.3.1\u001b[0m\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pip install --no-warn-conflicts -q qdrant-client docling docling-core fastembed"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's import all the classes we'll be working with."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from docling_core.transforms.chunker import HierarchicalChunker\n",
+    "from qdrant_client import QdrantClient\n",
+    "\n",
+    "from docling.datamodel.base_models import InputFormat\n",
+    "from docling.document_converter import DocumentConverter"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- For Docling, we'll set the allowed formats to HTML since we'll only be working with webpages in this tutorial.\n",
+    "- If we set a sparse model, the Qdrant client will fuse the dense and sparse results using RRF. [Reference](https://qdrant.tech/documentation/tutorials/hybrid-search-fastembed/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "c1077c6634d9434584c41cc12f9107c9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "67069c07b73448d491944452159d10bc",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Fetching 29 files:   0%|          | 0/29 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "COLLECTION_NAME = \"docling\"\n",
+    "\n",
+    "doc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML])\n",
+    "client = QdrantClient(location=\":memory:\")\n",
+    "# The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI.\n",
+    "# For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant\n",
+    "# client = QdrantClient(location=\"http://localhost:6333\")\n",
+    "\n",
+    "client.set_model(\"sentence-transformers/all-MiniLM-L6-v2\")\n",
+    "client.set_sparse_model(\"Qdrant/bm25\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can now download and chunk the document using Docling. For demonstration, we'll use an article about chunking strategies :)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result = doc_converter.convert(\n",
+    "    \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\"\n",
+    ")\n",
+    "documents, metadatas = [], []\n",
+    "for chunk in HierarchicalChunker().chunk(result.document):\n",
+    "    documents.append(chunk.text)\n",
+    "    metadatas.append(chunk.meta.export_json_dict())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's now upload the documents to Qdrant.\n",
+    "\n",
+    "- The `add()` method batches the documents and uses FastEmbed to generate vector embeddings on our machine."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['e74ae15be5eb4805858307846318e784',\n",
+       " 'f83f6125b0fa4a0595ae6a0777c9d90d',\n",
+       " '9cf63c7f30764715bf3804a19db36d7d',\n",
+       " '007dbe6d355b4b49af3b736cbd63a4d8',\n",
+       " 'e5e31f21f2e84aa68beca0dfc532cbe9',\n",
+       " '69c10816af204bb28630a1f957d8dd3e',\n",
+       " 'b63546b9b1744063bdb076b234d883ca',\n",
+       " '90ad15ba8fa6494489e1d3221e30bfcf',\n",
+       " '13517debb483452ea40fc7aa04c08c50',\n",
+       " '84ccab5cfab74e27a55acef1c63e3fad',\n",
+       " 'e8aa2ef46d234c5a8a9da64b701d60b4',\n",
+       " '190bea5ba43c45e792197c50898d1d90',\n",
+       " 'a730319ea65645ca81e735ace0bcc72e',\n",
+       " '415e7f6f15864e30b836e23ae8d71b43',\n",
+       " '5569bce4e65541868c762d149c6f491e',\n",
+       " '74d9b234e9c04ebeb8e4e1ca625789ac',\n",
+       " '308b1c5006a94a679f4c8d6f2396993c',\n",
+       " 'aaa5ec6d385a418388e660c425bf1dbe',\n",
+       " '630be8e43e4e4472a9cdb9af9462a43a',\n",
+       " '643b316224de4770a5349bf69cf93471',\n",
+       " 'da9265e6f6c2485493d15223eefdf411',\n",
+       " 'a916e447d52c4084b5ce81a0c5a65b07',\n",
+       " '2883c620858e4e728b88e127155a4f2c',\n",
+       " '2a998f0e9c124af99027060b94027874',\n",
+       " 'be551fbd2b9e42f48ebae0cbf1f481bc',\n",
+       " '95b7f7608e974ca6847097ee4590fba1',\n",
+       " '309db4f3863b4e3aaf16d5f346c309f3',\n",
+       " 'c818383267f64fd68b2237b024bd724e',\n",
+       " '1f16e78338c94238892171b400051cd4',\n",
+       " '25c680c3e064462cab071ea9bf1bad8c',\n",
+       " 'f41ab7e480a248c6bb87019341c7ca74',\n",
+       " 'd440128bed6d4dcb987152b48ecd9a8a',\n",
+       " 'c110d5dfdc5849808851788c2404dd15']"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "client.add(COLLECTION_NAME, documents=documents, metadata=metadatas, batch_size=64)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Query Documents"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<=== Retrieved documents ===>\n",
+      "Document Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.\n",
+      "Document Specific Chunking can handle a variety of document formats, such as:\n",
+      "Consequently, there are also splitters available for this purpose.\n",
+      "1. We start at the top of the document, treating the first part as a chunk.\n",
+      "   2. We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.\n",
+      "    3. We keep this up until we reach the end of the document.\n",
+      "Have you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:\n",
+      "The goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.\n",
+      "To put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.\n",
+      "Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.\n",
+      "You can see here that with a chunk size of 105, the Markdown structure of the document is taken into account, and the chunks thus preserve the semantics of the text!\n",
+      "And there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.\n"
+     ]
+    }
+   ],
+   "source": [
+    "points = client.query(COLLECTION_NAME, query_text=\"Can I split documents?\", limit=10)\n",
+    "\n",
+    "print(\"<=== Retrieved documents ===>\")\n",
+    "for point in points:\n",
+    "    print(point.document)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/docs/examples/index.md
+++ b/docs/examples/index.md
@ -0,0 +1 @@
+Use the navigation on the left to browse through examples covering a range of possible workflows and use cases.
--- a/docs/index.md
+++ b/docs/index.md
@ -2,9 +2,9 @@

 <p align="center">
  <img loading="lazy" alt="Docling" src="assets/docling_processing.png" width="100%" />
+  <a href="https://trendshift.io/repositories/12132" target="_blank"><img src="https://trendshift.io/api/badge/repositories/12132" alt="DS4SD%2Fdocling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>

-
 [![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
 [![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
 ![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
@ -22,7 +22,16 @@ Docling parses documents and exports them to the desired format with ease and sp
 * 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
 * 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
 * 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
-* 📝 Metadata extraction, including title, authors, references & language
-* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
+* 🤖 Easy integration with LlamaIndex 🦙 & LangChain 🦜🔗 for powerful RAG / QA applications
 * 🔍 OCR support for scanned PDFs
 * 💻 Simple and convenient CLI
+
+### Coming soon
+
+* ♾️ Equation & code extraction
+* 📝 Metadata extraction, including title, authors, references & language
+* 🦜🔗 Native LangChain extension
+
+## IBM ❤️ Open Source AI
+
+Docling has been brought to you by IBM.
--- a/docs/integrations/data_prep_kit.md
+++ b/docs/integrations/data_prep_kit.md
@ -0,0 +1,13 @@
+## Get started
+
+Docling is used by the [Data Prep Kit \[↗\]](https://ibm.github.io/data-prep-kit/) open-source toolkit for preparing unstructured data for LLM application development ranging from laptop scale to datacenter scale.
+
+Below you will find the Data Prep Kit modules powered by Docling.
+
+## PDF ingestion to Parquet
+- 💻 [GitHub \[↗\]](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet)
+- 📖 [API docs \[↗\]](https://ibm.github.io/data-prep-kit/transforms/language/pdf2parquet/python/)
+
+## Document chunking
+- 💻 [GitHub \[↗\]](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_chunk)
+- 📖 [API docs \[↗\]](https://ibm.github.io/data-prep-kit/transforms/language/doc_chunk/python/)
--- a/docs/integrations/index.md
+++ b/docs/integrations/index.md
@ -0,0 +1 @@
+Use the navigation on the left to browse through Docling integrations with popular frameworks and tools.
--- a/docs/integrations/llamaindex.md
+++ b/docs/integrations/llamaindex.md
@ -1,6 +1,6 @@
 ## Get started

-Docling is available as an official LlamaIndex extension!
+Docling is available as an official [LlamaIndex \[↗\]](https://docs.llamaindex.ai/) extension.

 To get started, check out the [step-by-step guide in LlamaIndex \[↗\]](https://docs.llamaindex.ai/en/stable/examples/data_connectors/DoclingReaderDemo/)<!--{target="_blank"}-->.

--- a/docs/usage.md
+++ b/docs/usage.md
@ -22,50 +22,7 @@ A simple example would look like this:
 docling https://arxiv.org/pdf/2206.01062
 ```

-To see all available options (export formats etc.) run `docling --help`.
-
-<details>
-  <summary><b>CLI reference</b></summary>
-
-Here are the available options as of this writing (for an up-to-date listing, run `docling --help`):
-
-```console
-$ docling --help
-
- Usage: docling [OPTIONS] source                                                                                             
-                                                                                                                             
-╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
-│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None]         │
-│                                 [required]                                                                                │
-╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
-╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
-│ --from                                     [docx|pptx|html|image|pdf|asciidoc|md]  Specify input formats to convert from. │
-│                                                                                    Defaults to all formats.               │
-│                                                                                    [default: None]                        │
-│ --to                                       [md|json|text|doctags]                  Specify output formats. Defaults to    │
-│                                                                                    Markdown.                              │
-│                                                                                    [default: None]                        │
-│ --ocr               --no-ocr                                                       If enabled, the bitmap content will be │
-│                                                                                    processed using OCR.                   │
-│                                                                                    [default: ocr]                         │
-│ --ocr-engine                               [easyocr|tesseract_cli|tesseract]       The OCR engine to use.                 │
-│                                                                                    [default: easyocr]                     │
-│ --pdf-backend                              [pypdfium2|dlparse_v1|dlparse_v2]       The PDF backend to use.                │
-│                                                                                    [default: dlparse_v1]                  │
-│ --table-mode                               [fast|accurate]                         The mode to use in the table structure │
-│                                                                                    model.                                 │
-│                                                                                    [default: fast]                        │
-│ --abort-on-error    --no-abort-on-error                                            If enabled, the bitmap content will be │
-│                                                                                    processed using OCR.                   │
-│                                                                                    [default: no-abort-on-error]           │
-│ --output                                   PATH                                    Output directory where results are     │
-│                                                                                    saved.                                 │
-│                                                                                    [default: .]                           │
-│ --version                                                                          Show version information.              │
-│ --help                                                                             Show this message and exit.            │
-╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
-```
-</details>
+To see all available options (export formats etc.) run `docling --help`. For more details, see the [CLI reference page](./cli.md).



@ -161,7 +118,7 @@ from docling.datamodel.base_models import DocumentStream
 from docling.document_converter import DocumentConverter

 buf = BytesIO(your_binary_stream)
-source = DocumentStream(filename="my_doc.pdf", stream=buf)
+source = DocumentStream(name="my_doc.pdf", stream=buf)
 converter = DocumentConverter()
 result = converter.convert(source)
 ```
--- a/mkdocs.yml
+++ b/mkdocs.yml
@ -39,7 +39,7 @@ theme:
    - content.code.copy
    - announce.dismiss
    - navigation.tabs
-    # - navigation.indexes  # <= if set, each "section" can have its own page, if index.md is used
+    - navigation.indexes  # <= if set, each "section" can have its own page, if index.md is used
    - navigation.instant
    - navigation.instant.prefetch
    # - navigation.instant.preview
@ -55,11 +55,15 @@ nav:
    - Home: index.md
    - Installation: installation.md
    - Usage: usage.md
+    - CLI: cli.md
    - Docling v2: v2.md
  - Concepts:
+    - Concepts: concepts/index.md
+    - Architecture: concepts/architecture.md
    - Docling Document: concepts/docling_document.md
  #   - Chunking: concepts/chunking.md
  - Examples:
+    - Examples: examples/index.md
    - Conversion:
      - "Simple conversion": examples/minimal.py
      - "Custom conversion": examples/custom_convert.py
@ -69,16 +73,20 @@ nav:
      - "Figure enrichment": examples/develop_picture_enrichment.py
      - "Table export": examples/export_tables.py
      - "Multimodal export": examples/export_multimodal.py
+      - "Force full page OCR": examples/full_page_ocr.py
    - RAG / QA:
      - "RAG with LlamaIndex 🦙": examples/rag_llamaindex.ipynb
      - "RAG with LangChain 🦜🔗": examples/rag_langchain.ipynb
+      - "Hybrid RAG with Qdrant": examples/hybrid_rag_qdrant.ipynb
    # - Chunking:
    #   - Chunking: examples/chunking.md
    # - CLI:
    #   - CLI: examples/cli.md
  - Integrations:
-    - "LlamaIndex 🦙 extension": integrations/llamaindex.md
-    # - "LangChain 🦜🔗 extension": integrations/langchain.md
+    - Integrations: integrations/index.md
+    - "Data Prep Kit": integrations/data_prep_kit.md
+    - "LlamaIndex 🦙": integrations/llamaindex.md
+    # - "LangChain 🦜🔗": integrations/langchain.md
  # - API reference:
  #   - API reference: api_reference/index.md

@ -92,9 +100,16 @@ markdown_extensions:
  - admonition
  - pymdownx.details
  - attr_list
+  - mkdocs-click
 plugins:
  - search
  - mkdocs-jupyter
+  # - mkdocstrings:
+  #     default_handler: python
+  #     options:
+  #       preload_modules:
+  #       - docling
+  #       - docling_core

 extra_css:
  - stylesheets/extra.css
--- a/poetry.lock
+++ b/poetry.lock
@ -2594,6 +2594,21 @@ watchdog = ">=2.0"
 i18n = ["babel (>=2.9.0)"]
 min-versions = ["babel (==2.9.0)", "click (==7.0)", "colorama (==0.4)", "ghp-import (==1.0)", "importlib-metadata (==4.4)", "jinja2 (==2.11.1)", "markdown (==3.3.6)", "markupsafe (==2.0.1)", "mergedeep (==1.3.4)", "mkdocs-get-deps (==0.2.0)", "packaging (==20.5)", "pathspec (==0.11.1)", "pyyaml (==5.1)", "pyyaml-env-tag (==0.1)", "watchdog (==2.0)"]

+[[package]]
+name = "mkdocs-click"
+version = "0.8.1"
+description = "An MkDocs extension to generate documentation for Click command line applications"
+optional = false
+python-versions = ">=3.7"
+files = [
+    {file = "mkdocs_click-0.8.1-py3-none-any.whl", hash = "sha256:a100ff938be63911f86465a1c21d29a669a7c51932b700fdb3daa90d13b61ee4"},
+    {file = "mkdocs_click-0.8.1.tar.gz", hash = "sha256:0a88cce04870c5d70ff63138e2418219c3c4119cc928a59c66b76eb5214edba6"},
+]
+
+[package.dependencies]
+click = ">=8.1"
+markdown = ">=3.3"
+
 [[package]]
 name = "mkdocs-get-deps"
 version = "0.2.0"
@ -7176,4 +7191,4 @@ tesserocr = ["tesserocr"]
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.10"
-content-hash = "95357a52d305fc7dda3da7e397f20d6fe0d4050a90d904c1714536c5a005ea34"
+content-hash = "9a7b0fe34d218e02da79cf62f27f7d2763dcebc92c2e791bc2814cf5d4de8cc2"
--- a/pyproject.toml
+++ b/pyproject.toml
@ -1,6 +1,6 @@
 [tool.poetry]
 name = "docling"
-version = "2.4.0"  # DO NOT EDIT, updated automatically
+version = "2.5.2"  # DO NOT EDIT, updated automatically
 description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
 authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Panos Vagenas <pva@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
 license = "MIT"
@ -71,6 +71,7 @@ nbqa = "^1.9.0"
 [tool.poetry.group.docs.dependencies]
 mkdocs-material = "^9.5.40"
 mkdocs-jupyter = "^0.25.0"
+mkdocs-click = "^0.8.1"

 [tool.poetry.group.examples.dependencies]
 datasets = "^2.21.0"
--- a/tests/data/docx/tablecell.docx
+++ b/tests/data/docx/tablecell.docx
--- a/tests/data/groundtruth/docling_v2/word_sample.docx.itxt
+++ b/tests/data/groundtruth/docling_v2/word_sample.docx.itxt
@ -2,7 +2,7 @@ item-0 at level 0: unspecified: group _root_
  item-1 at level 1: paragraph: Summer activities
  item-2 at level 1: title: Swimming in the lake
    item-3 at level 2: paragraph: Duck
-    item-4 at level 2: paragraph: 
+    item-4 at level 2: picture
    item-5 at level 2: paragraph: Figure 1: This is a cute duckling
    item-6 at level 2: section_header: Let’s swim!
      item-7 at level 3: paragraph: To get started with swimming, fi ...  down in a water and try not to drown:
--- a/tests/data/groundtruth/docling_v2/word_sample.docx.json
+++ b/tests/data/groundtruth/docling_v2/word_sample.docx.json
--- a/tests/data/groundtruth/docling_v2/word_sample.docx.md
+++ b/tests/data/groundtruth/docling_v2/word_sample.docx.md
@ -4,6 +4,8 @@ Summer activities

 Duck

+<!-- image -->
+
 Figure 1: This is a cute duckling

 ## Let’s swim!
--- a/tests/test_e2e_ocr_conversion.py
+++ b/tests/test_e2e_ocr_conversion.py
@ -15,34 +15,8 @@ from docling.document_converter import DocumentConverter, PdfFormatOption

 from .verify_utils import verify_conversion_result_v1, verify_conversion_result_v2

-GENERATE = False
-
-
-# Debug
-def save_output(pdf_path: Path, doc_result: ConversionResult, engine: str):
-    r""" """
-    import json
-    import os
-
-    parent = pdf_path.parent
-    eng = "" if engine is None else f".{engine}"
-
-    dict_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.json")
-    with open(dict_fn, "w") as fd:
-        json.dump(doc_result.legacy_document.export_to_dict(), fd)
-
-    pages_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.pages.json")
-    pages = [p.model_dump() for p in doc_result.pages]
-    with open(pages_fn, "w") as fd:
-        json.dump(pages, fd)
-
-    doctags_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.doctags.txt")
-    with open(doctags_fn, "w") as fd:
-        fd.write(doc_result.legacy_document.export_to_doctags())
-
-    md_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.md")
-    with open(md_fn, "w") as fd:
-        fd.write(doc_result.legacy_document.export_to_markdown())
+GENERATE_V1 = False
+GENERATE_V2 = False


 def get_pdf_paths():
@ -74,13 +48,15 @@ def get_converter(ocr_options: OcrOptions):


 def test_e2e_conversions():
-
    pdf_paths = get_pdf_paths()

    engines: List[OcrOptions] = [
        EasyOcrOptions(),
        TesseractOcrOptions(),
        TesseractCliOcrOptions(),
+        EasyOcrOptions(force_full_page_ocr=True),
+        TesseractOcrOptions(force_full_page_ocr=True),
+        TesseractCliOcrOptions(force_full_page_ocr=True),
    ]

    for ocr_options in engines:
@ -91,20 +67,16 @@ def test_e2e_conversions():

            doc_result: ConversionResult = converter.convert(pdf_path)

-            # Save conversions
-            # save_output(pdf_path, doc_result, None)
-
-            # Debug
            verify_conversion_result_v1(
                input_path=pdf_path,
                doc_result=doc_result,
-                generate=GENERATE,
+                generate=GENERATE_V1,
                fuzzy=True,
            )

            verify_conversion_result_v2(
                input_path=pdf_path,
                doc_result=doc_result,
-                generate=GENERATE,
+                generate=GENERATE_V2,
                fuzzy=True,
            )
--- a/tests/verify_utils.py
+++ b/tests/verify_utils.py
@ -256,15 +256,19 @@ def verify_conversion_result_v1(
    dt_path = gt_subpath.with_suffix(f"{engine_suffix}.doctags.txt")

    if generate:  # only used when re-generating truth
+        pages_path.parent.mkdir(parents=True, exist_ok=True)
        with open(pages_path, "w") as fw:
            fw.write(json.dumps(doc_pred_pages, default=pydantic_encoder))

+        json_path.parent.mkdir(parents=True, exist_ok=True)
        with open(json_path, "w") as fw:
            fw.write(json.dumps(doc_pred, default=pydantic_encoder))

+        md_path.parent.mkdir(parents=True, exist_ok=True)
        with open(md_path, "w") as fw:
            fw.write(doc_pred_md)

+        dt_path.parent.mkdir(parents=True, exist_ok=True)
        with open(dt_path, "w") as fw:
            fw.write(doc_pred_dt)
    else:  # default branch in test
@ -328,15 +332,19 @@ def verify_conversion_result_v2(
    dt_path = gt_subpath.with_suffix(f"{engine_suffix}.doctags.txt")

    if generate:  # only used when re-generating truth
+        pages_path.parent.mkdir(parents=True, exist_ok=True)
        with open(pages_path, "w") as fw:
            fw.write(json.dumps(doc_pred_pages, default=pydantic_encoder))

+        json_path.parent.mkdir(parents=True, exist_ok=True)
        with open(json_path, "w") as fw:
            fw.write(json.dumps(doc_pred, default=pydantic_encoder))

+        md_path.parent.mkdir(parents=True, exist_ok=True)
        with open(md_path, "w") as fw:
            fw.write(doc_pred_md)

+        dt_path.parent.mkdir(parents=True, exist_ok=True)
        with open(dt_path, "w") as fw:
            fw.write(doc_pred_dt)
    else:  # default branch in test
@ -0,0 +1 @@
+Use the navigation on the left to browse some core Docling concepts.