Mirror of https://github.com/DS4SD/docling.git (synced 2025-07-27 04:24:45 +00:00)
Merge remote-tracking branch 'origin/main' into cau/integrate-docling-parse-v2
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
This commit is contained in: commit 8f6347dbb1
.github/workflows/checks.yml (vendored) — 7 changed lines
@@ -9,6 +9,11 @@ jobs:
         python-version: ['3.10', '3.11', '3.12']
     steps:
       - uses: actions/checkout@v3
+      - name: Install tesseract
+        run: sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa libleptonica-dev libtesseract-dev pkg-config
+      - name: Set TESSDATA_PREFIX
+        run: |
+          echo "TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)" >> "$GITHUB_ENV"
       - uses: ./.github/actions/setup-poetry
         with:
           python-version: ${{ matrix.python-version }}
@@ -32,4 +37,4 @@ jobs:
           poetry run python "$file" || exit 1
         done
       - name: Build with poetry
         run: poetry build
CHANGELOG.md

@@ -1,3 +1,9 @@
+## [v1.19.0](https://github.com/DS4SD/docling/releases/tag/v1.19.0) - 2024-10-08
+
+### Feature
+
+* Add options for choosing OCR engines ([#118](https://github.com/DS4SD/docling/issues/118)) ([`f96ea86`](https://github.com/DS4SD/docling/commit/f96ea86a00fd1aafaa57025e46b5288b43958725))
+
 ## [v1.18.0](https://github.com/DS4SD/docling/releases/tag/v1.18.0) - 2024-10-03
 
 ### Feature

README.md — 90 changed lines
@@ -52,6 +52,79 @@ Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectu
 ```
 </details>
 
+<details>
+<summary><b>Alternative OCR engines</b></summary>
+
+Docling supports multiple OCR engines for processing scanned documents. The current version provides
+the following engines.
+
+| Engine | Installation | Usage |
+| ------ | ------------ | ----- |
+| [EasyOCR](https://github.com/JaidedAI/EasyOCR) | Default in Docling or via `pip install easyocr`. | `EasyOcrOptions` |
+| Tesseract | System dependency. See description for Tesseract and Tesserocr below. | `TesseractOcrOptions` |
+| Tesseract CLI | System dependency. See description below. | `TesseractCliOcrOptions` |
+
+The Docling `DocumentConverter` lets you choose the OCR engine with the `ocr_options` setting. For example:
+
+```python
+from docling.datamodel.base_models import ConversionStatus, PipelineOptions
+from docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions
+from docling.document_converter import DocumentConverter
+
+pipeline_options = PipelineOptions()
+pipeline_options.do_ocr = True
+pipeline_options.ocr_options = TesseractOcrOptions()  # Use Tesseract
+
+doc_converter = DocumentConverter(
+    pipeline_options=pipeline_options,
+)
+```
+
+#### Tesseract installation
+
+[Tesseract](https://github.com/tesseract-ocr/tesseract) is a popular OCR engine which is available
+on most operating systems. To use this engine with Docling, Tesseract must be installed on your
+system using the packaging tool of your choice; example commands are provided below.
+After installing Tesseract, you are expected to provide the path to its language files via the
+`TESSDATA_PREFIX` environment variable (note that it must terminate with a slash `/`).
+
+For macOS, we recommend using [Homebrew](https://brew.sh/).
+
+```console
+brew install tesseract leptonica pkg-config
+TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
+echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
+```
+
+For Debian-based systems:
+
+```console
+apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
+TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
+echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
+```
+
+For RHEL systems:
+
+```console
+dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel
+TESSDATA_PREFIX=/usr/share/tesseract/tessdata/
+echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
+```
+
+#### Linking to Tesseract
+
+The most efficient way to use the Tesseract library is via linking; Docling uses
+the [Tesserocr](https://github.com/sirfz/tesserocr) package for this.
+
+If you run into installation issues with Tesserocr, we suggest the following
+installation options:
+
+```console
+pip uninstall tesserocr
+pip install --no-binary :all: tesserocr
+```
+</details>
+
 <details>
 <summary><b>Docling development setup</b></summary>
 
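As a complement to the README snippet above, here is a minimal sketch of selecting EasyOCR with a custom language list. It assumes the same `PipelineOptions`/`DocumentConverter` API shown in this diff; the input file name is purely illustrative.

```python
from docling.datamodel.pipeline_options import EasyOcrOptions, PipelineOptions
from docling.document_converter import DocumentConverter

# Configure OCR: EasyOCR with English and German language packs (illustrative choice).
pipeline_options = PipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "de"])

doc_converter = DocumentConverter(pipeline_options=pipeline_options)

# Convert a single (hypothetical) scanned PDF and export the result to Markdown.
result = doc_converter.convert_single("scanned_doc.pdf")
print(result.output.export_to_markdown())
```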
@@ -216,15 +289,14 @@ from docling_core.transforms.chunker import HierarchicalChunker
 
 doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").output
 chunks = list(HierarchicalChunker().chunk(doc))
-# > [
-# >   ChunkWithMetadata(
-# >     path='$.main-text[0]',
-# >     text='DocLayNet: A Large Human-Annotated Dataset [...]',
-# >     page=1,
-# >     bbox=[107.30, 672.38, 505.19, 709.08]
-# >   ),
-# >   [...]
-# > ]
+print(chunks[0])
+# ChunkWithMetadata(
+#     path='#/main-text/1',
+#     text='DocLayNet: A Large Human-Annotated Dataset [...]',
+#     page=1,
+#     bbox=[107.30, 672.38, 505.19, 709.08],
+#     [...]
+# )
 ```
 
 
@@ -14,7 +14,12 @@ from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
 from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
 from docling.datamodel.base_models import ConversionStatus
 from docling.datamodel.document import ConversionResult, DocumentConversionInput
-from docling.datamodel.pipeline_options import PipelineOptions
+from docling.datamodel.pipeline_options import (
+    EasyOcrOptions,
+    PipelineOptions,
+    TesseractCliOcrOptions,
+    TesseractOcrOptions,
+)
 from docling.document_converter import DocumentConverter
 
 warnings.filterwarnings(action="ignore", category=UserWarning, module="pydantic|torch")
@@ -53,6 +58,13 @@ class Backend(str, Enum):
     DOCLING = "docling"
 
 
+# Define an enum for the OCR engines
+class OcrEngine(str, Enum):
+    EASYOCR = "easyocr"
+    TESSERACT_CLI = "tesseract_cli"
+    TESSERACT = "tesseract"
+
+
 def export_documents(
     conv_results: Iterable[ConversionResult],
     output_dir: Path,
@@ -152,6 +164,9 @@ def convert(
     backend: Annotated[
         Backend, typer.Option(..., help="The PDF backend to use.")
     ] = Backend.DOCLING,
+    ocr_engine: Annotated[
+        OcrEngine, typer.Option(..., help="The OCR engine to use.")
+    ] = OcrEngine.EASYOCR,
     output: Annotated[
         Path, typer.Option(..., help="Output directory where results are saved.")
     ] = Path("."),
@@ -191,8 +206,19 @@ def convert(
         case _:
             raise RuntimeError(f"Unexpected backend type {backend}")
 
+    match ocr_engine:
+        case OcrEngine.EASYOCR:
+            ocr_options = EasyOcrOptions()
+        case OcrEngine.TESSERACT_CLI:
+            ocr_options = TesseractCliOcrOptions()
+        case OcrEngine.TESSERACT:
+            ocr_options = TesseractOcrOptions()
+        case _:
+            raise RuntimeError(f"Unexpected OCR engine type {ocr_engine}")
+
     pipeline_options = PipelineOptions(
         do_ocr=ocr,
+        ocr_options=ocr_options,
         do_table_structure=True,
     )
     pipeline_options.table_structure_options.do_cell_matching = do_cell_matching
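The match block above dispatches the new `--ocr-engine` CLI value to an options object. For illustration only, the same dispatch as a standalone helper (the `get_ocr_options` name is ours, not part of the diff):

```python
from enum import Enum

from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    OcrOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
)


class OcrEngine(str, Enum):
    EASYOCR = "easyocr"
    TESSERACT_CLI = "tesseract_cli"
    TESSERACT = "tesseract"


def get_ocr_options(engine: OcrEngine) -> OcrOptions:
    # Mirror the CLI dispatch: one options class per engine value.
    mapping = {
        OcrEngine.EASYOCR: EasyOcrOptions,
        OcrEngine.TESSERACT_CLI: TesseractCliOcrOptions,
        OcrEngine.TESSERACT: TesseractOcrOptions,
    }
    try:
        return mapping[engine]()
    except KeyError:
        raise RuntimeError(f"Unexpected OCR engine type {engine}")


print(get_ocr_options(OcrEngine.TESSERACT).kind)  # -> "tesserocr"
```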
@@ -110,7 +110,10 @@ class BoundingBox(BaseModel):
         return BoundingBox(l=l, t=t, r=r, b=b, coord_origin=origin)
 
     def area(self) -> float:
-        return (self.r - self.l) * (self.b - self.t)
+        area = (self.r - self.l) * (self.b - self.t)
+        if self.coord_origin == CoordOrigin.BOTTOMLEFT:
+            area = -area
+        return area
 
     def intersection_area_with(self, other: "BoundingBox") -> float:
         # Calculate intersection coordinates
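A quick worked check of the `area()` change above, with illustrative values; `BoundingBox` and `CoordOrigin` come from `docling.datamodel.base_models` as shown in this diff:

```python
from docling.datamodel.base_models import BoundingBox, CoordOrigin

# TOPLEFT origin: y grows downward, so b > t and (b - t) is already positive.
bb_top = BoundingBox(l=0, t=10, r=4, b=16, coord_origin=CoordOrigin.TOPLEFT)
assert bb_top.area() == 24.0

# BOTTOMLEFT origin: y grows upward, so b < t; the new sign flip keeps the area positive.
bb_bottom = BoundingBox(l=0, t=16, r=4, b=10, coord_origin=CoordOrigin.BOTTOMLEFT)
assert bb_bottom.area() == 24.0
```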
@@ -1,6 +1,7 @@
 from enum import Enum, auto
+from typing import List, Literal, Optional, Union
 
-from pydantic import BaseModel
+from pydantic import BaseModel, ConfigDict, Field
 
 
 class TableFormerMode(str, Enum):
@@ -18,8 +19,49 @@ class TableStructureOptions(BaseModel):
     mode: TableFormerMode = TableFormerMode.FAST
 
 
+class OcrOptions(BaseModel):
+    kind: str
+
+
+class EasyOcrOptions(OcrOptions):
+    kind: Literal["easyocr"] = "easyocr"
+    lang: List[str] = ["fr", "de", "es", "en"]
+    use_gpu: bool = True  # same default as easyocr.Reader
+    model_storage_directory: Optional[str] = None
+    download_enabled: bool = True  # same default as easyocr.Reader
+
+    model_config = ConfigDict(
+        extra="forbid",
+        protected_namespaces=(),
+    )
+
+
+class TesseractCliOcrOptions(OcrOptions):
+    kind: Literal["tesseract"] = "tesseract"
+    lang: List[str] = ["fra", "deu", "spa", "eng"]
+    tesseract_cmd: str = "tesseract"
+    path: Optional[str] = None
+
+    model_config = ConfigDict(
+        extra="forbid",
+    )
+
+
+class TesseractOcrOptions(OcrOptions):
+    kind: Literal["tesserocr"] = "tesserocr"
+    lang: List[str] = ["fra", "deu", "spa", "eng"]
+    path: Optional[str] = None
+
+    model_config = ConfigDict(
+        extra="forbid",
+    )
+
+
 class PipelineOptions(BaseModel):
     do_table_structure: bool = True  # True: perform table structure extraction
     do_ocr: bool = True  # True: perform OCR, replace programmatic PDF text
 
     table_structure_options: TableStructureOptions = TableStructureOptions()
+    ocr_options: Union[EasyOcrOptions, TesseractCliOcrOptions, TesseractOcrOptions] = (
+        Field(EasyOcrOptions(), discriminator="kind")
+    )
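The new `ocr_options` field above is a discriminated union keyed on `kind`. A minimal sketch of what that buys, assuming pydantic v2 (which the `ConfigDict`/`Field` usage in this diff suggests):

```python
from docling.datamodel.pipeline_options import PipelineOptions, TesseractCliOcrOptions

# The "kind" discriminator picks the concrete options class during validation.
opts = PipelineOptions.model_validate(
    {"do_ocr": True, "ocr_options": {"kind": "tesseract", "lang": ["eng"]}}
)
assert isinstance(opts.ocr_options, TesseractCliOcrOptions)

# Omitting ocr_options falls back to the EasyOcrOptions() default.
assert PipelineOptions().ocr_options.kind == "easyocr"
```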
@@ -199,9 +199,6 @@ class DocumentConverter:
             end_pb_time = time.time() - start_pb_time
             _log.info(f"Finished converting page batch time={end_pb_time:.3f}")
 
-            # Free up mem resources of PDF backend
-            in_doc._backend.unload()
-
             conv_res.pages = all_assembled_pages
             self._assemble_doc(conv_res)
 
@@ -227,6 +224,11 @@ class DocumentConverter:
                 f"{trace}"
             )
 
+        finally:
+            # Always unload the PDF backend, even in case of failure
+            if in_doc._backend:
+                in_doc._backend.unload()
+
         end_doc_time = time.time() - start_doc_time
         _log.info(
             f"Finished converting document time-pages={end_doc_time:.2f}/{in_doc.page_count}"
@@ -3,21 +3,21 @@ import logging
 from abc import abstractmethod
 from typing import Iterable, List, Tuple
 
-import numpy
 import numpy as np
 from PIL import Image, ImageDraw
 from rtree import index
 from scipy.ndimage import find_objects, label
 
 from docling.datamodel.base_models import BoundingBox, CoordOrigin, OcrCell, Page
+from docling.datamodel.pipeline_options import OcrOptions
 
 _log = logging.getLogger(__name__)
 
 
 class BaseOcrModel:
-    def __init__(self, config):
-        self.config = config
-        self.enabled = config["enabled"]
+    def __init__(self, enabled: bool, options: OcrOptions):
+        self.enabled = enabled
+        self.options = options
 
     # Computes the optimum amount and coordinates of rectangles to OCR on a given page
     def get_ocr_rects(self, page: Page) -> Tuple[bool, List[BoundingBox]]:
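The constructor change above replaces the untyped `config` dict with an explicit `enabled` flag plus a typed options object. A hypothetical sketch (not part of the diff) of how a custom engine would plug into this shape:

```python
from typing import Iterable

from docling.datamodel.base_models import Page
from docling.datamodel.pipeline_options import OcrOptions
from docling.models.base_ocr_model import BaseOcrModel


class MyOcrOptions(OcrOptions):
    # Hypothetical options class; "kind" distinguishes it from the built-in engines.
    kind: str = "my-ocr"


class MyOcrModel(BaseOcrModel):
    def __init__(self, enabled: bool, options: MyOcrOptions):
        # Typed options replace the old config dict.
        super().__init__(enabled=enabled, options=options)

    def __call__(self, page_batch: Iterable[Page]) -> Iterable[Page]:
        # A no-op engine: pass pages through unchanged.
        yield from page_batch
```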
@@ -4,21 +4,33 @@ from typing import Iterable
 import numpy
 
 from docling.datamodel.base_models import BoundingBox, CoordOrigin, OcrCell, Page
+from docling.datamodel.pipeline_options import EasyOcrOptions
 from docling.models.base_ocr_model import BaseOcrModel
 
 _log = logging.getLogger(__name__)
 
 
 class EasyOcrModel(BaseOcrModel):
-    def __init__(self, config):
-        super().__init__(config)
+    def __init__(self, enabled: bool, options: EasyOcrOptions):
+        super().__init__(enabled=enabled, options=options)
+        self.options: EasyOcrOptions
+
         self.scale = 3  # multiplier for 72 dpi == 216 dpi.
 
         if self.enabled:
-            import easyocr
+            try:
+                import easyocr
+            except ImportError:
+                raise ImportError(
+                    "EasyOCR is not installed. Please install it via `pip install easyocr` to use this OCR engine. "
+                    "Alternatively, Docling has support for other OCR engines. See the documentation."
+                )
 
-            self.reader = easyocr.Reader(config["lang"])
+            self.reader = easyocr.Reader(
+                lang_list=self.options.lang,
+                model_storage_directory=self.options.model_storage_directory,
+                download_enabled=self.options.download_enabled,
+            )
 
     def __call__(self, page_batch: Iterable[Page]) -> Iterable[Page]:
@@ -31,6 +43,9 @@ class EasyOcrModel(BaseOcrModel):
 
             all_ocr_cells = []
             for ocr_rect in ocr_rects:
+                # Skip zero area boxes
+                if ocr_rect.area() == 0:
+                    continue
                 high_res_image = page._backend.get_page_image(
                     scale=self.scale, cropbox=ocr_rect
                 )

docling/models/tesseract_ocr_cli_model.py (new file, 167 lines)

@@ -0,0 +1,167 @@
import io
import logging
import tempfile
from subprocess import DEVNULL, PIPE, Popen
from typing import Iterable, Tuple

import pandas as pd

from docling.datamodel.base_models import BoundingBox, CoordOrigin, OcrCell, Page
from docling.datamodel.pipeline_options import TesseractCliOcrOptions
from docling.models.base_ocr_model import BaseOcrModel

_log = logging.getLogger(__name__)


class TesseractOcrCliModel(BaseOcrModel):

    def __init__(self, enabled: bool, options: TesseractCliOcrOptions):
        super().__init__(enabled=enabled, options=options)
        self.options: TesseractCliOcrOptions

        self.scale = 3  # multiplier for 72 dpi == 216 dpi.

        self._name = None
        self._version = None

        if self.enabled:
            try:
                self._get_name_and_version()

            except Exception as exc:
                raise RuntimeError(
                    f"Tesseract is not available, aborting: {exc} "
                    "Install Tesseract on your system and make sure the tesseract binary is discoverable. "
                    "The actual command for Tesseract can be specified in `pipeline_options.ocr_options.tesseract_cmd='tesseract'`. "
                    "Alternatively, Docling has support for other OCR engines. See the documentation."
                )

    def _get_name_and_version(self) -> Tuple[str, str]:

        if self._name != None and self._version != None:
            return self._name, self._version

        cmd = [self.options.tesseract_cmd, "--version"]

        proc = Popen(cmd, stdout=PIPE, stderr=PIPE)
        stdout, stderr = proc.communicate()

        proc.wait()

        # HACK: Windows versions of Tesseract output the version to stdout, Linux versions
        # to stderr, so check both.
        version_line = (
            (stdout.decode("utf8").strip() or stderr.decode("utf8").strip())
            .split("\n")[0]
            .strip()
        )

        # If everything else fails...
        if not version_line:
            version_line = "tesseract XXX"

        name, version = version_line.split(" ")

        self._name = name
        self._version = version

        return name, version

    def _run_tesseract(self, ifilename: str):

        cmd = [self.options.tesseract_cmd]

        if self.options.lang is not None and len(self.options.lang) > 0:
            cmd.append("-l")
            cmd.append("+".join(self.options.lang))
        if self.options.path is not None:
            cmd.append("--tessdata-dir")
            cmd.append(self.options.path)

        cmd += [ifilename, "stdout", "tsv"]
        _log.info("command: {}".format(" ".join(cmd)))

        proc = Popen(cmd, stdout=PIPE, stderr=DEVNULL)
        output, _ = proc.communicate()

        # _log.info(output)

        # Decode the byte string to a regular string
        decoded_data = output.decode("utf-8")
        # _log.info(decoded_data)

        # Read the TSV file generated by Tesseract
        df = pd.read_csv(io.StringIO(decoded_data), sep="\t")

        # Display the dataframe (optional)
        # _log.info("df: ", df.head())

        # Filter rows that contain actual text (ignore header or empty rows)
        df_filtered = df[df["text"].notnull() & (df["text"].str.strip() != "")]

        return df_filtered

    def __call__(self, page_batch: Iterable[Page]) -> Iterable[Page]:

        if not self.enabled:
            yield from page_batch
            return

        for page in page_batch:
            ocr_rects = self.get_ocr_rects(page)

            all_ocr_cells = []
            for ocr_rect in ocr_rects:
                # Skip zero area boxes
                if ocr_rect.area() == 0:
                    continue
                high_res_image = page._backend.get_page_image(
                    scale=self.scale, cropbox=ocr_rect
                )

                with tempfile.NamedTemporaryFile(suffix=".png", mode="w") as image_file:
                    fname = image_file.name
                    high_res_image.save(fname)

                    df = self._run_tesseract(fname)

                # _log.info(df)

                # Print relevant columns (bounding box and text)
                for ix, row in df.iterrows():
                    text = row["text"]
                    conf = row["conf"]

                    l = float(row["left"])
                    b = float(row["top"])
                    w = float(row["width"])
                    h = float(row["height"])

                    t = b + h
                    r = l + w

                    cell = OcrCell(
                        id=ix,
                        text=text,
                        confidence=conf / 100.0,
                        bbox=BoundingBox.from_tuple(
                            coord=(
                                (l / self.scale) + ocr_rect.l,
                                (b / self.scale) + ocr_rect.t,
                                (r / self.scale) + ocr_rect.l,
                                (t / self.scale) + ocr_rect.t,
                            ),
                            origin=CoordOrigin.TOPLEFT,
                        ),
                    )
                    all_ocr_cells.append(cell)

            ## Remove OCR cells which overlap with programmatic cells.
            filtered_ocr_cells = self.filter_ocr_cells(all_ocr_cells, page.cells)

            page.cells.extend(filtered_ocr_cells)

            # DEBUG code:
            # self.draw_ocr_rects_and_cells(page, ocr_rects)

            yield page

docling/models/tesseract_ocr_model.py (new file, 122 lines)

@@ -0,0 +1,122 @@
import logging
from typing import Iterable

import numpy

from docling.datamodel.base_models import BoundingBox, CoordOrigin, OcrCell, Page
from docling.datamodel.pipeline_options import TesseractCliOcrOptions
from docling.models.base_ocr_model import BaseOcrModel

_log = logging.getLogger(__name__)


class TesseractOcrModel(BaseOcrModel):
    def __init__(self, enabled: bool, options: TesseractCliOcrOptions):
        super().__init__(enabled=enabled, options=options)
        self.options: TesseractCliOcrOptions

        self.scale = 3  # multiplier for 72 dpi == 216 dpi.
        self.reader = None

        if self.enabled:
            setup_errmsg = (
                "tesserocr is not correctly installed. "
                "Please install it via `pip install tesserocr` to use this OCR engine. "
                "Note that tesserocr might have to be manually compiled for working with "
                "your Tesseract installation. The Docling documentation provides examples for it. "
                "Alternatively, Docling has support for other OCR engines. See the documentation."
            )
            try:
                import tesserocr
            except ImportError:
                raise ImportError(setup_errmsg)

            try:
                tesseract_version = tesserocr.tesseract_version()
                _log.debug("Initializing TesserOCR: %s", tesseract_version)
            except:
                raise ImportError(setup_errmsg)

            # Initialize the tesseractAPI
            lang = "+".join(self.options.lang)
            if self.options.path is not None:
                self.reader = tesserocr.PyTessBaseAPI(
                    path=self.options.path,
                    lang=lang,
                    psm=tesserocr.PSM.AUTO,
                    init=True,
                    oem=tesserocr.OEM.DEFAULT,
                )
            else:
                self.reader = tesserocr.PyTessBaseAPI(
                    lang=lang,
                    psm=tesserocr.PSM.AUTO,
                    init=True,
                    oem=tesserocr.OEM.DEFAULT,
                )
            self.reader_RIL = tesserocr.RIL

    def __del__(self):
        if self.reader is not None:
            # Finalize the tesseractAPI
            self.reader.End()

    def __call__(self, page_batch: Iterable[Page]) -> Iterable[Page]:

        if not self.enabled:
            yield from page_batch
            return

        for page in page_batch:
            ocr_rects = self.get_ocr_rects(page)

            all_ocr_cells = []
            for ocr_rect in ocr_rects:
                # Skip zero area boxes
                if ocr_rect.area() == 0:
                    continue
                high_res_image = page._backend.get_page_image(
                    scale=self.scale, cropbox=ocr_rect
                )

                # Retrieve text snippets with their bounding boxes
                self.reader.SetImage(high_res_image)
                boxes = self.reader.GetComponentImages(self.reader_RIL.TEXTLINE, True)

                cells = []
                for ix, (im, box, _, _) in enumerate(boxes):
                    # Set the area of interest. Tesseract uses Bottom-Left for the origin
                    self.reader.SetRectangle(box["x"], box["y"], box["w"], box["h"])

                    # Extract text within the bounding box
                    text = self.reader.GetUTF8Text().strip()
                    confidence = self.reader.MeanTextConf()
                    left = box["x"] / self.scale
                    bottom = box["y"] / self.scale
                    right = (box["x"] + box["w"]) / self.scale
                    top = (box["y"] + box["h"]) / self.scale

                    cells.append(
                        OcrCell(
                            id=ix,
                            text=text,
                            confidence=confidence,
                            bbox=BoundingBox.from_tuple(
                                coord=(left, top, right, bottom),
                                origin=CoordOrigin.TOPLEFT,
                            ),
                        )
                    )

                # del high_res_image
                all_ocr_cells.extend(cells)

            ## Remove OCR cells which overlap with programmatic cells.
            filtered_ocr_cells = self.filter_ocr_cells(all_ocr_cells, page.cells)

            page.cells.extend(filtered_ocr_cells)

            # DEBUG code:
            # self.draw_ocr_rects_and_cells(page, ocr_rects)

            yield page
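To illustrate the coordinate handling in `TesseractOcrCliModel._run_tesseract`/`__call__` above: Tesseract's TSV output gives `left/top/width/height` in pixels of the up-scaled page image, which the model converts back to page coordinates by dividing by `scale` and offsetting by the OCR rectangle. A small self-contained sketch of that arithmetic; the TSV row and offsets are invented for illustration:

```python
import io

import pandas as pd

SCALE = 3  # the model renders pages at 3x (72 dpi -> 216 dpi)

# Invented single-row TSV in the shape Tesseract emits with the "tsv" config.
tsv = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t300\t150\t90\t30\t96\tHello\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")

row = df.iloc[0]
l, t, w, h = float(row["left"]), float(row["top"]), float(row["width"]), float(row["height"])

# Scale pixel coordinates back to 72-dpi page units and shift by the OCR rect origin.
ocr_rect_l, ocr_rect_t = 10.0, 20.0  # illustrative offsets of the cropped region
bbox = (
    l / SCALE + ocr_rect_l,
    t / SCALE + ocr_rect_t,
    (l + w) / SCALE + ocr_rect_l,
    (t + h) / SCALE + ocr_rect_t,
)
print(bbox)  # (110.0, 70.0, 140.0, 80.0)
```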
@@ -1,9 +1,17 @@
 from pathlib import Path
 
-from docling.datamodel.pipeline_options import PipelineOptions
+from docling.datamodel.pipeline_options import (
+    EasyOcrOptions,
+    PipelineOptions,
+    TesseractCliOcrOptions,
+    TesseractOcrOptions,
+)
+from docling.models.base_ocr_model import BaseOcrModel
 from docling.models.easyocr_model import EasyOcrModel
 from docling.models.layout_model import LayoutModel
 from docling.models.table_structure_model import TableStructureModel
+from docling.models.tesseract_ocr_cli_model import TesseractOcrCliModel
+from docling.models.tesseract_ocr_model import TesseractOcrModel
 from docling.pipeline.base_model_pipeline import BaseModelPipeline
 
 
@@ -14,19 +22,38 @@ class StandardModelPipeline(BaseModelPipeline):
     def __init__(self, artifacts_path: Path, pipeline_options: PipelineOptions):
         super().__init__(artifacts_path, pipeline_options)
 
+        ocr_model: BaseOcrModel
+        if isinstance(pipeline_options.ocr_options, EasyOcrOptions):
+            ocr_model = EasyOcrModel(
+                enabled=pipeline_options.do_ocr,
+                options=pipeline_options.ocr_options,
+            )
+        elif isinstance(pipeline_options.ocr_options, TesseractCliOcrOptions):
+            ocr_model = TesseractOcrCliModel(
+                enabled=pipeline_options.do_ocr,
+                options=pipeline_options.ocr_options,
+            )
+        elif isinstance(pipeline_options.ocr_options, TesseractOcrOptions):
+            ocr_model = TesseractOcrModel(
+                enabled=pipeline_options.do_ocr,
+                options=pipeline_options.ocr_options,
+            )
+        else:
+            raise RuntimeError(
+                f"The specified OCR kind is not supported: {pipeline_options.ocr_options.kind}."
+            )
+
         self.model_pipe = [
-            EasyOcrModel(
-                config={
-                    "lang": ["fr", "de", "es", "en"],
-                    "enabled": pipeline_options.do_ocr,
-                }
-            ),
+            # OCR
+            ocr_model,
+            # Layout
             LayoutModel(
                 config={
                     "artifacts_path": artifacts_path
                     / StandardModelPipeline._layout_model_path
                 }
             ),
+            # Table structure
             TableStructureModel(
                 config={
                     "artifacts_path": artifacts_path
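The isinstance chain above is the single dispatch point between options classes and OCR models. For illustration only, the same mapping expressed as a type-keyed registry (this is not how the diff implements it; the helper name is ours):

```python
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PipelineOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
)
from docling.models.easyocr_model import EasyOcrModel
from docling.models.tesseract_ocr_cli_model import TesseractOcrCliModel
from docling.models.tesseract_ocr_model import TesseractOcrModel

# Options class -> model class, equivalent to the isinstance chain above.
_OCR_MODEL_REGISTRY = {
    EasyOcrOptions: EasyOcrModel,
    TesseractCliOcrOptions: TesseractOcrCliModel,
    TesseractOcrOptions: TesseractOcrModel,
}


def build_ocr_model(pipeline_options: PipelineOptions):
    model_cls = _OCR_MODEL_REGISTRY.get(type(pipeline_options.ocr_options))
    if model_cls is None:
        raise RuntimeError(
            f"The specified OCR kind is not supported: {pipeline_options.ocr_options.kind}."
        )
    return model_cls(
        enabled=pipeline_options.do_ocr,
        options=pipeline_options.ocr_options,
    )
```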
@@ -8,6 +8,10 @@ from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
 from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
 from docling.datamodel.base_models import ConversionStatus, PipelineOptions
 from docling.datamodel.document import ConversionResult, DocumentConversionInput
+from docling.datamodel.pipeline_options import (
+    TesseractCliOcrOptions,
+    TesseractOcrOptions,
+)
 from docling.document_converter import DocumentConverter
 
 _log = logging.getLogger(__name__)
@@ -71,7 +75,7 @@ def main():
     # and PDF Backends for various configurations.
     # Uncomment one section at the time to see the differences in the output.
 
-    # PyPdfium without OCR
+    # PyPdfium without EasyOCR
     # --------------------
     # pipeline_options = PipelineOptions()
     # pipeline_options.do_ocr=False
@@ -83,7 +87,7 @@ def main():
     #     pdf_backend=PyPdfiumDocumentBackend,
     # )
 
-    # PyPdfium with OCR
+    # PyPdfium with EasyOCR
     # -----------------
     # pipeline_options = PipelineOptions()
     # pipeline_options.do_ocr=True
@@ -95,7 +99,7 @@ def main():
     #     pdf_backend=PyPdfiumDocumentBackend,
     # )
 
-    # Docling Parse without OCR
+    # Docling Parse without EasyOCR
     # -------------------------
     pipeline_options = PipelineOptions()
     pipeline_options.do_ocr = False
@@ -107,7 +111,7 @@ def main():
         pdf_backend=DoclingParseDocumentBackend,
     )
 
-    # Docling Parse with OCR
+    # Docling Parse with EasyOCR
     # ----------------------
     # pipeline_options = PipelineOptions()
     # pipeline_options.do_ocr=True
@@ -119,6 +123,32 @@ def main():
     #     pdf_backend=DoclingParseDocumentBackend,
     # )
 
+    # Docling Parse with Tesseract
+    # ----------------------
+    # pipeline_options = PipelineOptions()
+    # pipeline_options.do_ocr = True
+    # pipeline_options.do_table_structure = True
+    # pipeline_options.table_structure_options.do_cell_matching = True
+    # pipeline_options.ocr_options = TesseractOcrOptions()
+
+    # doc_converter = DocumentConverter(
+    #     pipeline_options=pipeline_options,
+    #     pdf_backend=DoclingParseDocumentBackend,
+    # )
+
+    # Docling Parse with Tesseract CLI
+    # ----------------------
+    # pipeline_options = PipelineOptions()
+    # pipeline_options.do_ocr = True
+    # pipeline_options.do_table_structure = True
+    # pipeline_options.table_structure_options.do_cell_matching = True
+    # pipeline_options.ocr_options = TesseractCliOcrOptions()
+
+    # doc_converter = DocumentConverter(
+    #     pipeline_options=pipeline_options,
+    #     pdf_backend=DoclingParseDocumentBackend,
+    # )
+
     ###########################################################################
 
     # Define input files
examples/rag_llamaindex.ipynb

@@ -1,5 +1,12 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/examples/rag_llamaindex.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -7,6 +14,38 @@
    "# RAG with Docling and 🦙 LlamaIndex"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "## Overview"
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "LlamaIndex extensions `DoclingReader` and `DoclingNodeParser` presented in this notebook seamlessly integrate Docling into LlamaIndex, enabling you to:\n",
+   "- use PDF documents in your LLM applications with ease and speed, and\n",
+   "- leverage Docling's rich format for advanced, document-native grounding."
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "## Setup"
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use GPU-enabled runtime.\n",
+   "- Notebook uses HuggingFace's Inference API; for increased LLM quota, token can be provided via env var `HF_TOKEN`.\n",
+   "- Requirements can be installed as shown below (`--no-warn-conflicts` meant for Colab's pre-populated Python env; feel free to remove for stricter usage):"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": 1,
@@ -21,35 +60,49 @@
    }
   ],
   "source": [
-   "# requirements for this example:\n",
-   "%pip install -qq docling docling-core python-dotenv llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus"
+   "%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-readers-file python-dotenv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
-  "outputs": [
-   {
-    "data": {
-     "text/plain": [
-      "True"
-     ]
-    },
-    "execution_count": 2,
-    "metadata": {},
-    "output_type": "execute_result"
-   }
-  ],
+  "outputs": [],
   "source": [
    "import os\n",
-   "from tempfile import TemporaryDirectory\n",
+   "from pathlib import Path\n",
+   "from tempfile import mkdtemp\n",
+   "from warnings import filterwarnings\n",
    "\n",
    "from dotenv import load_dotenv\n",
-   "from pydantic import TypeAdapter\n",
-   "from rich.pretty import pprint\n",
    "\n",
-   "load_dotenv()"
+   "\n",
+   "def _get_env_from_colab_or_os(key):\n",
+   "    try:\n",
+   "        from google.colab import userdata\n",
+   "\n",
+   "        try:\n",
+   "            return userdata.get(key)\n",
+   "        except userdata.SecretNotFoundError:\n",
+   "            pass\n",
+   "    except ImportError:\n",
+   "        pass\n",
+   "    return os.getenv(key)\n",
+   "\n",
+   "\n",
+   "load_dotenv()\n",
+   "\n",
+   "filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\n",
+   "filterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")\n",
+   "# https://github.com/huggingface/transformers/issues/5486:\n",
+   "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\""
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "We can now define the main parameters:"
   ]
  },
 {
@@ -58,250 +111,61 @@
   "metadata": {},
   "outputs": [],
   "source": [
-   "import warnings\n",
+   "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
+   "from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n",
    "\n",
-   "warnings.filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic|torch\")\n",
-   "warnings.filterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")"
+   "EMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\n",
+   "MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n",
+   "GEN_MODEL = HuggingFaceInferenceAPI(\n",
+   "    token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n",
+   "    model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n",
+   ")\n",
+   "SOURCE = \"https://arxiv.org/pdf/2408.09869\"  # Docling Technical Report\n",
+   "QUERY = \"Which are the main AI models in Docling?\"\n",
+   "\n",
+   "embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "## Setup"
+   "## Using Markdown export"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "### Reader and node parser"
-  ]
- },
- {
-  "cell_type": "markdown",
-  "metadata": {},
-  "source": [
-   "Below we set up:\n",
-   "- a `Reader` which will be used to create LlamaIndex documents, and\n",
-   "- a `NodeParser`, which will be used to create LlamaIndex nodes out of the documents"
+   "To create a simple RAG pipeline, we can:\n",
+   "- define a `DoclingReader`, which by default exports to Markdown, and\n",
+   "- use a standard node parser for these Markdown-based docs, e.g. a `MarkdownNodeParser`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
-  "outputs": [],
-  "source": [
-   "from enum import Enum\n",
-   "from typing import Iterable\n",
-   "\n",
-   "from llama_index.core.readers.base import BasePydanticReader\n",
-   "from llama_index.core.schema import Document as LIDocument\n",
-   "from pydantic import BaseModel\n",
-   "\n",
-   "from docling.document_converter import DocumentConverter\n",
-   "\n",
-   "\n",
-   "class DocumentMetadata(BaseModel):\n",
-   "    dl_doc_hash: str\n",
-   "\n",
-   "\n",
-   "class DoclingPDFReader(BasePydanticReader):\n",
-   "    class ParseType(str, Enum):\n",
-   "        MARKDOWN = \"markdown\"\n",
-   "        # JSON = \"json\"\n",
-   "\n",
-   "    parse_type: ParseType = ParseType.MARKDOWN\n",
-   "\n",
-   "    def lazy_load_data(self, file_path: str | list[str]) -> Iterable[LIDocument]:\n",
-   "        file_paths = file_path if isinstance(file_path, list) else [file_path]\n",
-   "        converter = DocumentConverter()\n",
-   "        for source in file_paths:\n",
-   "            dl_doc = converter.convert_single(source).output\n",
-   "            match self.parse_type:\n",
-   "                case self.ParseType.MARKDOWN:\n",
-   "                    text = dl_doc.export_to_markdown()\n",
-   "                # case self.ParseType.JSON:\n",
-   "                #     text = dl_doc.model_dump_json()\n",
-   "                case _:\n",
-   "                    raise RuntimeError(\n",
-   "                        f\"Unexpected parse type encountered: {self.parse_type}\"\n",
-   "                    )\n",
-   "            excl_metadata_keys = [\"dl_doc_hash\"]\n",
-   "            li_doc = LIDocument(\n",
-   "                doc_id=dl_doc.file_info.document_hash,\n",
-   "                text=text,\n",
-   "                excluded_embed_metadata_keys=excl_metadata_keys,\n",
-   "                excluded_llm_metadata_keys=excl_metadata_keys,\n",
-   "            )\n",
-   "            li_doc.metadata = DocumentMetadata(\n",
-   "                dl_doc_hash=dl_doc.file_info.document_hash,\n",
-   "            ).model_dump()\n",
-   "            yield li_doc"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from llama_index.core.node_parser import MarkdownNodeParser\n",
-    "\n",
-    "reader = DoclingPDFReader(parse_type=DoclingPDFReader.ParseType.MARKDOWN)\n",
-    "node_parser = MarkdownNodeParser()\n",
-    "transformations = [node_parser]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "One can include add more transformations, e.g. further chunking based on text size / overlap, as shown below:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# from llama_index.core.node_parser import TokenTextSplitter\n",
-    "\n",
-    "# splitter = TokenTextSplitter(\n",
-    "#     chunk_size=1024,\n",
-    "#     chunk_overlap=20,\n",
-    "# )\n",
-    "# transformations.append(splitter)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Embed model"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
-    "\n",
-    "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Vector store"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "INGEST = True  # whether to ingest from scratch or reuse an existing vector store"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
   "outputs": [
    {
-    "name": "stderr",
+    "name": "stdout",
     "output_type": "stream",
     "text": [
-     "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
-     "To disable this warning, you can either:\n",
-     "\t- Avoid using `tokenizers` before the fork if possible\n",
-     "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+     "Q: Which are the main AI models in Docling?\n",
+     "A: 1. A layout analysis model, an accurate object-detector for page elements. 2. TableFormer, a state-of-the-art table structure recognition model.\n",
+     "\n",
+     "Sources:\n"
     ]
-   }
-  ],
-  "source": [
-   "from llama_index.vector_stores.milvus import MilvusVectorStore\n",
-   "\n",
-   "MILVUS_URL = os.environ.get(\n",
-   "    \"MILVUS_URL\", f\"{(tmp_dir := TemporaryDirectory()).name}/milvus_demo.db\"\n",
-   ")\n",
-   "MILVUS_COLL_NAME = os.environ.get(\"MILVUS_COLL_NAME\", \"basic_llamaindex_pipeline\")\n",
-   "MILVUS_KWARGS = TypeAdapter(dict).validate_json(os.environ.get(\"MILVUS_KWARGS\", \"{}\"))\n",
-   "vector_store = MilvusVectorStore(\n",
-   "    uri=MILVUS_URL,\n",
-   "    collection_name=MILVUS_COLL_NAME,\n",
-   "    dim=len(embed_model.get_text_embedding(\"hi\")),\n",
-   "    overwrite=INGEST,\n",
-   "    **MILVUS_KWARGS,\n",
-   ")"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": 10,
-  "metadata": {},
-  "outputs": [
-   {
-    "data": {
-     ... (removed output: Jupyter "Fetching 7 files" download-progress widget) ...
-    },
-    "metadata": {},
-    "output_type": "display_data"
    },
    {
     "data": {
-     "text/html": [
-      ... (removed output: HTML-rendered rich preview of the ingested LlamaIndex Document) ...
-     ],
     "text/plain": [
-      ... (removed output: ANSI-colored text rendering of the same Document preview) ...
+     "[('3.2 AI models\\n\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n",
+     " {'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',\n",
+     "  'Header_2': '3.2 AI models'}),\n",
+     " (\"5 Applications\\n\\nThanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.\",\n",
+     " {'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',\n",
+     "  'Header_2': '5 Applications'})]"
     ]
    },
    "metadata": {},
@ -310,131 +174,83 @@
(previous source of the cell above:)
from llama_index.core import StorageContext, VectorStoreIndex

if INGEST:
    # in this case we ingest the data into the vector store
    docs = reader.load_data(
        file_path="https://arxiv.org/pdf/2206.01062",  # DocLayNet paper
    )
    pprint(docs, max_length=1, max_string=50, max_depth=4)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents=docs,
        embed_model=embed_model,
        storage_context=storage_context,
        transformations=transformations,
    )
else:
    # in this case we just load the vector store index
    index = VectorStoreIndex.from_vector_store(
        vector_store=vector_store,
        embed_model=embed_model,
    )

(new source of the cell above:)
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.readers.docling import DoclingReader
from llama_index.vector_stores.milvus import MilvusVectorStore

reader = DoclingReader()
node_parser = MarkdownNodeParser()

vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])
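As an aside (an illustrative note, not part of the diff): the `MarkdownNodeParser` used above splits the exported Markdown along its headings, which is why each retrieved source in the output above carries the originating section under a 'Header_2' key next to the document hash. The node metadata has roughly this shape:

# Metadata observed on a retrieved chunk in the output above (values copied from it):
node_metadata = {
    "dl_doc_hash": "556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa",
    "Header_2": "3.2 AI models",  # the Markdown heading the chunk was split under
}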
(markdown cell; previous source "### LLM" replaced by:)
## Using Docling format

(removed code cell, execution_count 11:)
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

HF_API_KEY = os.environ.get("HF_API_KEY")

llm = HuggingFaceInferenceAPI(
    token=HF_API_KEY,
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
)

(markdown cell; previous source "## RAG" replaced by:)
To leverage Docling's rich native format, we:

- create a `DoclingReader` with JSON export type, and
- employ a `DoclingNodeParser` in order to appropriately parse that Docling format.

Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):
(code cell; execution_count changed from 12 to 5)

(added stream output:)
Q: Which are the main AI models in Docling?
A: The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table structure recognition model.

Sources:

(removed display output: Rich pprint of the query Response)
Response(
    response='80863 pages were human annotated.',
    source_nodes=[
        NodeWithScore(
            node=TextNode(
                id_='8874a117-d181-4f4f-a30b-0b5604370d77',
                embedding=None,
                metadata={...},
                excluded_embed_metadata_keys=[...],
                excluded_llm_metadata_keys=[...],
                relationships={...},
                text='3 THE DOCLAYNET DATASET\n\nDocLayNet contains 80863 PDF pages. Among these, 7059 carry two instances of human annotations, and 1591 carry three. This amounts to 91104 total annotation instances. The annotations provide layout information in the shape o'+5775,
                mimetype='text/plain',
                start_char_idx=9089,
                end_char_idx=15114,
                text_template='{metadata_str}\n\n{content}',
                metadata_template='{key}: {value}',
                metadata_seperator='\n'
            ),
            score=0.7367570400238037
        ),
        ... +1
    ],
    metadata={
        '8874a117-d181-4f4f-a30b-0b5604370d77': {
            'dl_doc_hash': '5dfbd8c115a15fd3396b68409124cfee29fc8efac7b5c846634ff924e635e0dc',
            ... +1
        },
        ... +1
    }
)

(added display output: plain listing of the retrieved sources, now with document-level grounding)
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',
  {'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',
   'path': '#/main-text/37',
   'heading': '3.2 AI models',
   'page': 3,
   'bbox': [107.36903381347656,
    330.07513427734375,
    506.29705810546875,
    407.3725280761719]}),
 ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',
  {'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',
   'path': '#/main-text/10',
   'heading': '1 Introduction',
   'page': 1,
   'bbox': [107.33261108398438,
    83.3067626953125,
    504.0033874511719,
    136.45367431640625]})]
@ -442,9 +258,148 @@
(previous source of the cell above:)
query_engine = index.as_query_engine(llm=llm)
query_res = query_engine.query("How many pages were human annotated?")
pprint(query_res, max_length=1, max_string=250, max_depth=4)

(new source of the cell above:)
from llama_index.node_parser.docling import DoclingNodeParser

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
node_parser = DoclingNodeParser()

vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

(added markdown cell:)
## With Simple Directory Reader

(added markdown cell:)
To demonstrate this usage pattern, we first set up a test document directory.

(added code cell, execution_count 6:)
from pathlib import Path
from tempfile import mkdtemp

import requests

tmp_dir_path = Path(mkdtemp())
r = requests.get(SOURCE)
with open(tmp_dir_path / f"{Path(SOURCE).name}.pdf", "wb") as out_file:
    out_file.write(r.content)

(added markdown cell:)
Using the `reader` and `node_parser` definitions from any of the above variants, usage with `SimpleDirectoryReader` then looks as follows:

(added code cell, execution_count 7; stderr output:)
Loading files: 100%|██████████| 1/1 [00:11<00:00, 11.15s/file]

(stdout output:)
Q: Which are the main AI models in Docling?
A: The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table structure recognition model.

Sources:

(display output: retrieved sources, now also carrying the file metadata added by SimpleDirectoryReader)
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',
  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp4vsev3_r/2408.09869.pdf',
   'file_name': '2408.09869.pdf',
   'file_type': 'application/pdf',
   'file_size': 5566574,
   'creation_date': '2024-10-09',
   'last_modified_date': '2024-10-09',
   'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',
   'path': '#/main-text/37',
   'heading': '3.2 AI models',
   'page': 3,
   'bbox': [107.36903381347656,
    330.07513427734375,
    506.29705810546875,
    407.3725280761719]}),
 ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',
  {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp4vsev3_r/2408.09869.pdf',
   'file_name': '2408.09869.pdf',
   'file_type': 'application/pdf',
   'file_size': 5566574,
   'creation_date': '2024-10-09',
   'last_modified_date': '2024-10-09',
   'dl_doc_hash': '556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa',
   'path': '#/main-text/10',
   'heading': '1 Introduction',
   'page': 1,
   'bbox': [107.33261108398438,
    83.3067626953125,
    504.0033874511719,
    136.45367431640625]})]

(source of that cell:)
from llama_index.core import SimpleDirectoryReader

dir_reader = SimpleDirectoryReader(
    input_dir=tmp_dir_path,
    file_extractor={".pdf": reader},
)

vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=dir_reader.load_data(SOURCE),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])
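For orientation (an illustrative summary, not part of the commit): the JSON/DoclingNodeParser path is what attaches the document-native grounding shown above to every retrieved chunk, and the SimpleDirectoryReader variant layers standard file metadata on top of it. Roughly:

# Values copied from the outputs above; a sketch of the metadata layout only, not code from the commit.
docling_grounding = {
    "dl_doc_hash": "556ad9e23b6d2245e36b3208758cf0c8a709382bb4c859eacfe8e73b14e635aa",
    "path": "#/main-text/37",    # position of the chunk in the Docling document tree
    "heading": "3.2 AI models",  # section heading the chunk belongs to
    "page": 3,                   # page number in the source PDF
    "bbox": [107.36903381347656, 330.07513427734375, 506.29705810546875, 407.3725280761719],
}
# The SimpleDirectoryReader variant additionally carries file-level fields:
file_level_keys = ["file_path", "file_name", "file_type", "file_size", "creation_date", "last_modified_date"]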
1237  poetry.lock  generated
File diff suppressed because it is too large
pyproject.toml
@ -1,6 +1,6 @@
[tool.poetry]
name = "docling"
version = "1.18.0" # DO NOT EDIT, updated automatically
version = "1.19.0" # DO NOT EDIT, updated automatically
description = "Docling PDF conversion package"
authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
license = "MIT"
@ -37,7 +37,7 @@ torchvision = [
######################
python = "^3.10"
pydantic = "^2.0.0"
docling-core = "^1.6.2"
docling-core = "^1.7.1"
docling-ibm-models = "^2.0.0"
deepsearch-glm = "^0.22.0"
filetype = "^1.2.0"
@ -46,13 +46,14 @@ pydantic-settings = "^2.3.0"
huggingface_hub = ">=0.23,<1"
requests = "^2.32.3"
easyocr = "^1.7"
tesserocr = { version = "^2.7.1", optional = true }
docling-parse = {git = "ssh://git@github.com/DS4SD/docling-parse.git", rev = "5cbb4e48e6ff2a8596036a86096584156fdd4254"}
certifi = ">=2024.7.4"
rtree = "^1.3.0"
scipy = "^1.14.1"
pyarrow = "^16.1.0"
typer = "^0.12.5"
pandas = "^2.1.4"

[tool.poetry.group.dev.dependencies]
black = {extras = ["jupyter"], version = "^24.4.2"}
@ -67,7 +68,7 @@ pytest-xdist = "^3.3.1"
types-requests = "^2.31.0.2"
flake8-pyproject = "^1.2.3"
pylint = "^2.17.5"
pandas-stubs = "^2.2.2.240909"
pandas-stubs = "^2.1.4.231227"
ipykernel = "^6.29.5"
ipywidgets = "^8.1.5"
nbqa = "^1.9.0"
@ -75,6 +76,9 @@ nbqa = "^1.9.0"
[tool.poetry.group.examples.dependencies]
datasets = "^2.21.0"
python-dotenv = "^1.0.1"
llama-index-readers-docling = "^0.1.0"
llama-index-node-parser-docling = "^0.1.0"
llama-index-readers-file = "^0.2.2"
llama-index-embeddings-huggingface = "^0.3.1"
llama-index-llms-huggingface-api = "^0.2.0"
llama-index-vector-stores-milvus = "^0.2.1"
@ -82,6 +86,9 @@ langchain-huggingface = "^0.0.3"
langchain-milvus = "^0.1.4"
langchain-text-splitters = "^0.2.4"

[tool.poetry.extras]
tesserocr = ["tesserocr"]

[tool.poetry.scripts]
docling = "docling.cli.main:app"
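A brief usage note (an assumption about standard Poetry/pip behaviour, not stated in the diff): with the new optional `tesserocr` dependency and the matching `[tool.poetry.extras]` entry, the Tesseract Python bindings are only pulled in on request, e.g. via `poetry install --extras tesserocr` or `pip install "docling[tesserocr]"`.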
3  tests/data_scanned/ocr_test.doctags.txt  Normal file
@ -0,0 +1,3 @@
<document>
<paragraph><location><page_1><loc_12><loc_82><loc_86><loc_91></location>Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package</paragraph>
</document>
1  tests/data_scanned/ocr_test.json  Normal file
@ -0,0 +1 @@
{"_name": "", "type": "pdf-document", "description": {"logs": []}, "file-info": {"filename": "ocr_test_8.pdf", "document-hash": "73f23122e9edbdb0a115b448e03c8064a0ea8bdc21d02917ce220cf032454f31", "#-pages": 1, "page-hashes": [{"hash": "8c5c5b766c1bdb92242142ca37260089b02380f9c57729703350f646cdf4771e", "model": "default", "page": 1}]}, "main-text": [{"prov": [{"bbox": [69.0, 688.58837890625, 509.4446716308594, 767.422119140625], "page": 1, "span": [0, 94]}], "text": "Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package", "type": "paragraph", "name": "Text"}], "figures": [], "tables": [], "equations": [], "footnotes": [], "page-dimensions": [{"height": 841.9216918945312, "page": 1, "width": 595.201171875}], "page-footers": [], "page-headers": []}
1  tests/data_scanned/ocr_test.md  Normal file
@ -0,0 +1 @@
Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package
1  tests/data_scanned/ocr_test.pages.json  Normal file
@ -0,0 +1 @@
[{"page_no": 0, "page_hash": "8c5c5b766c1bdb92242142ca37260089b02380f9c57729703350f646cdf4771e", "size": {"width": 595.201171875, "height": 841.9216918945312}, "cells": [{"id": 0, "text": "Docling bundles PDF document conversion to", "bbox": {"l": 71.33333333333333, "t": 74.66666666666663, "r": 506.6666666666667, "b": 99.33333333333337, "coord_origin": "1"}}, {"id": 1, "text": "JSON and Markdown in an easy self contained", "bbox": {"l": 69.0, "t": 100.66666666666663, "r": 506.6666666666667, "b": 126.66666666666663, "coord_origin": "1"}}, {"id": 2, "text": "package", "bbox": {"l": 70.66666666666667, "t": 128.66666666666663, "r": 154.0, "b": 153.33333333333337, "coord_origin": "1"}}], "predictions": {"layout": {"clusters": [{"id": 0, "label": "Text", "bbox": {"l": 69.0, "t": 74.49958801269531, "r": 509.4446716308594, "b": 153.33333333333337, "coord_origin": "1"}, "confidence": 0.923837423324585, "cells": [{"id": 0, "text": "Docling bundles PDF document conversion to", "bbox": {"l": 71.33333333333333, "t": 74.66666666666663, "r": 506.6666666666667, "b": 99.33333333333337, "coord_origin": "1"}}, {"id": 1, "text": "JSON and Markdown in an easy self contained", "bbox": {"l": 69.0, "t": 100.66666666666663, "r": 506.6666666666667, "b": 126.66666666666663, "coord_origin": "1"}}, {"id": 2, "text": "package", "bbox": {"l": 70.66666666666667, "t": 128.66666666666663, "r": 154.0, "b": 153.33333333333337, "coord_origin": "1"}}]}]}, "tablestructure": {"table_map": {}}, "figures_classification": null, "equations_prediction": null}, "assembled": {"elements": [{"label": "Text", "id": 0, "page_no": 0, "cluster": {"id": 0, "label": "Text", "bbox": {"l": 69.0, "t": 74.49958801269531, "r": 509.4446716308594, "b": 153.33333333333337, "coord_origin": "1"}, "confidence": 0.923837423324585, "cells": [{"id": 0, "text": "Docling bundles PDF document conversion to", "bbox": {"l": 71.33333333333333, "t": 74.66666666666663, "r": 506.6666666666667, "b": 99.33333333333337, "coord_origin": "1"}}, {"id": 1, "text": "JSON and Markdown in an easy self contained", "bbox": {"l": 69.0, "t": 100.66666666666663, "r": 506.6666666666667, "b": 126.66666666666663, "coord_origin": "1"}}, {"id": 2, "text": "package", "bbox": {"l": 70.66666666666667, "t": 128.66666666666663, "r": 154.0, "b": 153.33333333333337, "coord_origin": "1"}}]}, "text": "Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package"}], "body": [{"label": "Text", "id": 0, "page_no": 0, "cluster": {"id": 0, "label": "Text", "bbox": {"l": 69.0, "t": 74.49958801269531, "r": 509.4446716308594, "b": 153.33333333333337, "coord_origin": "1"}, "confidence": 0.923837423324585, "cells": [{"id": 0, "text": "Docling bundles PDF document conversion to", "bbox": {"l": 71.33333333333333, "t": 74.66666666666663, "r": 506.6666666666667, "b": 99.33333333333337, "coord_origin": "1"}}, {"id": 1, "text": "JSON and Markdown in an easy self contained", "bbox": {"l": 69.0, "t": 100.66666666666663, "r": 506.6666666666667, "b": 126.66666666666663, "coord_origin": "1"}}, {"id": 2, "text": "package", "bbox": {"l": 70.66666666666667, "t": 128.66666666666663, "r": 154.0, "b": 153.33333333333337, "coord_origin": "1"}}]}, "text": "Docling bundles PDF document conversion to JSON and Markdown in an easy self contained package"}], "headers": []}}]
BIN  tests/data_scanned/ocr_test.pdf  Normal file
Binary file not shown.
98  tests/test_e2e_ocr_conversion.py  Normal file
@ -0,0 +1,98 @@
from pathlib import Path
from typing import List

from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    OcrOptions,
    PipelineOptions,
    TesseractCliOcrOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter

from .verify_utils import verify_conversion_result

GENERATE = False


# Debug
def save_output(pdf_path: Path, doc_result: ConversionResult, engine: str):
    r""" """
    import json
    import os

    parent = pdf_path.parent
    eng = "" if engine is None else f".{engine}"

    dict_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.json")
    with open(dict_fn, "w") as fd:
        json.dump(doc_result.render_as_dict(), fd)

    pages_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.pages.json")
    pages = [p.model_dump() for p in doc_result.pages]
    with open(pages_fn, "w") as fd:
        json.dump(pages, fd)

    doctags_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.doctags.txt")
    with open(doctags_fn, "w") as fd:
        fd.write(doc_result.render_as_doctags())

    md_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.md")
    with open(md_fn, "w") as fd:
        fd.write(doc_result.render_as_markdown())


def get_pdf_paths():
    # Define the directory you want to search
    directory = Path("./tests/data_scanned")

    # List all PDF files in the directory and its subdirectories
    pdf_files = sorted(directory.rglob("*.pdf"))
    return pdf_files


def get_converter(ocr_options: OcrOptions):
    pipeline_options = PipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options = ocr_options

    converter = DocumentConverter(
        pipeline_options=pipeline_options,
        pdf_backend=DoclingParseDocumentBackend,
    )

    return converter


def test_e2e_conversions():

    pdf_paths = get_pdf_paths()

    engines: List[OcrOptions] = [
        EasyOcrOptions(),
        TesseractOcrOptions(),
        TesseractCliOcrOptions(),
    ]

    for ocr_options in engines:
        print(f"Converting with ocr_engine: {ocr_options.kind}")
        converter = get_converter(ocr_options=ocr_options)
        for pdf_path in pdf_paths:
            print(f"converting {pdf_path}")

            doc_result: ConversionResult = converter.convert_single(pdf_path)

            # Save conversions
            # save_output(pdf_path, doc_result, None)

            # Debug
            verify_conversion_result(
                input_path=pdf_path,
                doc_result=doc_result,
                generate=GENERATE,
                fuzzy=True,
            )
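Usage note (assuming the repository's usual pytest workflow, not stated in the diff): this end-to-end suite would typically be invoked with something like `poetry run pytest tests/test_e2e_ocr_conversion.py`; setting `GENERATE = True` is the hook for re-generating the ground-truth files under `tests/data_scanned` through the `generate` flag of `verify_conversion_result`.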
tests/verify_utils.py
@ -11,6 +11,42 @@ from docling.datamodel.base_models import ConversionStatus, Page
from docling.datamodel.document import ConversionResult


def levenshtein(str1: str, str2: str) -> int:

    # Ensure str1 is the shorter string to optimize memory usage
    if len(str1) > len(str2):
        str1, str2 = str2, str1

    # Previous and current row buffers
    previous_row = list(range(len(str2) + 1))
    current_row = [0] * (len(str2) + 1)

    # Compute the Levenshtein distance row by row
    for i, c1 in enumerate(str1, start=1):
        current_row[0] = i
        for j, c2 in enumerate(str2, start=1):
            insertions = previous_row[j] + 1
            deletions = current_row[j - 1] + 1
            substitutions = previous_row[j - 1] + (c1 != c2)
            current_row[j] = min(insertions, deletions, substitutions)
        # Swap rows for the next iteration
        previous_row, current_row = current_row, previous_row

    # The result is in the last element of the previous row
    return previous_row[-1]


def verify_text(gt: str, pred: str, fuzzy: bool, fuzzy_threshold: float = 0.4):

    if len(gt) == 0 or not fuzzy:
        assert gt == pred, f"{gt}!={pred}"
    else:
        dist = levenshtein(gt, pred)
        diff = dist / len(gt)
        assert diff < fuzzy_threshold, f"{gt}!~{pred}"
    return True

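A quick sanity check of these helpers (illustrative only, not part of this commit; it assumes the two functions defined just above):

# Classic textbook example: "kitten" -> "sitting" needs 3 edits.
assert levenshtein("kitten", "sitting") == 3
gt = "Docling bundles PDF document conversion"
pred = "Docling bundels PDF document conversion"  # a small OCR-style slip
assert verify_text(gt, pred, fuzzy=True)   # edit distance 2 over 39 characters, well under the 0.4 threshold
assert verify_text(gt, gt, fuzzy=False)    # exact equality is required when fuzzy is off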
def verify_cells(doc_pred_pages: List[Page], doc_true_pages: List[Page]):

    assert len(doc_pred_pages) == len(
@ -32,7 +68,6 @@ def verify_cells(doc_pred_pages: List[Page], doc_true_pages: List[Page]):

            true_text = cell_true_item.text
            pred_text = cell_pred_item.text

            assert true_text == pred_text, f"{true_text}!={pred_text}"

            true_bbox = cell_true_item.bbox.as_tuple()
@ -69,7 +104,7 @@ def verify_maintext(doc_pred: DsDocument, doc_true: DsDocument):
    return True


def verify_tables(doc_pred: DsDocument, doc_true: DsDocument):
def verify_tables(doc_pred: DsDocument, doc_true: DsDocument, fuzzy: bool):
    if doc_true.tables is None:
        # No tables to check
        assert doc_pred.tables is None, "not expecting any table on this document"
@ -102,9 +137,7 @@ def verify_tables(doc_pred: DsDocument, doc_true: DsDocument):
                # print("pred: ", pred_item.data[i][j].text)
                # print("")

                assert (
                    true_item.data[i][j].text == pred_item.data[i][j].text
                ), "table-cell does not have the same text"
                verify_text(true_item.data[i][j].text, pred_item.data[i][j].text, fuzzy)

                assert (
                    true_item.data[i][j].obj_type == pred_item.data[i][j].obj_type
@ -121,16 +154,20 @@ def verify_output(doc_pred: DsDocument, doc_true: DsDocument):
    return True


def verify_md(doc_pred_md, doc_true_md):
    return doc_pred_md == doc_true_md
def verify_md(doc_pred_md: str, doc_true_md: str, fuzzy: bool):
    return verify_text(doc_true_md, doc_pred_md, fuzzy)


def verify_dt(doc_pred_dt, doc_true_dt):
    return doc_pred_dt == doc_true_dt
def verify_dt(doc_pred_dt: str, doc_true_dt: str, fuzzy: bool):
    return verify_text(doc_true_dt, doc_pred_dt, fuzzy)


def verify_conversion_result(
    input_path: Path, doc_result: ConversionResult, generate=False
    input_path: Path,
    doc_result: ConversionResult,
    generate: bool = False,
    ocr_engine: str = None,
    fuzzy: bool = False,
):
    PageList = TypeAdapter(List[Page])

@ -143,10 +180,11 @@ def verify_conversion_result(
    doc_pred_md = doc_result.render_as_markdown()
    doc_pred_dt = doc_result.render_as_doctags()

    pages_path = input_path.with_suffix(".pages.json")
    json_path = input_path.with_suffix(".json")
    md_path = input_path.with_suffix(".md")
    dt_path = input_path.with_suffix(".doctags.txt")
    engine_suffix = "" if ocr_engine is None else f".{ocr_engine}"
    pages_path = input_path.with_suffix(f"{engine_suffix}.pages.json")
    json_path = input_path.with_suffix(f"{engine_suffix}.json")
    md_path = input_path.with_suffix(f"{engine_suffix}.md")
    dt_path = input_path.with_suffix(f"{engine_suffix}.doctags.txt")

    if generate:  # only used when re-generating truth
        with open(pages_path, "w") as fw:
@ -173,22 +211,23 @@ def verify_conversion_result(
    with open(dt_path, "r") as fr:
        doc_true_dt = fr.read()

    assert verify_cells(
        doc_pred_pages, doc_true_pages
    ), f"Mismatch in PDF cell prediction for {input_path}"
    if not fuzzy:
        assert verify_cells(
            doc_pred_pages, doc_true_pages
        ), f"Mismatch in PDF cell prediction for {input_path}"

    # assert verify_output(
    #     doc_pred, doc_true
    # ), f"Mismatch in JSON prediction for {input_path}"

    assert verify_tables(
        doc_pred, doc_true
        doc_pred, doc_true, fuzzy
    ), f"verify_tables(doc_pred, doc_true) mismatch for {input_path}"

    assert verify_md(
        doc_pred_md, doc_true_md
        doc_pred_md, doc_true_md, fuzzy
    ), f"Mismatch in Markdown prediction for {input_path}"

    assert verify_dt(
        doc_pred_dt, doc_true_dt
        doc_pred_dt, doc_true_dt, fuzzy
    ), f"Mismatch in DocTags prediction for {input_path}"
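To make the new naming scheme concrete, here is a small illustrative check (not part of the commit) of how the ground-truth paths line up with the files added under tests/data_scanned:

# Illustrative only: with ocr_engine=None (as in test_e2e_conversions), engine_suffix is ""
from pathlib import Path

input_path = Path("tests/data_scanned/ocr_test.pdf")
engine_suffix = ""
assert input_path.with_suffix(f"{engine_suffix}.json").name == "ocr_test.json"
assert input_path.with_suffix(f"{engine_suffix}.pages.json").name == "ocr_test.pages.json"
assert input_path.with_suffix(f"{engine_suffix}.md").name == "ocr_test.md"
assert input_path.with_suffix(f"{engine_suffix}.doctags.txt").name == "ocr_test.doctags.txt"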