docs: update opensearch notebook and backend documentation (#2519)

* docs(opensearch): update the example notebook RAG with OpenSearch

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(uspto): remove direct usage of the backend class for conversion

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: remove direct usage of backends from documentation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Cesar Berrospi Ramis
2025-10-27 10:02:50 +01:00
committed by GitHub
parent 10c1f06b74
commit 9a6fdf936b
3 changed files with 536 additions and 307 deletions


@@ -431,130 +431,6 @@
"print(f\"Fetched and exported {doc_num} documents.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the backend converter (optional)\n",
"\n",
"- The custom backend converters `PubMedDocumentBackend` and `PatentUsptoDocumentBackend` aim at handling the parsing of PMC articles and USPTO patents, respectively.\n",
"- As any other backends, you can leverage the function `is_valid()` to check if the input document is supported by the this backend.\n",
"- Note that some XML sections in the original USPTO zip file may not represent patents, like sequence listings, and therefore they will show as invalid by the backend."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True\n",
"Document ipg241217-1.xml is a valid patent? True\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "316241ca89a843bda3170f2a5c76c639",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/4014 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 3928 patents out of 4014 XML files.\n"
]
}
],
"source": [
"from tqdm.notebook import tqdm\n",
"\n",
"from docling.backend.xml.jats_backend import JatsDocumentBackend\n",
"from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend\n",
"from docling.datamodel.base_models import InputFormat\n",
"from docling.datamodel.document import InputDocument\n",
"\n",
"# check PMC\n",
"in_doc = InputDocument(\n",
" path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\",\n",
" format=InputFormat.XML_JATS,\n",
" backend=JatsDocumentBackend,\n",
")\n",
"backend = JatsDocumentBackend(\n",
" in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"\n",
")\n",
"print(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\")\n",
"\n",
"# check USPTO\n",
"in_doc = InputDocument(\n",
" path_or_stream=TEMP_DIR / \"ipg241217-1.xml\",\n",
" format=InputFormat.XML_USPTO,\n",
" backend=PatentUsptoDocumentBackend,\n",
")\n",
"backend = PatentUsptoDocumentBackend(\n",
" in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\"\n",
")\n",
"print(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\")\n",
"\n",
"patent_valid = 0\n",
"pbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num)\n",
"for in_path in pbar:\n",
" in_doc = InputDocument(\n",
" path_or_stream=in_path,\n",
" format=InputFormat.XML_USPTO,\n",
" backend=PatentUsptoDocumentBackend,\n",
" )\n",
" backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)\n",
" patent_valid += int(backend.is_valid())\n",
"\n",
"print(f\"Found {patent_valid} patents out of {doc_num} XML files.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling the function `convert()` will convert the input document into a `DoclingDocument`"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Patent \"Semiconductor package\" has 19 claims\n"
]
}
],
"source": [
"doc = backend.convert()\n",
"\n",
"claims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\")\n",
"print(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"✏️ **Tip**: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic `DocumentConverter` object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in [Simple Conversion](#simple-conversion) is the recommended usage for the supported XML files."
]
},
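To make the tip concrete, here is a minimal sketch of the recommended generic conversion, reusing `TEMP_DIR` and the patent file from the cells above (`DocumentConverter` detects the input format on its own and picks the matching backend internally):

```python
from docling.document_converter import DocumentConverter

# The generic converter guesses the input format (USPTO XML here)
# and applies the corresponding backend parser automatically.
converter = DocumentConverter()
result = converter.convert(TEMP_DIR / "ipg241217-1.xml")
doc = result.document  # the resulting DoclingDocument
print(doc.name)
```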
{
"cell_type": "markdown",
"metadata": {},
@@ -923,7 +799,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
"version": "3.12.10"
}
},
"nbformat": 4,


@@ -8,9 +8,9 @@
"\n",
"| Step | Tech | Execution |\n",
"| --- | --- | --- |\n",
"| Embedding | Ollama (IBM Granite Embedding 30M) | 💻 Local |\n",
"| Vector store | OpenSearch 2.19.3 | 💻 Local |\n",
"| Gen AI | Ollama (IBM Granite 3.3 8B) | 💻 Local |\n",
"| Embedding | HuggingFace (IBM Granite Embedding 30M) | 💻 Local |\n",
"| Vector store | OpenSearch 3.0.0 | 💻 Local |\n",
"| Gen AI | Ollama (IBM Granite 4.0 Tiny) | 💻 Local |\n",
"\n",
"\n",
"This is a code recipe that uses [OpenSearch](https://opensearch.org/), an open-source search and analytics tool,\n",
@@ -66,7 +66,11 @@
"metadata": {},
"outputs": [],
"source": [
"! uv pip install -q --no-progress notebook ipywidgets docling llama-index-readers-file llama-index-readers-docling llama-index-node-parser-docling llama-index-vector-stores-opensearch llama-index-embeddings-ollama llama-index-llms-ollama"
"import os\n",
"\n",
"os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
"\n",
"! uv pip install -q --no-progress notebook ipywidgets docling llama-index-readers-file llama-index-readers-docling llama-index-readers-elasticsearch llama-index-node-parser-docling llama-index-vector-stores-opensearch llama-index-embeddings-huggingface llama-index-llms-ollama"
]
},
{
@@ -80,7 +84,16 @@
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/ceb/git/docling/.venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validate_default' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'validate_default' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.\n",
" warnings.warn(\n"
]
}
],
"source": [
"import logging\n",
"from pathlib import Path\n",
@@ -93,21 +106,28 @@
" ChunkingDocSerializer,\n",
" ChunkingSerializerProvider,\n",
")\n",
"from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n",
"from docling_core.transforms.serializer.markdown import MarkdownTableSerializer\n",
"from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex\n",
"from llama_index.core.schema import TransformComponent\n",
"from llama_index.core.data_structs import Node\n",
"from llama_index.core.response_synthesizers import get_response_synthesizer\n",
"from llama_index.core.schema import NodeWithScore, TransformComponent\n",
"from llama_index.core.vector_stores import MetadataFilter, MetadataFilters\n",
"from llama_index.core.vector_stores.types import VectorStoreQueryMode\n",
"from llama_index.embeddings.ollama import OllamaEmbedding\n",
"from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
"from llama_index.llms.ollama import Ollama\n",
"from llama_index.node_parser.docling import DoclingNodeParser\n",
"from llama_index.readers.docling import DoclingReader\n",
"from llama_index.readers.elasticsearch import ElasticsearchReader\n",
"from llama_index.vector_stores.opensearch import (\n",
" OpensearchVectorClient,\n",
" OpensearchVectorStore,\n",
")\n",
"from rich.console import Console\n",
"from rich.pretty import pprint\n",
"from transformers import AutoTokenizer\n",
"\n",
"from docling.chunking import HybridChunker\n",
"\n",
"logging.getLogger().setLevel(logging.WARNING)"
]
@@ -181,7 +201,7 @@
" -e DISABLE_INSTALL_DEMO_CONFIG=true \\\n",
" -e DISABLE_SECURITY_PLUGIN=true \\\n",
" --name opensearch-node \\\n",
" -d opensearchproject/opensearch:2.19.3\n",
" -d opensearchproject/opensearch:3.0.0\n",
"```\n",
"\n",
"Once the instance is running, verify that you can connect to OpenSearch:"
@@ -197,19 +217,19 @@
"output_type": "stream",
"text": [
"{\n",
" \"name\" : \"b8582205a25c\",\n",
" \"name\" : \"b20d8368e745\",\n",
" \"cluster_name\" : \"docker-cluster\",\n",
" \"cluster_uuid\" : \"VxJ5hoxDRn68jodknsNdag\",\n",
" \"cluster_uuid\" : \"0gEZCJQwRHabS_E-n_3i9g\",\n",
" \"version\" : {\n",
" \"distribution\" : \"opensearch\",\n",
" \"number\" : \"2.19.3\",\n",
" \"number\" : \"3.0.0\",\n",
" \"build_type\" : \"tar\",\n",
" \"build_hash\" : \"a90f864b8524bc75570a8461ccb569d2a4bfed42\",\n",
" \"build_date\" : \"2025-07-21T22:34:54.259463448Z\",\n",
" \"build_hash\" : \"dc4efa821904cc2d7ea7ef61c0f577d3fc0d8be9\",\n",
" \"build_date\" : \"2025-05-03T06:23:50.311109522Z\",\n",
" \"build_snapshot\" : false,\n",
" \"lucene_version\" : \"9.12.2\",\n",
" \"minimum_wire_compatibility_version\" : \"7.10.0\",\n",
" \"minimum_index_compatibility_version\" : \"7.0.0\"\n",
" \"lucene_version\" : \"10.1.0\",\n",
" \"minimum_wire_compatibility_version\" : \"2.19.0\",\n",
" \"minimum_index_compatibility_version\" : \"2.0.0\"\n",
" },\n",
" \"tagline\" : \"The OpenSearch Project: https://opensearch.org/\"\n",
"}\n",
@@ -226,19 +246,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ollama models\n",
"### Language models\n",
"\n",
"We will use [Ollama](https://ollama.com/), an open-source tool to run language models on your local computer, rather than relying on cloud services.\n",
"We will use [HuggingFace](https://huggingface.co/) and [Ollama](https://ollama.com/) to run language models on your local computer, rather than relying on cloud services.\n",
"\n",
"In this example, we will use:\n",
"- [IBM Granite Embedding 30M English](https://huggingface.co/ibm-granite/granite-embedding-30m-english) for text embeddings\n",
"- [IBM Granite 3.3 8B Instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) for model inference\n",
"In this example, the following models are considered:\n",
"- [IBM Granite Embedding 30M English](https://huggingface.co/ibm-granite/granite-embedding-30m-english) with HuggingFace for text embeddings\n",
"- [IBM Granite 4.0 Tiny](https://ollama.com/library/granite4:tiny-h) with Ollama for model inference\n",
"\n",
"Once Ollama is installed on your computer, you can pull and run the models above from your terminal:\n",
"Once Ollama is installed on your computer, you can pull the model above from your terminal:\n",
"\n",
"```shell\n",
"ollama run granite-embedding:30m\n",
"ollama run granite3.3:8b\n",
"ollama pull granite4:tiny-h\n",
"```"
]
},
@@ -270,10 +289,14 @@
"# index to store the Docling document vectors\n",
"OPENSEARCH_INDEX = \"docling-index\"\n",
"# the embedding model\n",
"EMBED_MODEL = OllamaEmbedding(model_name=\"granite-embedding:30m\")\n",
"EMBED_MODEL = HuggingFaceEmbedding(\n",
" model_name=\"ibm-granite/granite-embedding-30m-english\"\n",
")\n",
"# maximum chunk size in tokens\n",
"EMBED_MAX_TOKENS = 200\n",
"# the generation model\n",
"GEN_MODEL = Ollama(\n",
" model=\"granite3.3:8b\",\n",
" model=\"granite4:tiny-h\",\n",
" request_timeout=120.0,\n",
" # Manually set the context window to limit memory usage\n",
" context_window=8000,\n",
@@ -282,8 +305,6 @@
")\n",
"# a sample document\n",
"SOURCE = \"https://arxiv.org/pdf/2408.09869\"\n",
"# a sample query\n",
"QUERY = \"Which are the main AI models in Docling?\"\n",
"\n",
"embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n",
"print(f\"The embedding dimension is {embed_dim}.\")"
@@ -303,35 +324,29 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this recipe, we will use a single PDF file, the [Docling Technical Report](https://arxiv.org/pdf/2408.09869). We will process it using a [Hierarchical Chunker](https://docling-project.github.io/docling/concepts/chunking/#hierarchical-chunker) provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks.\n",
"In this recipe, we will use a single PDF file, the [Docling Technical Report](https://arxiv.org/pdf/2408.09869). We will process it using the [Hybrid Chunker](https://docling-project.github.io/docling/concepts/chunking/#hybrid-chunker) provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the document conversion pipeline\n",
"\n",
"\n",
"💡 The [Hybrid Chunker](https://docling-project.github.io/docling/concepts/chunking/#hybrid-chunker) is an alternative with additional capabilities for an efficient segmentation of the document. Check the [Hybrid Chunking](https://docling-project.github.io/docling/examples/hybrid_chunking/) example for more details."
"We will convert the original PDF file into a `DoclingDocument` format using a `DoclingReader` object. We specify the JSON export type to retain the document hierarchical structure as an input for the next step (chunking the document)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/ceb/git/docling/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py:684: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n",
" warnings.warn(warn_msg)\n",
"/Users/ceb/git/docling/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py:684: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n",
" warnings.warn(warn_msg)\n"
]
}
],
"outputs": [],
"source": [
"tmp_dir_path = Path(mkdtemp())\n",
"req = requests.get(SOURCE)\n",
"with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n",
" out_file.write(req.content)\n",
"\n",
"# create a Docling reader and a node parser with default Hierarchical chunker\n",
"reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\n",
"dir_reader = SimpleDirectoryReader(\n",
" input_dir=tmp_dir_path,\n",
@@ -352,8 +367,12 @@
"\n",
"Before the actual ingestion of data, we need to define the data transformations to apply on the `DoclingDocument`:\n",
"\n",
"- `DoclingNodeParser` executes the document-based chunking\n",
"- `MetadataTransform` is a custom transformation to ensure that generated chunk metadata is best formatted for indexing with OpenSearch"
"- `DoclingNodeParser` executes the document-based chunking with the hybrid chunker, which leverages the tokenizer of the embedding model to ensure that the resulting chunks fit within the model input text limit.\n",
"- `MetadataTransform` is a custom transformation to ensure that generated chunk metadata is best formatted for indexing with OpenSearch\n",
"\n",
"\n",
"💡 For demonstration purposes, we configure the hybrid chunker to produce chunks capped at 200 tokens. The optimal limit will vary according to the specific requirements of the AI application in question.\n",
"If this value is omitted, the chunker automatically derives the maximum size from the tokenizer. This safeguard guarantees that each chunk remains within the bounds supported by the underlying embedding model."
]
},
{
@@ -362,8 +381,15 @@
"metadata": {},
"outputs": [],
"source": [
"# create the hybrid chunker\n",
"tokenizer = HuggingFaceTokenizer(\n",
" tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL.model_name),\n",
" max_tokens=EMBED_MAX_TOKENS,\n",
")\n",
"chunker = HybridChunker(tokenizer=tokenizer)\n",
"\n",
"# create a Docling node parser\n",
"node_parser = DoclingNodeParser()\n",
"node_parser = DoclingNodeParser(chunker=chunker)\n",
"\n",
"\n",
"# create a custom transformation to avoid out-of-range integers\n",
@@ -384,7 +410,12 @@
"\n",
"In this step, we create an `OpenSearchVectorClient`, which encapsulates the logic for a single OpenSearch index with vector search enabled.\n",
"\n",
"We then initialize the index using our sample data (a single PDF file), the Docling node parser, and the OpenSearch client that we just created.\n"
"We then initialize the index using our sample data (a single PDF file), the Docling node parser, and the OpenSearch client that we just created.\n",
"\n",
"💡 You may get a warning message like:\n",
"> Token indices sequence length is longer than the specified maximum sequence length for this model\n",
"\n",
"This is a _false alarm_ and you may get more background explanation in [Docling's FAQ](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model) page."
]
},
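If the warning is noisy, it can be silenced; a small sketch assuming the `transformers` logging utilities (already installed as a dependency of the tokenizer):

```python
from transformers.utils import logging as hf_logging

# Suppress the harmless "Token indices sequence length..." warning;
# errors are still reported.
hf_logging.set_verbosity_error()
```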
{
@@ -396,7 +427,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"2025-09-10 13:16:53,752 - WARNING - GET http://localhost:9200/docling-index [status:404 request:0.015s]\n"
"2025-10-24 15:05:49,841 - WARNING - GET http://localhost:9200/docling-index [status:404 request:0.006s]\n"
]
}
],
@@ -407,9 +438,10 @@
"embed_field = \"embedding\"\n",
"\n",
"client = OpensearchVectorClient(\n",
" endpoint=\"http://localhost:9200\",\n",
" endpoint=OPENSEARCH_ENDPOINT,\n",
" index=OPENSEARCH_INDEX,\n",
" dim=embed_dim,\n",
" engine=\"faiss\",\n",
" embedding_field=embed_field,\n",
" text_field=text_field,\n",
")\n",
@@ -450,20 +482,24 @@
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: Which are the main AI models in Docling?\n",
"🤖: Docling primarily utilizes two AI models. The first one is a layout analysis model, \n",
"serving as an accurate object-detector for page elements. The second model is \n",
"TableFormer, a state-of-the-art table structure recognition model. Both models are \n",
"pre-trained and their weights are hosted on Hugging Face. They also power the \n",
"deepsearch-experience, a cloud-native service for knowledge exploration tasks.\n",
"🤖: The two main AI models used in Docling are:\n",
"\n",
"<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>. A layout analysis model, an accurate object-detector for page elements \n",
"<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span>. TableFormer, a state-of-the-art table structure recognition model\n",
"\n",
"These models were initially released as part of the open-source Docling package to help \n",
"with document understanding tasks.\n",
"</pre>\n"
],
"text/plain": [
"👤: Which are the main AI models in Docling?\n",
"🤖: Docling primarily utilizes two AI models. The first one is a layout analysis model, \n",
"serving as an accurate object-detector for page elements. The second model is \n",
"TableFormer, a state-of-the-art table structure recognition model. Both models are \n",
"pre-trained and their weights are hosted on Hugging Face. They also power the \n",
"deepsearch-experience, a cloud-native service for knowledge exploration tasks.\n"
"🤖: The two main AI models used in Docling are:\n",
"\n",
"\u001b[1;36m1\u001b[0m. A layout analysis model, an accurate object-detector for page elements \n",
"\u001b[1;36m2\u001b[0m. TableFormer, a state-of-the-art table structure recognition model\n",
"\n",
"These models were initially released as part of the open-source Docling package to help \n",
"with document understanding tasks.\n"
]
},
"metadata": {},
@@ -499,23 +535,23 @@
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: What are the performance metrics of Docling-native PDF backend with <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">16</span> threads?\n",
"🤖: The Docling-native PDF backend, when utilized with <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">16</span> threads on an Apple M3 Max \n",
"system, completed the processing in approximately <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">167</span> seconds. It achieved a throughput \n",
"of about <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1.34</span> pages per second and peaked at a memory usage of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">6.20</span> GB <span style=\"font-weight: bold\">(</span>resident set \n",
"size<span style=\"font-weight: bold\">)</span>. On an Intel Xeon E5-<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2690</span> system with the same thread count, it took around <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">244</span> \n",
"seconds to process, managed a throughput of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.92</span> pages per second, and reached a peak \n",
"memory usage of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">6.16</span> GB.\n",
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: What is the time to solution with the native backend on Intel?\n",
"🤖: The time to solution <span style=\"font-weight: bold\">(</span>TTS<span style=\"font-weight: bold\">)</span> for the native backend on Intel is:\n",
"- For Apple M3 Max <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">16</span> cores<span style=\"font-weight: bold\">)</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds \n",
"- For <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">Intel</span><span style=\"font-weight: bold\">(</span>R<span style=\"font-weight: bold\">)</span> Xeon E5-<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2690</span>, native backend: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">244</span> seconds\n",
"\n",
"So the TTS with the native backend on Intel ranges from approximately <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">244</span> to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds\n",
"depending on the specific configuration.\n",
"</pre>\n"
],
"text/plain": [
"👤: What are the performance metrics of Docling-native PDF backend with \u001b[1;36m16\u001b[0m threads?\n",
"🤖: The Docling-native PDF backend, when utilized with \u001b[1;36m16\u001b[0m threads on an Apple M3 Max \n",
"system, completed the processing in approximately \u001b[1;36m167\u001b[0m seconds. It achieved a throughput \n",
"of about \u001b[1;36m1.34\u001b[0m pages per second and peaked at a memory usage of \u001b[1;36m6.20\u001b[0m GB \u001b[1m(\u001b[0mresident set \n",
"size\u001b[1m)\u001b[0m. On an Intel Xeon E5-\u001b[1;36m2690\u001b[0m system with the same thread count, it took around \u001b[1;36m244\u001b[0m \n",
"seconds to process, managed a throughput of \u001b[1;36m0.92\u001b[0m pages per second, and reached a peak \n",
"memory usage of \u001b[1;36m6.16\u001b[0m GB.\n"
"👤: What is the time to solution with the native backend on Intel?\n",
"🤖: The time to solution \u001b[1m(\u001b[0mTTS\u001b[1m)\u001b[0m for the native backend on Intel is:\n",
"- For Apple M3 Max \u001b[1m(\u001b[0m\u001b[1;36m16\u001b[0m cores\u001b[1m)\u001b[0m: \u001b[1;36m375\u001b[0m seconds \n",
"- For \u001b[1;35mIntel\u001b[0m\u001b[1m(\u001b[0mR\u001b[1m)\u001b[0m Xeon E5-\u001b[1;36m2690\u001b[0m, native backend: \u001b[1;36m244\u001b[0m seconds\n",
"\n",
"So the TTS with the native backend on Intel ranges from approximately \u001b[1;36m244\u001b[0m to \u001b[1;36m375\u001b[0m seconds\n",
"depending on the specific configuration.\n"
]
},
"metadata": {},
@@ -523,9 +559,7 @@
}
],
"source": [
"QUERY = (\n",
" \"What are the performance metrics of Docling-native PDF backend with 16 threads?\"\n",
")\n",
"QUERY = \"What is the time to solution with the native backend on Intel?\"\n",
"query_engine = index.as_query_engine(llm=GEN_MODEL)\n",
"res = query_engine.query(QUERY)\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
@@ -546,7 +580,15 @@
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors\n"
]
}
],
"source": [
"class MDTableSerializerProvider(ChunkingSerializerProvider):\n",
" def get_serializer(self, doc):\n",
@@ -561,7 +603,9 @@
"client.clear()\n",
"vector_store.clear()\n",
"\n",
"chunker = HierarchicalChunker(\n",
"chunker = HybridChunker(\n",
" tokenizer=tokenizer,\n",
" max_tokens=EMBED_MAX_TOKENS,\n",
" serializer_provider=MDTableSerializerProvider(),\n",
")\n",
"node_parser = DoclingNodeParser(chunker=chunker)\n",
@@ -573,13 +617,6 @@
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that the generated response is now more accurate. Refer to the [Advanced chunking & serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/) example for more details on serialization strategies."
]
},
{
"cell_type": "code",
"execution_count": 12,
@@ -588,19 +625,25 @@
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: Which backend is faster on Intel with <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span> threads?\n",
"🤖: The pypdfium backend is faster than the Docling-native PDF backend for an Intel Xeon\n",
"E5-<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2690</span> CPU with a thread budget of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span>, as indicated in Table <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>. The pypdfium backend \n",
"completes the processing in <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">239</span> seconds, achieving a throughput of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.94</span> pages per \n",
"second, while the Docling-native PDF backend takes <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds.\n",
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: What is the time to solution with the native backend on Intel?\n",
"🤖: The table shows that for the native backend on Intel systems, the time-to-solution \n",
"<span style=\"font-weight: bold\">(</span>TTS<span style=\"font-weight: bold\">)</span> ranges from <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">239</span> seconds to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds. Specifically:\n",
"- With <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span> threads, the TTS is <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">239</span> seconds.\n",
"- With <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">16</span> threads, the TTS is <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">244</span> seconds.\n",
"\n",
"So the time to solution with the native backend on Intel varies between approximately \n",
"<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">239</span> and <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">375</span> seconds depending on the thread budget used.\n",
"</pre>\n"
],
"text/plain": [
"👤: Which backend is faster on Intel with \u001b[1;36m4\u001b[0m threads?\n",
"🤖: The pypdfium backend is faster than the Docling-native PDF backend for an Intel Xeon\n",
"E5-\u001b[1;36m2690\u001b[0m CPU with a thread budget of \u001b[1;36m4\u001b[0m, as indicated in Table \u001b[1;36m1\u001b[0m. The pypdfium backend \n",
"completes the processing in \u001b[1;36m239\u001b[0m seconds, achieving a throughput of \u001b[1;36m0.94\u001b[0m pages per \n",
"second, while the Docling-native PDF backend takes \u001b[1;36m375\u001b[0m seconds.\n"
"👤: What is the time to solution with the native backend on Intel?\n",
"🤖: The table shows that for the native backend on Intel systems, the time-to-solution \n",
"\u001b[1m(\u001b[0mTTS\u001b[1m)\u001b[0m ranges from \u001b[1;36m239\u001b[0m seconds to \u001b[1;36m375\u001b[0m seconds. Specifically:\n",
"- With \u001b[1;36m4\u001b[0m threads, the TTS is \u001b[1;36m239\u001b[0m seconds.\n",
"- With \u001b[1;36m16\u001b[0m threads, the TTS is \u001b[1;36m244\u001b[0m seconds.\n",
"\n",
"So the time to solution with the native backend on Intel varies between approximately \n",
"\u001b[1;36m239\u001b[0m and \u001b[1;36m375\u001b[0m seconds depending on the thread budget used.\n"
]
},
"metadata": {},
@@ -609,7 +652,6 @@
],
"source": [
"query_engine = index.as_query_engine(llm=GEN_MODEL)\n",
"QUERY = \"Which backend is faster on Intel with 4 threads?\"\n",
"res = query_engine.query(QUERY)\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
]
@@ -618,7 +660,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Refer to the [Advanced chunking & serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/) example for more details on serialization strategies."
"Observe that the generated response is now more accurate. Refer to the [Advanced chunking & serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/) example for more details on serialization strategies."
]
},
{
@@ -671,9 +713,14 @@
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ </span><span style=\"font-weight: bold\">{</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'k'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'score'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.6800267</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'text'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it '</span>+<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">90</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'items'</span>: <span style=\"font-weight: bold\">[{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/68'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'text'</span><span style=\"font-weight: bold\">}]</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'score'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.694972</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'text'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'- [13] B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. Doclaynet: a large humanannotated dataset for document-layout segmentation. pages 3743-3751, 2022.\\n- [14] pypdf Maintainers. pypdf: '</span>+<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">314</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'items'</span>: <span style=\"font-weight: bold\">[</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ │ </span><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/93'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'list_item'</span><span style=\"font-weight: bold\">}</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ │ </span><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/94'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'list_item'</span><span style=\"font-weight: bold\">}</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ │ </span><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/95'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'list_item'</span><span style=\"font-weight: bold\">}</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ │ </span><span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/96'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'list_item'</span><span style=\"font-weight: bold\">}</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"font-weight: bold\">]</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ </span><span style=\"font-weight: bold\">}</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
@@ -682,9 +729,14 @@
"\u001b[1m[\u001b[0m\n",
"\u001b[2;32m│ \u001b[0m\u001b[1m{\u001b[0m\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'k'\u001b[0m: \u001b[1;36m1\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6800267\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it '\u001b[0m+\u001b[1;36m90\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/68'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'text'\u001b[0m\u001b[1m}\u001b[0m\u001b[1m]\u001b[0m\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.694972\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'- \u001b[0m\u001b[32m[\u001b[0m\u001b[32m13\u001b[0m\u001b[32m]\u001b[0m\u001b[32m B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. Doclaynet: a large humanannotated dataset for document-layout segmentation. pages 3743-3751, 2022.\\n- \u001b[0m\u001b[32m[\u001b[0m\u001b[32m14\u001b[0m\u001b[32m]\u001b[0m\u001b[32m pypdf Maintainers. pypdf: '\u001b[0m+\u001b[1;36m314\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\n",
"\u001b[2;32m│ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/93'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n",
"\u001b[2;32m│ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/94'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n",
"\u001b[2;32m│ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/95'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n",
"\u001b[2;32m│ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/96'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[1m]\u001b[0m\n",
"\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
@@ -728,9 +780,9 @@
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ </span><span style=\"font-weight: bold\">{</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'k'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'score'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.6078317</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'text'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TT'</span>+<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1014</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'items'</span>: <span style=\"font-weight: bold\">[{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/texts/72'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'caption'</span><span style=\"font-weight: bold\">}</span>, <span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/tables/0'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'table'</span><span style=\"font-weight: bold\">}]</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'score'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.6238112</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'text'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TT'</span>+<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">515</span>,\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ │ </span><span style=\"color: #008000; text-decoration-color: #008000\">'items'</span>: <span style=\"font-weight: bold\">[{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/tables/0'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'table'</span><span style=\"font-weight: bold\">}</span>, <span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'ref'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'#/tables/0'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'label'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'table'</span><span style=\"font-weight: bold\">}]</span>\n",
"<span style=\"color: #7fbf7f; text-decoration-color: #7fbf7f\">│ </span><span style=\"font-weight: bold\">}</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
@@ -739,9 +791,9 @@
"\u001b[1m[\u001b[0m\n",
"\u001b[2;32m│ \u001b[0m\u001b[1m{\u001b[0m\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'k'\u001b[0m: \u001b[1;36m1\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6078317\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u001b[0m\u001b[32m(\u001b[0m\u001b[32mTT'\u001b[0m+\u001b[1;36m1014\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/72'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'caption'\u001b[0m\u001b[1m}\u001b[0m, \u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/tables/0'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'table'\u001b[0m\u001b[1m}\u001b[0m\u001b[1m]\u001b[0m\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6238112\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u001b[0m\u001b[32m(\u001b[0m\u001b[32mTT'\u001b[0m+\u001b[1;36m515\u001b[0m,\n",
"\u001b[2;32m│ │ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/tables/0'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'table'\u001b[0m\u001b[1m}\u001b[0m, \u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/tables/0'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'table'\u001b[0m\u001b[1m}\u001b[0m\u001b[1m]\u001b[0m\n",
"\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
@@ -816,7 +868,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"2025-09-10 13:17:10,104 - WARNING - GET http://localhost:9200/docling-index-rrf [status:404 request:0.001s]\n"
"2025-10-24 15:06:05,175 - WARNING - GET http://localhost:9200/docling-index-rrf [status:404 request:0.001s]\n"
]
}
],
@@ -825,6 +877,7 @@
" endpoint=OPENSEARCH_ENDPOINT,\n",
" index=f\"{OPENSEARCH_INDEX}-rrf\",\n",
" dim=embed_dim,\n",
" engine=\"faiss\",\n",
" embedding_field=embed_field,\n",
" text_field=text_field,\n",
" search_pipeline=\"rrf-pipeline\",\n",
@@ -857,6 +910,13 @@
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
@@ -866,6 +926,13 @@
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
@@ -880,20 +947,26 @@
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span> ***\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features <span style=\"font-weight: bold\">(</span>e.g. OCR, table structure recognition<span style=\"font-weight: bold\">)</span>, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
"and options are documented in the README file. <span style=\"color: #808000; text-decoration-color: #808000; font-weight: bold\">Docling also provides a Dockerfile</span> to \n",
"demonstrate how to install and run it inside a container.\n",
"In the final pipeline stage, Docling assembles all prediction results produced on each \n",
"page into a well-defined datatype that encapsulates a converted document, as defined in \n",
"the auxiliary package docling-core . The generated document object is passed through a \n",
"post-processing model which leverages several algorithms to augment features, such as \n",
"detection of the document language, correcting the reading order, matching figures with \n",
"captions and labelling metadata such as title, authors and references. The final output \n",
"can then be serialized to JSON or transformed into a Markdown representation at the \n",
"users request.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m2\u001b[0m ***\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features \u001b[1m(\u001b[0me.g. OCR, table structure recognition\u001b[1m)\u001b[0m, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
"and options are documented in the README file. \u001b[1;33mDocling also provides a Dockerfile\u001b[0m to \n",
"demonstrate how to install and run it inside a container.\n"
"In the final pipeline stage, Docling assembles all prediction results produced on each \n",
"page into a well-defined datatype that encapsulates a converted document, as defined in \n",
"the auxiliary package docling-core . The generated document object is passed through a \n",
"post-processing model which leverages several algorithms to augment features, such as \n",
"detection of the document language, correcting the reading order, matching figures with \n",
"captions and labelling metadata such as title, authors and references. The final output \n",
"can then be serialized to JSON or transformed into a Markdown representation at the \n",
"users request.\n"
]
},
"metadata": {},
@@ -903,24 +976,32 @@
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"```\n",
"source = <span style=\"color: #008000; text-decoration-color: #008000\">\"https://arxiv.org/pdf/2206.01062\"</span> # PDF path or URL converter = \n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">DocumentConverter</span><span style=\"font-weight: bold\">()</span> result = <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">converter.convert_single</span><span style=\"font-weight: bold\">(</span>source<span style=\"font-weight: bold\">)</span> \n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">print</span><span style=\"font-weight: bold\">(</span><span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">result.render_as_markdown</span><span style=\"font-weight: bold\">())</span> # output: <span style=\"color: #008000; text-decoration-color: #008000\">\"## DocLayNet: A Large Human -Annotated </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">Dataset for Document -Layout Analysis [...]\"</span>\n",
"```\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features <span style=\"font-weight: bold\">(</span>e.g. OCR, table structure recognition<span style=\"font-weight: bold\">)</span>, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
"and options are documented in the README file. <span style=\"color: #808000; text-decoration-color: #808000; font-weight: bold\">Docling also provides a Dockerfile</span> to \n",
"demonstrate how to install and run it inside a container.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n"
"```\n",
"source = \u001b[32m\"https://arxiv.org/pdf/2206.01062\"\u001b[0m # PDF path or URL converter = \n",
"\u001b[1;35mDocumentConverter\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m result = \u001b[1;35mconverter.convert_single\u001b[0m\u001b[1m(\u001b[0msource\u001b[1m)\u001b[0m \n",
"\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0m\u001b[1;35mresult.render_as_markdown\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # output: \u001b[32m\"## DocLayNet: A Large Human -Annotated \u001b[0m\n",
"\u001b[32mDataset for Document -Layout Analysis \u001b[0m\u001b[32m[\u001b[0m\u001b[32m...\u001b[0m\u001b[32m]\u001b[0m\u001b[32m\"\u001b[0m\n",
"```\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features \u001b[1m(\u001b[0me.g. OCR, table structure recognition\u001b[1m)\u001b[0m, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
"and options are documented in the README file. \u001b[1;33mDocling also provides a Dockerfile\u001b[0m to \n",
"demonstrate how to install and run it inside a container.\n"
]
},
"metadata": {},
@@ -956,6 +1037,12 @@
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> ***\n",
"```\n",
"source = <span style=\"color: #008000; text-decoration-color: #008000\">\"https://arxiv.org/pdf/2206.01062\"</span> # PDF path or URL converter = \n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">DocumentConverter</span><span style=\"font-weight: bold\">()</span> result = <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">converter.convert_single</span><span style=\"font-weight: bold\">(</span>source<span style=\"font-weight: bold\">)</span> \n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">print</span><span style=\"font-weight: bold\">(</span><span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">result.render_as_markdown</span><span style=\"font-weight: bold\">())</span> # output: <span style=\"color: #008000; text-decoration-color: #008000\">\"## DocLayNet: A Large Human -Annotated </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">Dataset for Document -Layout Analysis [...]\"</span>\n",
"```\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features <span style=\"font-weight: bold\">(</span>e.g. OCR, table structure recognition<span style=\"font-weight: bold\">)</span>, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
@@ -965,6 +1052,12 @@
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n",
"```\n",
"source = \u001b[32m\"https://arxiv.org/pdf/2206.01062\"\u001b[0m # PDF path or URL converter = \n",
"\u001b[1;35mDocumentConverter\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m result = \u001b[1;35mconverter.convert_single\u001b[0m\u001b[1m(\u001b[0msource\u001b[1m)\u001b[0m \n",
"\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0m\u001b[1;35mresult.render_as_markdown\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # output: \u001b[32m\"## DocLayNet: A Large Human -Annotated \u001b[0m\n",
"\u001b[32mDataset for Document -Layout Analysis \u001b[0m\u001b[32m[\u001b[0m\u001b[32m...\u001b[0m\u001b[32m]\u001b[0m\u001b[32m\"\u001b[0m\n",
"```\n",
"Optionally, you can configure custom pipeline features and runtime options, such as \n",
"turning on or off features \u001b[1m(\u001b[0me.g. OCR, table structure recognition\u001b[1m)\u001b[0m, enforcing limits on \n",
"the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
@@ -975,6 +1068,204 @@
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span> ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
"contributing guidelines included in the Docling repository. If you use Docling in your \n",
"projects, please consider citing this technical report.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m2\u001b[0m ***\n",
"Docling is designed to allow easy extension of the model library and pipelines. In the \n",
"future, we plan to extend Docling with several more models, such as a figure-classifier \n",
"model, an equationrecognition model, a code-recognition model and more. This will help \n",
"improve the quality of conversion for specific types of content, as well as augment \n",
"extracted document metadata with additional information. Further investment into testing\n",
"and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
"on our roadmap, too.\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
"contributing guidelines included in the Docling repository. If you use Docling in your \n",
"projects, please consider citing this technical report.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> ***\n",
"We therefore decided to provide multiple backend choices, and additionally open-source a\n",
"custombuilt PDF parser, which is based on the low-level qpdf <span style=\"font-weight: bold\">[</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span><span style=\"font-weight: bold\">]</span> library. It is made \n",
"available in a separate package named docling-parse and powers the default PDF backend \n",
"in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \n",
"be a safe backup choice in certain cases, e.g. if issues are seen with particular font \n",
"encodings.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n",
"We therefore decided to provide multiple backend choices, and additionally open-source a\n",
"custombuilt PDF parser, which is based on the low-level qpdf \u001b[1m[\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m]\u001b[0m library. It is made \n",
"available in a separate package named docling-parse and powers the default PDF backend \n",
"in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \n",
"be a safe backup choice in certain cases, e.g. if issues are seen with particular font \n",
"encodings.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"retriever_rrf = index_hybrid.as_retriever(\n",
" vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3\n",
")\n",
"nodes = retriever_rrf.retrieve(QUERY)\n",
"for idx, item in enumerate(nodes):\n",
" console.print(\n",
" f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Context expansion\n",
"\n",
"Using small chunks can offer several benefits: it increases retrieval precision and it keeps the answer generation tightly focused, which improves accuracy, reduces hallucination, and speeds up inferece.\n",
"However, your RAG system may overlook contextual information necessary for producing a fully grounded response.\n",
"\n",
"Docling's preservation of document structure enables you to employ various strategies for enriching the context available during answer generation within the RAG pipeline.\n",
"For example, after identifying the most relevant chunk, you might include adjacent chunks from the same section as additional groudning material before generating the final answer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following example, the generated response is wrong, since the top retrieved chunks do not contain all the information that is required to answer the question."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
"have limited resources and complex tables?\n",
"🤖: According to the tests in this section using both the MacBook Pro M3 Max and \n",
"bare-metal server running Ubuntu <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">20.04</span> LTS on an Intel Xeon E5-<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2690</span> CPU with a fixed \n",
"thread budget of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span>, Docling achieved faster processing speeds when using the \n",
"custom-built PDF backend based on the low-level qpdf library <span style=\"font-weight: bold\">(</span>docling-parse<span style=\"font-weight: bold\">)</span> compared to\n",
"the alternative PDF backend relying on pypdfium.\n",
"\n",
"Furthermore, the context mentions that Docling provides a separate package named \n",
"docling-ibm-models which includes pre-trained weights and inference code for \n",
"TableFormer, a state-of-the-art table structure recognition model. This suggests that if\n",
"you have complex tables in your documents, using this specialized table recognition \n",
"model could be beneficial.\n",
"\n",
"Therefore, based on the tests with arXiv papers and IBM Redbooks, if you have limited \n",
"resources <span style=\"font-weight: bold\">(</span>likely referring to computational power<span style=\"font-weight: bold\">)</span> and need to process documents \n",
"containing complex tables, it would be recommended to use the docling-parse PDF backend \n",
"along with the TableFormer AI model from docling-ibm-models. This combination should \n",
"provide a good balance of performance and table recognition capabilities for your \n",
"specific needs.\n",
"</pre>\n"
],
"text/plain": [
"👤: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
"have limited resources and complex tables?\n",
"🤖: According to the tests in this section using both the MacBook Pro M3 Max and \n",
"bare-metal server running Ubuntu \u001b[1;36m20.04\u001b[0m LTS on an Intel Xeon E5-\u001b[1;36m2690\u001b[0m CPU with a fixed \n",
"thread budget of \u001b[1;36m4\u001b[0m, Docling achieved faster processing speeds when using the \n",
"custom-built PDF backend based on the low-level qpdf library \u001b[1m(\u001b[0mdocling-parse\u001b[1m)\u001b[0m compared to\n",
"the alternative PDF backend relying on pypdfium.\n",
"\n",
"Furthermore, the context mentions that Docling provides a separate package named \n",
"docling-ibm-models which includes pre-trained weights and inference code for \n",
"TableFormer, a state-of-the-art table structure recognition model. This suggests that if\n",
"you have complex tables in your documents, using this specialized table recognition \n",
"model could be beneficial.\n",
"\n",
"Therefore, based on the tests with arXiv papers and IBM Redbooks, if you have limited \n",
"resources \u001b[1m(\u001b[0mlikely referring to computational power\u001b[1m)\u001b[0m and need to process documents \n",
"containing complex tables, it would be recommended to use the docling-parse PDF backend \n",
"along with the TableFormer AI model from docling-ibm-models. This combination should \n",
"provide a good balance of performance and table recognition capabilities for your \n",
"specific needs.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"QUERY = \"According to the tests with arXiv and IBM Redbooks, which backend should I use if I have limited resources and complex tables?\"\n",
"query_rrf = index_hybrid.as_query_engine(\n",
" vector_store_query_mode=VectorStoreQueryMode.HYBRID,\n",
" llm=GEN_MODEL,\n",
" similarity_top_k=3,\n",
")\n",
"res = query_rrf.query(QUERY)\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> ***\n",
"In this section, we establish some reference numbers for the processing speed of Docling\n",
"and the resource budget it requires. All tests in this section are run with default \n",
"options on our standard test set distributed with Docling, which consists of three \n",
"papers from arXiv and two IBM Redbooks, with a total of <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">225</span> pages. Measurements were \n",
"taken using both available PDF backends on two different hardware systems: one MacBook \n",
"Pro M3 Max, and one bare-metal server running Ubuntu <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">20.04</span> LTS on an Intel Xeon E5-<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2690</span> \n",
"CPU. For reproducibility, we fixed the thread budget <span style=\"font-weight: bold\">(</span>through setting OMP NUM THREADS \n",
"environment variable <span style=\"font-weight: bold\">)</span> once to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span> <span style=\"font-weight: bold\">(</span>Docling default<span style=\"font-weight: bold\">)</span> and once to <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">16</span> <span style=\"font-weight: bold\">(</span>equal to full core \n",
"count on the test hardware<span style=\"font-weight: bold\">)</span>. All results are shown in Table <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n",
"In this section, we establish some reference numbers for the processing speed of Docling\n",
"and the resource budget it requires. All tests in this section are run with default \n",
"options on our standard test set distributed with Docling, which consists of three \n",
"papers from arXiv and two IBM Redbooks, with a total of \u001b[1;36m225\u001b[0m pages. Measurements were \n",
"taken using both available PDF backends on two different hardware systems: one MacBook \n",
"Pro M3 Max, and one bare-metal server running Ubuntu \u001b[1;36m20.04\u001b[0m LTS on an Intel Xeon E5-\u001b[1;36m2690\u001b[0m \n",
"CPU. For reproducibility, we fixed the thread budget \u001b[1m(\u001b[0mthrough setting OMP NUM THREADS \n",
"environment variable \u001b[1m)\u001b[0m once to \u001b[1;36m4\u001b[0m \u001b[1m(\u001b[0mDocling default\u001b[1m)\u001b[0m and once to \u001b[1;36m16\u001b[0m \u001b[1m(\u001b[0mequal to full core \n",
"count on the test hardware\u001b[1m)\u001b[0m. All results are shown in Table \u001b[1;36m1\u001b[0m.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
@@ -1004,20 +1295,26 @@
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">*** <span style=\"color: #808000; text-decoration-color: #808000\">k</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> ***\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
"contributing guidelines included in the Docling repository. If you use Docling in your \n",
"projects, please consider citing this technical report.\n",
"As part of Docling, we initially release two highly capable AI models to the open-source\n",
"community, which have been developed and published recently by our team. The first model\n",
"is a layout analysis model, an accurate object-detector for page elements <span style=\"font-weight: bold\">[</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">13</span><span style=\"font-weight: bold\">]</span>. The \n",
"second model is TableFormer <span style=\"font-weight: bold\">[</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">12</span>, <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">9</span><span style=\"font-weight: bold\">]</span>, a state-of-the-art table structure recognition \n",
"model. We provide the pre-trained weights <span style=\"font-weight: bold\">(</span>hosted on huggingface<span style=\"font-weight: bold\">)</span> and a separate package\n",
"for the inference code as docling-ibm-models . Both models are also powering the \n",
"open-access deepsearch-experience, our cloud-native service for knowledge exploration \n",
"tasks.\n",
"</pre>\n"
],
"text/plain": [
"*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n",
"We encourage everyone to propose or implement additional features and models, and will \n",
"gladly take your inputs and contributions under review . The codebase of Docling is open\n",
"for use and contribution, under the MIT license agreement and in alignment with our \n",
"contributing guidelines included in the Docling repository. If you use Docling in your \n",
"projects, please consider citing this technical report.\n"
"As part of Docling, we initially release two highly capable AI models to the open-source\n",
"community, which have been developed and published recently by our team. The first model\n",
"is a layout analysis model, an accurate object-detector for page elements \u001b[1m[\u001b[0m\u001b[1;36m13\u001b[0m\u001b[1m]\u001b[0m. The \n",
"second model is TableFormer \u001b[1m[\u001b[0m\u001b[1;36m12\u001b[0m, \u001b[1;36m9\u001b[0m\u001b[1m]\u001b[0m, a state-of-the-art table structure recognition \n",
"model. We provide the pre-trained weights \u001b[1m(\u001b[0mhosted on huggingface\u001b[1m)\u001b[0m and a separate package\n",
"for the inference code as docling-ibm-models . Both models are also powering the \n",
"open-access deepsearch-experience, our cloud-native service for knowledge exploration \n",
"tasks.\n"
]
},
"metadata": {},
@@ -1025,15 +1322,105 @@
}
],
"source": [
"retriever_rrf = index_hybrid.as_retriever(\n",
" vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3\n",
")\n",
"nodes = retriever_rrf.retrieve(QUERY)\n",
"for idx, item in enumerate(nodes):\n",
" console.print(\n",
" f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even though the top retrieved chunks are relevant for the question, the key information lays in the paragraph after the first chunk:\n",
"\n",
"> If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it will come at the expense of worse quality results, especially in table structure recovery."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We next examine the fragments that immediately precede and follow the topretrieved chunk, so long as those neighbors remain within the same section, to preserve the semantic integrity of the context.\n",
"The generated answer is now accurate because it has been grounded in the necessary contextual information.\n",
"\n",
"💡 In a production setting, it may be preferable to persist the parsed documents (i.e., `DoclingDocument` objects) as JSON in an object store or database and then fetch them when you need to traverse the document for contextexpansion scenarios. In this simplified example, however, we will query the OpenSearch index directly to obtain the required chunks."
]
},
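{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a minimal sketch of that persistence approach (not executed in this notebook): `DoclingDocument` is a pydantic model, so it can be dumped with `export_to_dict()` and restored with `model_validate()`. The `doc_store` directory below is a hypothetical stand-in for an object store or database."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# Minimal sketch: persist parsed documents as JSON and reload them later for\n",
  "# context expansion. `DOC_STORE` is a hypothetical local stand-in for an\n",
  "# object store or database.\n",
  "import json\n",
  "from pathlib import Path\n",
  "\n",
  "from docling_core.types.doc import DoclingDocument\n",
  "\n",
  "DOC_STORE = Path(\"doc_store\")\n",
  "DOC_STORE.mkdir(exist_ok=True)\n",
  "\n",
  "\n",
  "def save_document(doc: DoclingDocument) -> Path:\n",
  "    \"\"\"Serialize a DoclingDocument to a JSON file in the store.\"\"\"\n",
  "    out_path = DOC_STORE / f\"{doc.name}.json\"\n",
  "    out_path.write_text(json.dumps(doc.export_to_dict()))\n",
  "    return out_path\n",
  "\n",
  "\n",
  "def load_document(path: Path) -> DoclingDocument:\n",
  "    \"\"\"Rebuild a DoclingDocument from its JSON serialization.\"\"\"\n",
  "    return DoclingDocument.model_validate(json.loads(path.read_text()))"
 ]
},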
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">👤: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
"have limited resources and complex tables?\n",
"🤖: According to the tests described in the provided context, if you need to run Docling\n",
"in a very low-resource environment and are dealing with complex tables that require \n",
"high-quality table structure recovery, you should consider configuring the pypdfium \n",
"backend. The context mentions that while it is faster and more memory efficient than the\n",
"default docling-parse backend, it may come at the expense of worse quality results, \n",
"especially in table structure recovery. Therefore, for limited resources and complex \n",
"tables where quality is crucial, pypdfium would be a suitable choice despite its \n",
"potential drawbacks compared to the default backend.\n",
"</pre>\n"
],
"text/plain": [
"👤: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
"have limited resources and complex tables?\n",
"🤖: According to the tests described in the provided context, if you need to run Docling\n",
"in a very low-resource environment and are dealing with complex tables that require \n",
"high-quality table structure recovery, you should consider configuring the pypdfium \n",
"backend. The context mentions that while it is faster and more memory efficient than the\n",
"default docling-parse backend, it may come at the expense of worse quality results, \n",
"especially in table structure recovery. Therefore, for limited resources and complex \n",
"tables where quality is crucial, pypdfium would be a suitable choice despite its \n",
"potential drawbacks compared to the default backend.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"top_headings = nodes[0].metadata[\"headings\"]\n",
"top_text = nodes[0].text\n",
"\n",
"rdr = ElasticsearchReader(endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX)\n",
"docs = rdr.load_data(\n",
" field=text_field,\n",
" query={\n",
" \"query\": {\n",
" \"terms_set\": {\n",
" \"metadata.headings.keyword\": {\n",
" \"terms\": top_headings,\n",
" \"minimum_should_match_script\": {\"source\": \"params.num_terms\"},\n",
" }\n",
" }\n",
" }\n",
" },\n",
")\n",
"ext_nodes = []\n",
"for idx, item in enumerate(docs):\n",
" if item.text == top_text:\n",
" ext_nodes.append(NodeWithScore(node=Node(text=item.text), score=1.0))\n",
" if idx > 0:\n",
" ext_nodes.append(\n",
" NodeWithScore(node=Node(text=docs[idx - 1].text), score=1.0)\n",
" )\n",
" if idx < len(docs) - 1:\n",
" ext_nodes.append(\n",
" NodeWithScore(node=Node(text=docs[idx + 1].text), score=1.0)\n",
" )\n",
" break\n",
"\n",
"synthesizer = get_response_synthesizer(llm=GEN_MODEL)\n",
"res = synthesizer.synthesize(query=QUERY, nodes=ext_nodes)\n",
"console.print(f\"👤: {QUERY}\\n🤖: {res.response.strip()}\")"
]
}
],
"metadata": {

View File

@@ -163,37 +163,3 @@ result = converter.convert(source)
## Limit resource usage
You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads.
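For example, a minimal sketch (the value must be set before importing `docling`, so that it takes effect when the underlying libraries initialize their thread pools):
```python
import os

# Assumption for illustration: limit OpenMP to 2 threads; set this before
# importing docling so the setting takes effect.
os.environ["OMP_NUM_THREADS"] = "2"

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
```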
## Use specific backend converters
!!! note
    This section discusses directly invoking a [backend](../concepts/architecture.md),
    i.e. using a low-level API. This should only be done when necessary. For most cases,
    using a `DocumentConverter` (high-level API) as discussed in the sections above
    should suffice, and it is the recommended way.
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)).
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example.
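For instance, a short sketch that accepts only HTML and PDF inputs (documents in other formats will be rejected):
```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

# Restrict the converter to HTML and PDF inputs.
converter = DocumentConverter(allowed_formats=[InputFormat.HTML, InputFormat.PDF])
result = converter.convert("https://en.wikipedia.org/wiki/Duck")
print(result.document.export_to_markdown())
```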
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
```python
import urllib.request
from io import BytesIO

from docling.backend.html_backend import HTMLDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

# Fetch the raw HTML content.
url = "https://en.wikipedia.org/wiki/Duck"
text = urllib.request.urlopen(url).read()

# Wrap the bytes in an InputDocument and parse them with the HTML backend.
in_doc = InputDocument(
    path_or_stream=BytesIO(text),
    format=InputFormat.HTML,
    backend=HTMLDocumentBackend,
    filename="duck.html",
)
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
dl_doc = backend.convert()
print(dl_doc.export_to_markdown())
```