diff --git a/docs/examples/backend_xml_rag.ipynb b/docs/examples/backend_xml_rag.ipynb index 60c5839a..bd44754b 100644 --- a/docs/examples/backend_xml_rag.ipynb +++ b/docs/examples/backend_xml_rag.ipynb @@ -431,130 +431,6 @@ "print(f\"Fetched and exported {doc_num} documents.\")" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Using the backend converter (optional)\n", - "\n", - "- The custom backend converters `PubMedDocumentBackend` and `PatentUsptoDocumentBackend` aim at handling the parsing of PMC articles and USPTO patents, respectively.\n", - "- As any other backends, you can leverage the function `is_valid()` to check if the input document is supported by the this backend.\n", - "- Note that some XML sections in the original USPTO zip file may not represent patents, like sequence listings, and therefore they will show as invalid by the backend." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True\n", - "Document ipg241217-1.xml is a valid patent? True\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "316241ca89a843bda3170f2a5c76c639", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0/4014 [00:00 Token indices sequence length is longer than the specified maximum sequence length for this model\n", + "\n", + "This is a _false alarm_ and you may get more background explanation in [Docling's FAQ](https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model) page." ] }, { @@ -396,7 +427,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "2025-09-10 13:16:53,752 - WARNING - GET http://localhost:9200/docling-index [status:404 request:0.015s]\n" + "2025-10-24 15:05:49,841 - WARNING - GET http://localhost:9200/docling-index [status:404 request:0.006s]\n" ] } ], @@ -407,9 +438,10 @@ "embed_field = \"embedding\"\n", "\n", "client = OpensearchVectorClient(\n", - " endpoint=\"http://localhost:9200\",\n", + " endpoint=OPENSEARCH_ENDPOINT,\n", " index=OPENSEARCH_INDEX,\n", " dim=embed_dim,\n", + " engine=\"faiss\",\n", " embedding_field=embed_field,\n", " text_field=text_field,\n", ")\n", @@ -450,20 +482,24 @@ "data": { "text/html": [ "
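For reference, the chunker wiring that produces these token-length warnings can be sketched as follows. This is a minimal sketch: the embedding model id is an assumption here (use whichever model backs your `embed_model`), and `EMBED_MAX_TOKENS` mirrors the variable used in this notebook.

```python
from docling.chunking import HybridChunker
from transformers import AutoTokenizer

# Keep the chunker's token budget aligned with the embedding model, so that no
# contextualized chunk exceeds what the embedding model can actually encode.
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # assumption: replace with your model
EMBED_MAX_TOKENS = 512

tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)
chunker = HybridChunker(tokenizer=tokenizer, max_tokens=EMBED_MAX_TOKENS)
```

The warning quoted above is emitted by the tokenizer while the chunker measures candidate chunks, not while embedding them, which is why it can safely be ignored.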
πŸ‘€: Which are the main AI models in Docling?\n",
-       "πŸ€–: Docling primarily utilizes two AI models. The first one is a layout analysis model, \n",
-       "serving as an accurate object-detector for page elements. The second model is \n",
-       "TableFormer, a state-of-the-art table structure recognition model. Both models are \n",
-       "pre-trained and their weights are hosted on Hugging Face. They also power the \n",
-       "deepsearch-experience, a cloud-native service for knowledge exploration tasks.\n",
+       "πŸ€–: The two main AI models used in Docling are:\n",
+       "\n",
+       "1. A layout analysis model, an accurate object-detector for page elements \n",
+       "2. TableFormer, a state-of-the-art table structure recognition model\n",
+       "\n",
+       "These models were initially released as part of the open-source Docling package to help \n",
+       "with document understanding tasks.\n",
        "
\n" ], "text/plain": [ "πŸ‘€: Which are the main AI models in Docling?\n", - "πŸ€–: Docling primarily utilizes two AI models. The first one is a layout analysis model, \n", - "serving as an accurate object-detector for page elements. The second model is \n", - "TableFormer, a state-of-the-art table structure recognition model. Both models are \n", - "pre-trained and their weights are hosted on Hugging Face. They also power the \n", - "deepsearch-experience, a cloud-native service for knowledge exploration tasks.\n" + "πŸ€–: The two main AI models used in Docling are:\n", + "\n", + "\u001b[1;36m1\u001b[0m. A layout analysis model, an accurate object-detector for page elements \n", + "\u001b[1;36m2\u001b[0m. TableFormer, a state-of-the-art table structure recognition model\n", + "\n", + "These models were initially released as part of the open-source Docling package to help \n", + "with document understanding tasks.\n" ] }, "metadata": {}, @@ -499,23 +535,23 @@ { "data": { "text/html": [ - "
πŸ‘€: What are the performance metrics of Docling-native PDF backend with 16 threads?\n",
-       "πŸ€–: The Docling-native PDF backend, when utilized with 16 threads on an Apple M3 Max \n",
-       "system, completed the processing in approximately 167 seconds. It achieved a throughput \n",
-       "of about 1.34 pages per second and peaked at a memory usage of 6.20 GB (resident set \n",
-       "size). On an Intel Xeon E5-2690 system with the same thread count, it took around 244 \n",
-       "seconds to process, managed a throughput of 0.92 pages per second, and reached a peak \n",
-       "memory usage of 6.16 GB.\n",
+       "
πŸ‘€: What is the time to solution with the native backend on Intel?\n",
+       "πŸ€–: The time to solution (TTS) for the native backend on Intel is:\n",
+       "- For Apple M3 Max (16 cores): 375 seconds \n",
+       "- For Intel(R) Xeon E5-2690, native backend: 244 seconds\n",
+       "\n",
+       "So the TTS with the native backend on Intel ranges from approximately 244 to 375 seconds\n",
+       "depending on the specific configuration.\n",
        "
\n" ], "text/plain": [ - "πŸ‘€: What are the performance metrics of Docling-native PDF backend with \u001b[1;36m16\u001b[0m threads?\n", - "πŸ€–: The Docling-native PDF backend, when utilized with \u001b[1;36m16\u001b[0m threads on an Apple M3 Max \n", - "system, completed the processing in approximately \u001b[1;36m167\u001b[0m seconds. It achieved a throughput \n", - "of about \u001b[1;36m1.34\u001b[0m pages per second and peaked at a memory usage of \u001b[1;36m6.20\u001b[0m GB \u001b[1m(\u001b[0mresident set \n", - "size\u001b[1m)\u001b[0m. On an Intel Xeon E5-\u001b[1;36m2690\u001b[0m system with the same thread count, it took around \u001b[1;36m244\u001b[0m \n", - "seconds to process, managed a throughput of \u001b[1;36m0.92\u001b[0m pages per second, and reached a peak \n", - "memory usage of \u001b[1;36m6.16\u001b[0m GB.\n" + "πŸ‘€: What is the time to solution with the native backend on Intel?\n", + "πŸ€–: The time to solution \u001b[1m(\u001b[0mTTS\u001b[1m)\u001b[0m for the native backend on Intel is:\n", + "- For Apple M3 Max \u001b[1m(\u001b[0m\u001b[1;36m16\u001b[0m cores\u001b[1m)\u001b[0m: \u001b[1;36m375\u001b[0m seconds \n", + "- For \u001b[1;35mIntel\u001b[0m\u001b[1m(\u001b[0mR\u001b[1m)\u001b[0m Xeon E5-\u001b[1;36m2690\u001b[0m, native backend: \u001b[1;36m244\u001b[0m seconds\n", + "\n", + "So the TTS with the native backend on Intel ranges from approximately \u001b[1;36m244\u001b[0m to \u001b[1;36m375\u001b[0m seconds\n", + "depending on the specific configuration.\n" ] }, "metadata": {}, @@ -523,9 +559,7 @@ } ], "source": [ - "QUERY = (\n", - " \"What are the performance metrics of Docling-native PDF backend with 16 threads?\"\n", - ")\n", + "QUERY = \"What is the time to solution with the native backend on Intel?\"\n", "query_engine = index.as_query_engine(llm=GEN_MODEL)\n", "res = query_engine.query(QUERY)\n", "console.print(f\"πŸ‘€: {QUERY}\\nπŸ€–: {res.response.strip()}\")" @@ -546,7 +580,15 @@ "cell_type": "code", "execution_count": 11, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors\n" + ] + } + ], "source": [ "class MDTableSerializerProvider(ChunkingSerializerProvider):\n", " def get_serializer(self, doc):\n", @@ -561,7 +603,9 @@ "client.clear()\n", "vector_store.clear()\n", "\n", - "chunker = HierarchicalChunker(\n", + "chunker = HybridChunker(\n", + " tokenizer=tokenizer,\n", + " max_tokens=EMBED_MAX_TOKENS,\n", " serializer_provider=MDTableSerializerProvider(),\n", ")\n", "node_parser = DoclingNodeParser(chunker=chunker)\n", @@ -573,13 +617,6 @@ ")" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Observe that the generated response is now more accurate. Refer to the [Advanced chunking & serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/) example for more details on serialization strategies." - ] - }, { "cell_type": "code", "execution_count": 12, @@ -588,19 +625,25 @@ { "data": { "text/html": [ - "
πŸ‘€: Which backend is faster on Intel with 4 threads?\n",
-       "πŸ€–: The pypdfium backend is faster than the Docling-native PDF backend for an Intel Xeon\n",
-       "E5-2690 CPU with a thread budget of 4, as indicated in Table 1. The pypdfium backend \n",
-       "completes the processing in 239 seconds, achieving a throughput of 0.94 pages per \n",
-       "second, while the Docling-native PDF backend takes 375 seconds.\n",
+       "
πŸ‘€: What is the time to solution with the native backend on Intel?\n",
+       "πŸ€–: The table shows that for the native backend on Intel systems, the time-to-solution \n",
+       "(TTS) ranges from 239 seconds to 375 seconds. Specifically:\n",
+       "- With 4 threads, the TTS is 239 seconds.\n",
+       "- With 16 threads, the TTS is 244 seconds.\n",
+       "\n",
+       "So the time to solution with the native backend on Intel varies between approximately \n",
+       "239 and 375 seconds depending on the thread budget used.\n",
        "
\n" ], "text/plain": [ - "πŸ‘€: Which backend is faster on Intel with \u001b[1;36m4\u001b[0m threads?\n", - "πŸ€–: The pypdfium backend is faster than the Docling-native PDF backend for an Intel Xeon\n", - "E5-\u001b[1;36m2690\u001b[0m CPU with a thread budget of \u001b[1;36m4\u001b[0m, as indicated in Table \u001b[1;36m1\u001b[0m. The pypdfium backend \n", - "completes the processing in \u001b[1;36m239\u001b[0m seconds, achieving a throughput of \u001b[1;36m0.94\u001b[0m pages per \n", - "second, while the Docling-native PDF backend takes \u001b[1;36m375\u001b[0m seconds.\n" + "πŸ‘€: What is the time to solution with the native backend on Intel?\n", + "πŸ€–: The table shows that for the native backend on Intel systems, the time-to-solution \n", + "\u001b[1m(\u001b[0mTTS\u001b[1m)\u001b[0m ranges from \u001b[1;36m239\u001b[0m seconds to \u001b[1;36m375\u001b[0m seconds. Specifically:\n", + "- With \u001b[1;36m4\u001b[0m threads, the TTS is \u001b[1;36m239\u001b[0m seconds.\n", + "- With \u001b[1;36m16\u001b[0m threads, the TTS is \u001b[1;36m244\u001b[0m seconds.\n", + "\n", + "So the time to solution with the native backend on Intel varies between approximately \n", + "\u001b[1;36m239\u001b[0m and \u001b[1;36m375\u001b[0m seconds depending on the thread budget used.\n" ] }, "metadata": {}, @@ -609,7 +652,6 @@ ], "source": [ "query_engine = index.as_query_engine(llm=GEN_MODEL)\n", - "QUERY = \"Which backend is faster on Intel with 4 threads?\"\n", "res = query_engine.query(QUERY)\n", "console.print(f\"πŸ‘€: {QUERY}\\nπŸ€–: {res.response.strip()}\")" ] @@ -618,7 +660,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Refer to the [Advanced chunking & serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/) example for more details on serialization strategies." + "Observe that the generated response is now more accurate. Refer to the [Advanced chunking & serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/) example for more details on serialization strategies." ] }, { @@ -671,9 +713,14 @@ "
[\n",
        "β”‚   {\n",
        "β”‚   β”‚   'k': 1,\n",
-       "β”‚   β”‚   'score': 0.6800267,\n",
-       "β”‚   β”‚   'text': 'If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it '+90,\n",
-       "β”‚   β”‚   'items': [{'ref': '#/texts/68', 'label': 'text'}]\n",
+       "β”‚   β”‚   'score': 0.694972,\n",
+       "β”‚   β”‚   'text': '- [13] B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. Doclaynet: a large humanannotated dataset for document-layout segmentation. pages 3743-3751, 2022.\\n- [14] pypdf Maintainers. pypdf: '+314,\n",
+       "β”‚   β”‚   'items': [\n",
+       "β”‚   β”‚   β”‚   {'ref': '#/texts/93', 'label': 'list_item'},\n",
+       "β”‚   β”‚   β”‚   {'ref': '#/texts/94', 'label': 'list_item'},\n",
+       "β”‚   β”‚   β”‚   {'ref': '#/texts/95', 'label': 'list_item'},\n",
+       "β”‚   β”‚   β”‚   {'ref': '#/texts/96', 'label': 'list_item'}\n",
+       "β”‚   β”‚   ]\n",
        "β”‚   }\n",
        "]\n",
        "
\n" @@ -682,9 +729,14 @@ "\u001b[1m[\u001b[0m\n", "\u001b[2;32mβ”‚ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'k'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", - "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6800267\u001b[0m,\n", - "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it '\u001b[0m+\u001b[1;36m90\u001b[0m,\n", - "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/68'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'text'\u001b[0m\u001b[1m}\u001b[0m\u001b[1m]\u001b[0m\n", + "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.694972\u001b[0m,\n", + "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'- \u001b[0m\u001b[32m[\u001b[0m\u001b[32m13\u001b[0m\u001b[32m]\u001b[0m\u001b[32m B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. Doclaynet: a large humanannotated dataset for document-layout segmentation. pages 3743-3751, 2022.\\n- \u001b[0m\u001b[32m[\u001b[0m\u001b[32m14\u001b[0m\u001b[32m]\u001b[0m\u001b[32m pypdf Maintainers. pypdf: '\u001b[0m+\u001b[1;36m314\u001b[0m,\n", + "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\n", + "\u001b[2;32mβ”‚ β”‚ β”‚ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/93'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n", + "\u001b[2;32mβ”‚ β”‚ β”‚ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/94'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n", + "\u001b[2;32mβ”‚ β”‚ β”‚ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/95'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m,\n", + "\u001b[2;32mβ”‚ β”‚ β”‚ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/96'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'list_item'\u001b[0m\u001b[1m}\u001b[0m\n", + "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32mβ”‚ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m]\u001b[0m\n" ] @@ -728,9 +780,9 @@ "
[\n",
        "β”‚   {\n",
        "β”‚   β”‚   'k': 1,\n",
-       "β”‚   β”‚   'score': 0.6078317,\n",
-       "β”‚   β”‚   'text': 'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TT'+1014,\n",
-       "β”‚   β”‚   'items': [{'ref': '#/texts/72', 'label': 'caption'}, {'ref': '#/tables/0', 'label': 'table'}]\n",
+       "β”‚   β”‚   'score': 0.6238112,\n",
+       "β”‚   β”‚   'text': 'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TT'+515,\n",
+       "β”‚   β”‚   'items': [{'ref': '#/tables/0', 'label': 'table'}, {'ref': '#/tables/0', 'label': 'table'}]\n",
        "β”‚   }\n",
        "]\n",
        "
\n" @@ -739,9 +791,9 @@ "\u001b[1m[\u001b[0m\n", "\u001b[2;32mβ”‚ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'k'\u001b[0m: \u001b[1;36m1\u001b[0m,\n", - "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6078317\u001b[0m,\n", - "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u001b[0m\u001b[32m(\u001b[0m\u001b[32mTT'\u001b[0m+\u001b[1;36m1014\u001b[0m,\n", - "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/texts/72'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'caption'\u001b[0m\u001b[1m}\u001b[0m, \u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/tables/0'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'table'\u001b[0m\u001b[1m}\u001b[0m\u001b[1m]\u001b[0m\n", + "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.6238112\u001b[0m,\n", + "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'text'\u001b[0m: \u001b[32m'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u001b[0m\u001b[32m(\u001b[0m\u001b[32mTT'\u001b[0m+\u001b[1;36m515\u001b[0m,\n", + "\u001b[2;32mβ”‚ β”‚ \u001b[0m\u001b[32m'items'\u001b[0m: \u001b[1m[\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/tables/0'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'table'\u001b[0m\u001b[1m}\u001b[0m, \u001b[1m{\u001b[0m\u001b[32m'ref'\u001b[0m: \u001b[32m'#/tables/0'\u001b[0m, \u001b[32m'label'\u001b[0m: \u001b[32m'table'\u001b[0m\u001b[1m}\u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32mβ”‚ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m]\u001b[0m\n" ] @@ -816,7 +868,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "2025-09-10 13:17:10,104 - WARNING - GET http://localhost:9200/docling-index-rrf [status:404 request:0.001s]\n" + "2025-10-24 15:06:05,175 - WARNING - GET http://localhost:9200/docling-index-rrf [status:404 request:0.001s]\n" ] } ], @@ -825,6 +877,7 @@ " endpoint=OPENSEARCH_ENDPOINT,\n", " index=f\"{OPENSEARCH_INDEX}-rrf\",\n", " dim=embed_dim,\n", + " engine=\"faiss\",\n", " embedding_field=embed_field,\n", " text_field=text_field,\n", " search_pipeline=\"rrf-pipeline\",\n", @@ -857,6 +910,13 @@ "data": { "text/html": [ "
*** k=1 ***\n",
+       "Docling is designed to allow easy extension of the model library and pipelines. In the \n",
+       "future, we plan to extend Docling with several more models, such as a figure-classifier \n",
+       "model, an equationrecognition model, a code-recognition model and more. This will help \n",
+       "improve the quality of conversion for specific types of content, as well as augment \n",
+       "extracted document metadata with additional information. Further investment into testing\n",
+       "and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
+       "on our roadmap, too.\n",
        "We encourage everyone to propose or implement additional features and models, and will \n",
        "gladly take your inputs and contributions under review . The codebase of Docling is open\n",
        "for use and contribution, under the MIT license agreement and in alignment with our \n",
@@ -866,6 +926,13 @@
       ],
       "text/plain": [
        "*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n",
+       "Docling is designed to allow easy extension of the model library and pipelines. In the \n",
+       "future, we plan to extend Docling with several more models, such as a figure-classifier \n",
+       "model, an equationrecognition model, a code-recognition model and more. This will help \n",
+       "improve the quality of conversion for specific types of content, as well as augment \n",
+       "extracted document metadata with additional information. Further investment into testing\n",
+       "and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
+       "on our roadmap, too.\n",
        "We encourage everyone to propose or implement additional features and models, and will \n",
        "gladly take your inputs and contributions under review . The codebase of Docling is open\n",
        "for use and contribution, under the MIT license agreement and in alignment with our \n",
@@ -880,20 +947,26 @@
      "data": {
       "text/html": [
        "
*** k=2 ***\n",
-       "Optionally, you can configure custom pipeline features and runtime options, such as \n",
-       "turning on or off features (e.g. OCR, table structure recognition), enforcing limits on \n",
-       "the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
-       "and options are documented in the README file. Docling also provides a Dockerfile to \n",
-       "demonstrate how to install and run it inside a container.\n",
+       "In the final pipeline stage, Docling assembles all prediction results produced on each \n",
+       "page into a well-defined datatype that encapsulates a converted document, as defined in \n",
+       "the auxiliary package docling-core . The generated document object is passed through a \n",
+       "post-processing model which leverages several algorithms to augment features, such as \n",
+       "detection of the document language, correcting the reading order, matching figures with \n",
+       "captions and labelling metadata such as title, authors and references. The final output \n",
+       "can then be serialized to JSON or transformed into a Markdown representation at the \n",
+       "users request.\n",
        "
\n" ], "text/plain": [ "*** \u001b[33mk\u001b[0m=\u001b[1;36m2\u001b[0m ***\n", - "Optionally, you can configure custom pipeline features and runtime options, such as \n", - "turning on or off features \u001b[1m(\u001b[0me.g. OCR, table structure recognition\u001b[1m)\u001b[0m, enforcing limits on \n", - "the input document size, and defining the budget of CPU threads. Advanced usage examples\n", - "and options are documented in the README file. \u001b[1;33mDocling also provides a Dockerfile\u001b[0m to \n", - "demonstrate how to install and run it inside a container.\n" + "In the final pipeline stage, Docling assembles all prediction results produced on each \n", + "page into a well-defined datatype that encapsulates a converted document, as defined in \n", + "the auxiliary package docling-core . The generated document object is passed through a \n", + "post-processing model which leverages several algorithms to augment features, such as \n", + "detection of the document language, correcting the reading order, matching figures with \n", + "captions and labelling metadata such as title, authors and references. The final output \n", + "can then be serialized to JSON or transformed into a Markdown representation at the \n", + "users request.\n" ] }, "metadata": {}, @@ -903,24 +976,32 @@ "data": { "text/html": [ "
*** k=3 ***\n",
-       "Docling is designed to allow easy extension of the model library and pipelines. In the \n",
-       "future, we plan to extend Docling with several more models, such as a figure-classifier \n",
-       "model, an equationrecognition model, a code-recognition model and more. This will help \n",
-       "improve the quality of conversion for specific types of content, as well as augment \n",
-       "extracted document metadata with additional information. Further investment into testing\n",
-       "and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
-       "on our roadmap, too.\n",
+       "```\n",
+       "source = \"https://arxiv.org/pdf/2206.01062\" # PDF path or URL converter = \n",
+       "DocumentConverter() result = converter.convert_single(source) \n",
+       "print(result.render_as_markdown()) # output: \"## DocLayNet: A Large Human -Annotated \n",
+       "Dataset for Document -Layout Analysis [...]\"\n",
+       "```\n",
+       "Optionally, you can configure custom pipeline features and runtime options, such as \n",
+       "turning on or off features (e.g. OCR, table structure recognition), enforcing limits on \n",
+       "the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
+       "and options are documented in the README file. Docling also provides a Dockerfile to \n",
+       "demonstrate how to install and run it inside a container.\n",
        "
\n" ], "text/plain": [ "*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n", - "Docling is designed to allow easy extension of the model library and pipelines. In the \n", - "future, we plan to extend Docling with several more models, such as a figure-classifier \n", - "model, an equationrecognition model, a code-recognition model and more. This will help \n", - "improve the quality of conversion for specific types of content, as well as augment \n", - "extracted document metadata with additional information. Further investment into testing\n", - "and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n", - "on our roadmap, too.\n" + "```\n", + "source = \u001b[32m\"https://arxiv.org/pdf/2206.01062\"\u001b[0m # PDF path or URL converter = \n", + "\u001b[1;35mDocumentConverter\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m result = \u001b[1;35mconverter.convert_single\u001b[0m\u001b[1m(\u001b[0msource\u001b[1m)\u001b[0m \n", + "\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0m\u001b[1;35mresult.render_as_markdown\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # output: \u001b[32m\"## DocLayNet: A Large Human -Annotated \u001b[0m\n", + "\u001b[32mDataset for Document -Layout Analysis \u001b[0m\u001b[32m[\u001b[0m\u001b[32m...\u001b[0m\u001b[32m]\u001b[0m\u001b[32m\"\u001b[0m\n", + "```\n", + "Optionally, you can configure custom pipeline features and runtime options, such as \n", + "turning on or off features \u001b[1m(\u001b[0me.g. OCR, table structure recognition\u001b[1m)\u001b[0m, enforcing limits on \n", + "the input document size, and defining the budget of CPU threads. Advanced usage examples\n", + "and options are documented in the README file. \u001b[1;33mDocling also provides a Dockerfile\u001b[0m to \n", + "demonstrate how to install and run it inside a container.\n" ] }, "metadata": {}, @@ -956,6 +1037,12 @@ "data": { "text/html": [ "
*** k=1 ***\n",
+       "```\n",
+       "source = \"https://arxiv.org/pdf/2206.01062\" # PDF path or URL converter = \n",
+       "DocumentConverter() result = converter.convert_single(source) \n",
+       "print(result.render_as_markdown()) # output: \"## DocLayNet: A Large Human -Annotated \n",
+       "Dataset for Document -Layout Analysis [...]\"\n",
+       "```\n",
        "Optionally, you can configure custom pipeline features and runtime options, such as \n",
        "turning on or off features (e.g. OCR, table structure recognition), enforcing limits on \n",
        "the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
@@ -965,6 +1052,12 @@
       ],
       "text/plain": [
        "*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n",
+       "```\n",
+       "source = \u001b[32m\"https://arxiv.org/pdf/2206.01062\"\u001b[0m # PDF path or URL converter = \n",
+       "\u001b[1;35mDocumentConverter\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m result = \u001b[1;35mconverter.convert_single\u001b[0m\u001b[1m(\u001b[0msource\u001b[1m)\u001b[0m \n",
+       "\u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0m\u001b[1;35mresult.render_as_markdown\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\u001b[1m)\u001b[0m # output: \u001b[32m\"## DocLayNet: A Large Human -Annotated \u001b[0m\n",
+       "\u001b[32mDataset for Document -Layout Analysis \u001b[0m\u001b[32m[\u001b[0m\u001b[32m...\u001b[0m\u001b[32m]\u001b[0m\u001b[32m\"\u001b[0m\n",
+       "```\n",
        "Optionally, you can configure custom pipeline features and runtime options, such as \n",
        "turning on or off features \u001b[1m(\u001b[0me.g. OCR, table structure recognition\u001b[1m)\u001b[0m, enforcing limits on \n",
        "the input document size, and defining the budget of CPU threads. Advanced usage examples\n",
@@ -975,6 +1068,204 @@
      "metadata": {},
      "output_type": "display_data"
     },
+    {
+     "data": {
+      "text/html": [
+       "
*** k=2 ***\n",
+       "Docling is designed to allow easy extension of the model library and pipelines. In the \n",
+       "future, we plan to extend Docling with several more models, such as a figure-classifier \n",
+       "model, an equationrecognition model, a code-recognition model and more. This will help \n",
+       "improve the quality of conversion for specific types of content, as well as augment \n",
+       "extracted document metadata with additional information. Further investment into testing\n",
+       "and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n",
+       "on our roadmap, too.\n",
+       "We encourage everyone to propose or implement additional features and models, and will \n",
+       "gladly take your inputs and contributions under review . The codebase of Docling is open\n",
+       "for use and contribution, under the MIT license agreement and in alignment with our \n",
+       "contributing guidelines included in the Docling repository. If you use Docling in your \n",
+       "projects, please consider citing this technical report.\n",
+       "
\n" + ], + "text/plain": [ + "*** \u001b[33mk\u001b[0m=\u001b[1;36m2\u001b[0m ***\n", + "Docling is designed to allow easy extension of the model library and pipelines. In the \n", + "future, we plan to extend Docling with several more models, such as a figure-classifier \n", + "model, an equationrecognition model, a code-recognition model and more. This will help \n", + "improve the quality of conversion for specific types of content, as well as augment \n", + "extracted document metadata with additional information. Further investment into testing\n", + "and optimizing GPU acceleration as well as improving the Docling-native PDF backend are \n", + "on our roadmap, too.\n", + "We encourage everyone to propose or implement additional features and models, and will \n", + "gladly take your inputs and contributions under review . The codebase of Docling is open\n", + "for use and contribution, under the MIT license agreement and in alignment with our \n", + "contributing guidelines included in the Docling repository. If you use Docling in your \n", + "projects, please consider citing this technical report.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
*** k=3 ***\n",
+       "We therefore decided to provide multiple backend choices, and additionally open-source a\n",
+       "custombuilt PDF parser, which is based on the low-level qpdf [4] library. It is made \n",
+       "available in a separate package named docling-parse and powers the default PDF backend \n",
+       "in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \n",
+       "be a safe backup choice in certain cases, e.g. if issues are seen with particular font \n",
+       "encodings.\n",
+       "
\n" + ], + "text/plain": [ + "*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n", + "We therefore decided to provide multiple backend choices, and additionally open-source a\n", + "custombuilt PDF parser, which is based on the low-level qpdf \u001b[1m[\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1m]\u001b[0m library. It is made \n", + "available in a separate package named docling-parse and powers the default PDF backend \n", + "in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \n", + "be a safe backup choice in certain cases, e.g. if issues are seen with particular font \n", + "encodings.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "retriever_rrf = index_hybrid.as_retriever(\n", + " vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3\n", + ")\n", + "nodes = retriever_rrf.retrieve(QUERY)\n", + "for idx, item in enumerate(nodes):\n", + " console.print(\n", + " f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Context expansion\n", + "\n", + "Using small chunks can offer several benefits: it increases retrieval precision and it keeps the answer generation tightly focused, which improves accuracy, reduces hallucination, and speeds up inferece.\n", + "However, your RAG system may overlook contextual information necessary for producing a fully grounded response.\n", + "\n", + "Docling's preservation of document structure enables you to employ various strategies for enriching the context available during answer generation within the RAG pipeline.\n", + "For example, after identifying the most relevant chunk, you might include adjacent chunks from the same section as additional groudning material before generating the final answer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the following example, the generated response is wrong, since the top retrieved chunks do not contain all the information that is required to answer the question." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
πŸ‘€: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
+       "have limited resources and complex tables?\n",
+       "πŸ€–: According to the tests in this section using both the MacBook Pro M3 Max and \n",
+       "bare-metal server running Ubuntu 20.04 LTS on an Intel Xeon E5-2690 CPU with a fixed \n",
+       "thread budget of 4, Docling achieved faster processing speeds when using the \n",
+       "custom-built PDF backend based on the low-level qpdf library (docling-parse) compared to\n",
+       "the alternative PDF backend relying on pypdfium.\n",
+       "\n",
+       "Furthermore, the context mentions that Docling provides a separate package named \n",
+       "docling-ibm-models which includes pre-trained weights and inference code for \n",
+       "TableFormer, a state-of-the-art table structure recognition model. This suggests that if\n",
+       "you have complex tables in your documents, using this specialized table recognition \n",
+       "model could be beneficial.\n",
+       "\n",
+       "Therefore, based on the tests with arXiv papers and IBM Redbooks, if you have limited \n",
+       "resources (likely referring to computational power) and need to process documents \n",
+       "containing complex tables, it would be recommended to use the docling-parse PDF backend \n",
+       "along with the TableFormer AI model from docling-ibm-models. This combination should \n",
+       "provide a good balance of performance and table recognition capabilities for your \n",
+       "specific needs.\n",
+       "
\n" + ], + "text/plain": [ + "πŸ‘€: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n", + "have limited resources and complex tables?\n", + "πŸ€–: According to the tests in this section using both the MacBook Pro M3 Max and \n", + "bare-metal server running Ubuntu \u001b[1;36m20.04\u001b[0m LTS on an Intel Xeon E5-\u001b[1;36m2690\u001b[0m CPU with a fixed \n", + "thread budget of \u001b[1;36m4\u001b[0m, Docling achieved faster processing speeds when using the \n", + "custom-built PDF backend based on the low-level qpdf library \u001b[1m(\u001b[0mdocling-parse\u001b[1m)\u001b[0m compared to\n", + "the alternative PDF backend relying on pypdfium.\n", + "\n", + "Furthermore, the context mentions that Docling provides a separate package named \n", + "docling-ibm-models which includes pre-trained weights and inference code for \n", + "TableFormer, a state-of-the-art table structure recognition model. This suggests that if\n", + "you have complex tables in your documents, using this specialized table recognition \n", + "model could be beneficial.\n", + "\n", + "Therefore, based on the tests with arXiv papers and IBM Redbooks, if you have limited \n", + "resources \u001b[1m(\u001b[0mlikely referring to computational power\u001b[1m)\u001b[0m and need to process documents \n", + "containing complex tables, it would be recommended to use the docling-parse PDF backend \n", + "along with the TableFormer AI model from docling-ibm-models. This combination should \n", + "provide a good balance of performance and table recognition capabilities for your \n", + "specific needs.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "QUERY = \"According to the tests with arXiv and IBM Redbooks, which backend should I use if I have limited resources and complex tables?\"\n", + "query_rrf = index_hybrid.as_query_engine(\n", + " vector_store_query_mode=VectorStoreQueryMode.HYBRID,\n", + " llm=GEN_MODEL,\n", + " similarity_top_k=3,\n", + ")\n", + "res = query_rrf.query(QUERY)\n", + "console.print(f\"πŸ‘€: {QUERY}\\nπŸ€–: {res.response.strip()}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
*** k=1 ***\n",
+       "In this section, we establish some reference numbers for the processing speed of Docling\n",
+       "and the resource budget it requires. All tests in this section are run with default \n",
+       "options on our standard test set distributed with Docling, which consists of three \n",
+       "papers from arXiv and two IBM Redbooks, with a total of 225 pages. Measurements were \n",
+       "taken using both available PDF backends on two different hardware systems: one MacBook \n",
+       "Pro M3 Max, and one bare-metal server running Ubuntu 20.04 LTS on an Intel Xeon E5-2690 \n",
+       "CPU. For reproducibility, we fixed the thread budget (through setting OMP NUM THREADS \n",
+       "environment variable ) once to 4 (Docling default) and once to 16 (equal to full core \n",
+       "count on the test hardware). All results are shown in Table 1.\n",
+       "
\n" + ], + "text/plain": [ + "*** \u001b[33mk\u001b[0m=\u001b[1;36m1\u001b[0m ***\n", + "In this section, we establish some reference numbers for the processing speed of Docling\n", + "and the resource budget it requires. All tests in this section are run with default \n", + "options on our standard test set distributed with Docling, which consists of three \n", + "papers from arXiv and two IBM Redbooks, with a total of \u001b[1;36m225\u001b[0m pages. Measurements were \n", + "taken using both available PDF backends on two different hardware systems: one MacBook \n", + "Pro M3 Max, and one bare-metal server running Ubuntu \u001b[1;36m20.04\u001b[0m LTS on an Intel Xeon E5-\u001b[1;36m2690\u001b[0m \n", + "CPU. For reproducibility, we fixed the thread budget \u001b[1m(\u001b[0mthrough setting OMP NUM THREADS \n", + "environment variable \u001b[1m)\u001b[0m once to \u001b[1;36m4\u001b[0m \u001b[1m(\u001b[0mDocling default\u001b[1m)\u001b[0m and once to \u001b[1;36m16\u001b[0m \u001b[1m(\u001b[0mequal to full core \n", + "count on the test hardware\u001b[1m)\u001b[0m. All results are shown in Table \u001b[1;36m1\u001b[0m.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, { "data": { "text/html": [ @@ -1004,20 +1295,26 @@ "data": { "text/html": [ "
*** k=3 ***\n",
-       "We encourage everyone to propose or implement additional features and models, and will \n",
-       "gladly take your inputs and contributions under review . The codebase of Docling is open\n",
-       "for use and contribution, under the MIT license agreement and in alignment with our \n",
-       "contributing guidelines included in the Docling repository. If you use Docling in your \n",
-       "projects, please consider citing this technical report.\n",
+       "As part of Docling, we initially release two highly capable AI models to the open-source\n",
+       "community, which have been developed and published recently by our team. The first model\n",
+       "is a layout analysis model, an accurate object-detector for page elements [13]. The \n",
+       "second model is TableFormer [12, 9], a state-of-the-art table structure recognition \n",
+       "model. We provide the pre-trained weights (hosted on huggingface) and a separate package\n",
+       "for the inference code as docling-ibm-models . Both models are also powering the \n",
+       "open-access deepsearch-experience, our cloud-native service for knowledge exploration \n",
+       "tasks.\n",
        "
\n" ], "text/plain": [ "*** \u001b[33mk\u001b[0m=\u001b[1;36m3\u001b[0m ***\n", - "We encourage everyone to propose or implement additional features and models, and will \n", - "gladly take your inputs and contributions under review . The codebase of Docling is open\n", - "for use and contribution, under the MIT license agreement and in alignment with our \n", - "contributing guidelines included in the Docling repository. If you use Docling in your \n", - "projects, please consider citing this technical report.\n" + "As part of Docling, we initially release two highly capable AI models to the open-source\n", + "community, which have been developed and published recently by our team. The first model\n", + "is a layout analysis model, an accurate object-detector for page elements \u001b[1m[\u001b[0m\u001b[1;36m13\u001b[0m\u001b[1m]\u001b[0m. The \n", + "second model is TableFormer \u001b[1m[\u001b[0m\u001b[1;36m12\u001b[0m, \u001b[1;36m9\u001b[0m\u001b[1m]\u001b[0m, a state-of-the-art table structure recognition \n", + "model. We provide the pre-trained weights \u001b[1m(\u001b[0mhosted on huggingface\u001b[1m)\u001b[0m and a separate package\n", + "for the inference code as docling-ibm-models . Both models are also powering the \n", + "open-access deepsearch-experience, our cloud-native service for knowledge exploration \n", + "tasks.\n" ] }, "metadata": {}, @@ -1025,15 +1322,105 @@ } ], "source": [ - "retriever_rrf = index_hybrid.as_retriever(\n", - " vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3\n", - ")\n", "nodes = retriever_rrf.retrieve(QUERY)\n", "for idx, item in enumerate(nodes):\n", " console.print(\n", " f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n", " )" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Even though the top retrieved chunks are relevant for the question, the key information lays in the paragraph after the first chunk:\n", + "\n", + "> If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it will come at the expense of worse quality results, especially in table structure recovery." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We next examine the fragments that immediately precede and follow the top‑retrieved chunk, so long as those neighbors remain within the same section, to preserve the semantic integrity of the context.\n", + "The generated answer is now accurate because it has been grounded in the necessary contextual information.\n", + "\n", + "πŸ’‘ In a production setting, it may be preferable to persist the parsed documents (i.e., `DoclingDocument` objects) as JSON in an object store or database and then fetch them when you need to traverse the document for context‑expansion scenarios. In this simplified example, however, we will query the OpenSearch index directly to obtain the required chunks." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
πŸ‘€: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n",
+       "have limited resources and complex tables?\n",
+       "πŸ€–: According to the tests described in the provided context, if you need to run Docling\n",
+       "in a very low-resource environment and are dealing with complex tables that require \n",
+       "high-quality table structure recovery, you should consider configuring the pypdfium \n",
+       "backend. The context mentions that while it is faster and more memory efficient than the\n",
+       "default docling-parse backend, it may come at the expense of worse quality results, \n",
+       "especially in table structure recovery. Therefore, for limited resources and complex \n",
+       "tables where quality is crucial, pypdfium would be a suitable choice despite its \n",
+       "potential drawbacks compared to the default backend.\n",
+       "
\n" + ], + "text/plain": [ + "πŸ‘€: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \n", + "have limited resources and complex tables?\n", + "πŸ€–: According to the tests described in the provided context, if you need to run Docling\n", + "in a very low-resource environment and are dealing with complex tables that require \n", + "high-quality table structure recovery, you should consider configuring the pypdfium \n", + "backend. The context mentions that while it is faster and more memory efficient than the\n", + "default docling-parse backend, it may come at the expense of worse quality results, \n", + "especially in table structure recovery. Therefore, for limited resources and complex \n", + "tables where quality is crucial, pypdfium would be a suitable choice despite its \n", + "potential drawbacks compared to the default backend.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "top_headings = nodes[0].metadata[\"headings\"]\n", + "top_text = nodes[0].text\n", + "\n", + "rdr = ElasticsearchReader(endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX)\n", + "docs = rdr.load_data(\n", + " field=text_field,\n", + " query={\n", + " \"query\": {\n", + " \"terms_set\": {\n", + " \"metadata.headings.keyword\": {\n", + " \"terms\": top_headings,\n", + " \"minimum_should_match_script\": {\"source\": \"params.num_terms\"},\n", + " }\n", + " }\n", + " }\n", + " },\n", + ")\n", + "ext_nodes = []\n", + "for idx, item in enumerate(docs):\n", + " if item.text == top_text:\n", + " ext_nodes.append(NodeWithScore(node=Node(text=item.text), score=1.0))\n", + " if idx > 0:\n", + " ext_nodes.append(\n", + " NodeWithScore(node=Node(text=docs[idx - 1].text), score=1.0)\n", + " )\n", + " if idx < len(docs) - 1:\n", + " ext_nodes.append(\n", + " NodeWithScore(node=Node(text=docs[idx + 1].text), score=1.0)\n", + " )\n", + " break\n", + "\n", + "synthesizer = get_response_synthesizer(llm=GEN_MODEL)\n", + "res = synthesizer.synthesize(query=QUERY, nodes=ext_nodes)\n", + "console.print(f\"πŸ‘€: {QUERY}\\nπŸ€–: {res.response.strip()}\")" + ] } ], "metadata": { diff --git a/docs/usage/advanced_options.md b/docs/usage/advanced_options.md index 92f33808..6059d64e 100644 --- a/docs/usage/advanced_options.md +++ b/docs/usage/advanced_options.md @@ -163,37 +163,3 @@ result = converter.convert(source) ## Limit resource usage You can limit the CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. The default setting is using 4 CPU threads. - - -## Use specific backend converters - -!!! note - - This section discusses directly invoking a [backend](../concepts/architecture.md), - i.e. using a low-level API. This should only be done when necessary. For most cases, - using a `DocumentConverter` (high-level API) as discussed in the sections above - should sufficeΒ β€”Β and is the recommended way. - -By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](supported_formats.md)). -You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](../examples/run_with_formats.py) example. -Alternatively, you can also use the specific backend that matches your document content. 
For instance, you can use `HTMLDocumentBackend` for HTML pages: - -```python -import urllib.request -from io import BytesIO -from docling.backend.html_backend import HTMLDocumentBackend -from docling.datamodel.base_models import InputFormat -from docling.datamodel.document import InputDocument - -url = "https://en.wikipedia.org/wiki/Duck" -text = urllib.request.urlopen(url).read() -in_doc = InputDocument( - path_or_stream=BytesIO(text), - format=InputFormat.HTML, - backend=HTMLDocumentBackend, - filename="duck.html", -) -backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text)) -dl_doc = backend.convert() -print(dl_doc.export_to_markdown()) -```