docs: update opensearch notebook and backend documentation (#2519)

* docs(opensearch): update the example notebook RAG with OpenSearch

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(uspto): remove direct usage of the backend class for conversion

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: remove direct usage of backends from documentation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-12-08 12:48:28 +00:00 · 2025-10-27 10:02:50 +01:00
parent 10c1f06b74
commit 9a6fdf936b
3 changed files with 536 additions and 307 deletions
--- a/docs/examples/backend_xml_rag.ipynb
+++ b/docs/examples/backend_xml_rag.ipynb
@@ -431,130 +431,6 @@
    "print(f\"Fetched and exported {doc_num} documents.\")"
   ]
  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Using the backend converter (optional)\n",
-    "\n",
-    "- The custom backend converters `PubMedDocumentBackend` and `PatentUsptoDocumentBackend` aim at handling the parsing of PMC articles and USPTO patents, respectively.\n",
-    "- As any other backends, you can leverage the function `is_valid()` to check if the input document is supported by the this backend.\n",
-    "- Note that some XML sections in the original USPTO zip file may not represent patents, like sequence listings, and therefore they will show as invalid by the backend."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True\n",
-      "Document ipg241217-1.xml is a valid patent? True\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "316241ca89a843bda3170f2a5c76c639",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "  0%|          | 0/4014 [00:00<?, ?it/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Found 3928 patents out of 4014 XML files.\n"
-     ]
-    }
-   ],
-   "source": [
-    "from tqdm.notebook import tqdm\n",
-    "\n",
-    "from docling.backend.xml.jats_backend import JatsDocumentBackend\n",
-    "from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend\n",
-    "from docling.datamodel.base_models import InputFormat\n",
-    "from docling.datamodel.document import InputDocument\n",
-    "\n",
-    "# check PMC\n",
-    "in_doc = InputDocument(\n",
-    "    path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\",\n",
-    "    format=InputFormat.XML_JATS,\n",
-    "    backend=JatsDocumentBackend,\n",
-    ")\n",
-    "backend = JatsDocumentBackend(\n",
-    "    in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"\n",
-    ")\n",
-    "print(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\")\n",
-    "\n",
-    "# check USPTO\n",
-    "in_doc = InputDocument(\n",
-    "    path_or_stream=TEMP_DIR / \"ipg241217-1.xml\",\n",
-    "    format=InputFormat.XML_USPTO,\n",
-    "    backend=PatentUsptoDocumentBackend,\n",
-    ")\n",
-    "backend = PatentUsptoDocumentBackend(\n",
-    "    in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\"\n",
-    ")\n",
-    "print(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\")\n",
-    "\n",
-    "patent_valid = 0\n",
-    "pbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num)\n",
-    "for in_path in pbar:\n",
-    "    in_doc = InputDocument(\n",
-    "        path_or_stream=in_path,\n",
-    "        format=InputFormat.XML_USPTO,\n",
-    "        backend=PatentUsptoDocumentBackend,\n",
-    "    )\n",
-    "    backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)\n",
-    "    patent_valid += int(backend.is_valid())\n",
-    "\n",
-    "print(f\"Found {patent_valid} patents out of {doc_num} XML files.\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Calling the function `convert()` will convert the input document into a `DoclingDocument`"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Patent \"Semiconductor package\" has 19 claims\n"
-     ]
-    }
-   ],
-   "source": [
-    "doc = backend.convert()\n",
-    "\n",
-    "claims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\")\n",
-    "print(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "✏️ **Tip**: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic `DocumentConverter` object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in [Simple Conversion](#simple-conversion) is the recommended usage for the supported XML files."
-   ]
-  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -923,7 +799,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.12.8"
+   "version": "3.12.10"
  }
 },
 "nbformat": 4,