docs: update opensearch notebook and backend documentation (#2519)

* docs(opensearch): update the example notebook RAG with OpenSearch

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(uspto): remove direct usage of the backend class for conversion

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: remove direct usage of backends from documentation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Cesar Berrospi Ramis
2025-10-27 10:02:50 +01:00
committed by GitHub
parent 10c1f06b74
commit 9a6fdf936b
3 changed files with 536 additions and 307 deletions

@@ -431,130 +431,6 @@
"print(f\"Fetched and exported {doc_num} documents.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the backend converter (optional)\n",
"\n",
"- The custom backend converters `JatsDocumentBackend` and `PatentUsptoDocumentBackend` parse PMC articles and USPTO patents, respectively.\n",
"- As with any other backend, you can call `is_valid()` to check whether the backend supports the input document.\n",
"- Note that some XML sections in the original USPTO zip file, such as sequence listings, may not represent patents, so the backend reports them as invalid."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True\n",
"Document ipg241217-1.xml is a valid patent? True\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "316241ca89a843bda3170f2a5c76c639",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/4014 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 3928 patents out of 4014 XML files.\n"
]
}
],
"source": [
"from tqdm.notebook import tqdm\n",
"\n",
"from docling.backend.xml.jats_backend import JatsDocumentBackend\n",
"from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend\n",
"from docling.datamodel.base_models import InputFormat\n",
"from docling.datamodel.document import InputDocument\n",
"\n",
"# check PMC\n",
"in_doc = InputDocument(\n",
" path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\",\n",
" format=InputFormat.XML_JATS,\n",
" backend=JatsDocumentBackend,\n",
")\n",
"backend = JatsDocumentBackend(\n",
" in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"\n",
")\n",
"print(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\")\n",
"\n",
"# check USPTO\n",
"in_doc = InputDocument(\n",
" path_or_stream=TEMP_DIR / \"ipg241217-1.xml\",\n",
" format=InputFormat.XML_USPTO,\n",
" backend=PatentUsptoDocumentBackend,\n",
")\n",
"backend = PatentUsptoDocumentBackend(\n",
" in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\"\n",
")\n",
"print(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\")\n",
"\n",
"patent_valid = 0\n",
"pbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num)\n",
"for in_path in pbar:\n",
" in_doc = InputDocument(\n",
" path_or_stream=in_path,\n",
" format=InputFormat.XML_USPTO,\n",
" backend=PatentUsptoDocumentBackend,\n",
" )\n",
" backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)\n",
" patent_valid += int(backend.is_valid())\n",
"\n",
"print(f\"Found {patent_valid} patents out of {doc_num} XML files.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling `convert()` on the backend converts the input document into a `DoclingDocument`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Patent \"Semiconductor package\" has 19 claims\n"
]
}
],
"source": [
"doc = backend.convert()\n",
"\n",
"claims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\")\n",
"print(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"✏️ **Tip**: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic `DocumentConverter` object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in [Simple Conversion](#simple-conversion) is the recommended usage for the supported XML files."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -923,7 +799,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
"version": "3.12.10"
}
},
"nbformat": 4,