mirror of https://github.com/DS4SD/docling.git (synced 2025-12-08 12:48:28 +00:00)
docs: update opensearch notebook and backend documentation (#2519)
* docs(opensearch): update the example notebook RAG with OpenSearch

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs(uspto): remove direct usage of the backend class for conversion

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: remove direct usage of backends from documentation

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
committed by GitHub
parent 10c1f06b74
commit 9a6fdf936b
docs/examples/backend_xml_rag.ipynb (vendored, 126 changed lines)
@@ -431,130 +431,6 @@
 "print(f\"Fetched and exported {doc_num} documents.\")"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### Using the backend converter (optional)\n",
-"\n",
-"- The custom backend converters `PubMedDocumentBackend` and `PatentUsptoDocumentBackend` aim at handling the parsing of PMC articles and USPTO patents, respectively.\n",
-"- As any other backends, you can leverage the function `is_valid()` to check if the input document is supported by the this backend.\n",
-"- Note that some XML sections in the original USPTO zip file may not represent patents, like sequence listings, and therefore they will show as invalid by the backend."
-]
-},
-{
-"cell_type": "code",
-"execution_count": 11,
-"metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True\n",
-"Document ipg241217-1.xml is a valid patent? True\n"
-]
-},
-{
-"data": {
-"application/vnd.jupyter.widget-view+json": {
-"model_id": "316241ca89a843bda3170f2a5c76c639",
-"version_major": 2,
-"version_minor": 0
-},
-"text/plain": [
-" 0%| | 0/4014 [00:00<?, ?it/s]"
-]
-},
-"metadata": {},
-"output_type": "display_data"
-},
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"Found 3928 patents out of 4014 XML files.\n"
-]
-}
-],
-"source": [
-"from tqdm.notebook import tqdm\n",
-"\n",
-"from docling.backend.xml.jats_backend import JatsDocumentBackend\n",
-"from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend\n",
-"from docling.datamodel.base_models import InputFormat\n",
-"from docling.datamodel.document import InputDocument\n",
-"\n",
-"# check PMC\n",
-"in_doc = InputDocument(\n",
-"    path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\",\n",
-"    format=InputFormat.XML_JATS,\n",
-"    backend=JatsDocumentBackend,\n",
-")\n",
-"backend = JatsDocumentBackend(\n",
-"    in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"\n",
-")\n",
-"print(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\")\n",
-"\n",
-"# check USPTO\n",
-"in_doc = InputDocument(\n",
-"    path_or_stream=TEMP_DIR / \"ipg241217-1.xml\",\n",
-"    format=InputFormat.XML_USPTO,\n",
-"    backend=PatentUsptoDocumentBackend,\n",
-")\n",
-"backend = PatentUsptoDocumentBackend(\n",
-"    in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\"\n",
-")\n",
-"print(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\")\n",
-"\n",
-"patent_valid = 0\n",
-"pbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num)\n",
-"for in_path in pbar:\n",
-"    in_doc = InputDocument(\n",
-"        path_or_stream=in_path,\n",
-"        format=InputFormat.XML_USPTO,\n",
-"        backend=PatentUsptoDocumentBackend,\n",
-"    )\n",
-"    backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)\n",
-"    patent_valid += int(backend.is_valid())\n",
-"\n",
-"print(f\"Found {patent_valid} patents out of {doc_num} XML files.\")"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Calling the function `convert()` will convert the input document into a `DoclingDocument`"
-]
-},
-{
-"cell_type": "code",
-"execution_count": 12,
-"metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"Patent \"Semiconductor package\" has 19 claims\n"
-]
-}
-],
-"source": [
-"doc = backend.convert()\n",
-"\n",
-"claims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\")\n",
-"print(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims')"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"✏️ **Tip**: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic `DocumentConverter` object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in [Simple Conversion](#simple-conversion) is the recommended usage for the supported XML files."
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -923,7 +799,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.12.8"
+"version": "3.12.10"
 }
 },
 "nbformat": 4,