mirror of
https://github.com/DS4SD/docling.git
synced 2025-12-08 12:48:28 +00:00
docs: add information extraction example (#2199)
* docs: add information extraction example
  Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update README
  Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* minor typo
  Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
* update README
  Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
---------
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
@@ -37,9 +37,12 @@ Docling simplifies document processing, parsing diverse formats — including ad
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
 * 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
-* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
+* 🎙️ Audio support with Automatic Speech Recognition (ASR) models
 * 💻 Simple and convenient CLI

+### What's new
+* 📤 Structured [information extraction][extraction] \[🧪 beta\]
+
 ### Coming soon

 * 📝 Metadata extraction, including title, authors, references & language
@@ -150,3 +153,4 @@ The project was started by the AI for knowledge team at IBM Research Zurich.
 [supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
 [docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
 [integrations]: https://docling-project.github.io/docling/integrations/
+[extraction]: https://docling-project.github.io/docling/examples/extraction/
@@ -5,7 +5,7 @@
 "id": "3f312845",
 "metadata": {},
 "source": [
-"# 🛡️ Chunking and tokenizing HTML documents using Data Prep Kit and the Docling Transforms\n",
+"# Chunking & tokenization with Data Prep Kit\n",
 "\n",
 "This notebook demonstrates how to build a sequence of <a href=https://github.com/data-prep-kit/data-prep-kit> <b>DPK transforms</b> </a> for ingesting HTML documents using Docling2Parquet transforms and chunking them using Doc_Chunk transform. Both transforms are based on the <a href=https://docling-project.github.io/docling/> Docling library</a>. \n",
 "\n",
675
docs/examples/extraction.ipynb
vendored
Normal file
@@ -0,0 +1,675 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "15674164",
"metadata": {},
"source": [
"# Information extraction"
]
},
{
"cell_type": "markdown",
"id": "8d796485",
"metadata": {},
"source": [
"> 👉 **NOTE**: The extraction API is currently <i>in beta</i> and may change without prior notice."
]
},
{
"cell_type": "markdown",
"id": "932f12cd",
"metadata": {},
"source": [
"Docling can extract information, i.e. structured data, from unstructured documents.\n",
"\n",
"The user provides the desired data schema, also known as a *template*, either as a dictionary or as a Pydantic model, and Docling returns\n",
"the extracted data as a standardized output, organized by page.\n",
"\n",
"Check out the subsections below for different usage scenarios."
]
},
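To make the page-wise output concrete, here is a minimal sketch that uses a stand-in dataclass. `PageData` and `collect_by_page` are hypothetical names introduced for illustration; the dataclass only mirrors the fields that appear in the printed results of this notebook (`page_no`, `extracted_data`, `raw_text`, `errors`) and is not Docling's own class.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class PageData:
    """Hypothetical stand-in mirroring the fields printed in this notebook."""

    page_no: int
    extracted_data: Optional[dict[str, Any]] = None
    raw_text: Optional[str] = None
    errors: list[str] = field(default_factory=list)


def collect_by_page(pages: list[PageData]) -> dict[int, dict[str, Any]]:
    """Map page number to extracted data, skipping pages that reported errors."""
    return {
        page.page_no: page.extracted_data
        for page in pages
        if not page.errors and page.extracted_data is not None
    }


# Values copied from the example output shown in this notebook.
pages = [
    PageData(
        page_no=1,
        extracted_data={"bill_no": "3139", "total": 3949.75},
        raw_text='{"bill_no": "3139", "total": 3949.75}',
    )
]
by_page = collect_by_page(pages)
print(by_page)

# In the sample output, raw_text is the JSON form of extracted_data.
print(json.loads(pages[0].raw_text) == pages[0].extracted_data)
```

Keeping results keyed by page number makes it straightforward to handle multi-page inputs, where each page may yield its own record or its own errors.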
{
"cell_type": "code",
"execution_count": 1,
"id": "f97abf2e",
"metadata": {},
"outputs": [],
"source": [
"from IPython import display\n",
"from pydantic import BaseModel, Field\n",
"from rich import print"
]
},
{
"cell_type": "markdown",
"id": "cda07006",
"metadata": {},
"source": [
"In this notebook, we will work with an example input image. Let's take a quick look at it:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "15846b44",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<img src='https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg' height='1000'>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_path = (\n",
"    \"https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg\"\n",
")\n",
"display.HTML(f\"<img src='{file_path}' height='1000'>\")"
]
},
{
"cell_type": "markdown",
"id": "dbbc173c",
"metadata": {},
"source": [
"## Defining the extractor"
]
},
{
"cell_type": "markdown",
"id": "e9871c8d",
"metadata": {},
"source": [
"Let's first define our extractor:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7a8c6ff0",
"metadata": {},
"outputs": [],
"source": [
"from docling.datamodel.base_models import InputFormat\n",
"from docling.document_extractor import DocumentExtractor\n",
"\n",
"extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])"
]
},
{
"cell_type": "markdown",
"id": "e2b1933e",
"metadata": {},
"source": [
"Next, we look at different ways to define the data template."
]
},
{
"cell_type": "markdown",
"id": "20e62dfd",
"metadata": {},
"source": [
"## Using a string template"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4c5119b0",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/pva/work/github.com/DS4SD/docling/docling/document_extractor.py:143: UserWarning: The extract API is currently experimental and may change without prior notice.\n",
"Only PDF and image formats are supported.\n",
" return next(all_res)\n",
"You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.\n",
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span><span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m, \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template='{\"bill_no\": \"string\", \"total\": \"float\"}',\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "0da85c9c",
"metadata": {},
"source": [
"## Using a dict template"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e0df82f6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span><span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m, \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template={\n",
"        \"bill_no\": \"string\",\n",
"        \"total\": \"float\",\n",
"    },\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "925c1804",
"metadata": {},
"source": [
"## Using a Pydantic model template"
]
},
{
"cell_type": "markdown",
"id": "01aee19d",
"metadata": {},
"source": [
"First, we define the Pydantic model we want to use:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "69facb7b",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"\n",
"class Invoice(BaseModel):\n",
"    bill_no: str = Field(\n",
"        examples=[\"A123\", \"5414\"]\n",
"    )  # provide some examples, but no default value\n",
"    total: float = Field(\n",
"        default=10, examples=[20]\n",
"    )  # provide some examples and a default value\n",
"    tax_id: Optional[str] = Field(default=None, examples=[\"1234567890\"])"
]
},
{
"cell_type": "markdown",
"id": "fbcbce95",
"metadata": {},
"source": [
"The class itself can then be used directly as the template:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "81db63b1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'tax_id'</span>: <span style=\"color: #800080; text-decoration-color: #800080; font-style: italic\">None</span><span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": null}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m, \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m, \u001b[32m'tax_id'\u001b[0m: \u001b[3;35mNone\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": null\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template=Invoice,\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "2bd8736b",
"metadata": {},
"source": [
"Alternatively, a Pydantic model instance can be passed as the template, which makes it possible to override the default values.\n",
"\n",
"This is useful when we have context available that is more relevant than the\n",
"default values predefined in the model definition.\n",
"\n",
"In the example below:\n",
"- `bill_no` and `total` are set from the values extracted from the document,\n",
"- there was no `tax_id` to extract, so the updated default we provided was applied."
]
},
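The effect described above can be pictured as a plain dictionary merge. This is only a conceptual sketch: `apply_defaults` is a hypothetical helper introduced for illustration, the values are taken from the example in this section, and Docling's internal mechanism may differ.

```python
from typing import Any


def apply_defaults(defaults: dict[str, Any], extracted: dict[str, Any]) -> dict[str, Any]:
    """Prefer extracted values; fall back to the default where nothing was found."""
    found = {key: value for key, value in extracted.items() if value is not None}
    return {**defaults, **found}


# Defaults come from the template instance, extracted values from the document.
defaults = {"bill_no": "41", "total": 100, "tax_id": "42"}
extracted = {"bill_no": "3139", "total": 3949.75, "tax_id": None}

merged = apply_defaults(defaults, extracted)
print(merged)
# {'bill_no': '3139', 'total': 3949.75, 'tax_id': '42'}
```

Treating `None` as "nothing found" is an assumption of this sketch; a field whose legitimate value is `None` would need a separate sentinel.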
{
"cell_type": "code",
"execution_count": 8,
"id": "b531a20d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'tax_id'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'42'</span><span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": \"42\"}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m, \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m, \u001b[32m'tax_id'\u001b[0m: \u001b[32m'42'\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": \"42\"\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template=Invoice(\n",
"        bill_no=\"41\",\n",
"        total=100,\n",
"        tax_id=\"42\",\n",
"    ),\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "dc38e143",
"metadata": {},
"source": [
"### Advanced Pydantic model"
]
},
{
"cell_type": "markdown",
"id": "5a1ee898",
"metadata": {},
"source": [
"Besides a flat template, we can in principle use any Pydantic model, which enables reuse and lets us capture\n",
"hierarchies:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "dca8289a",
"metadata": {},
"outputs": [],
"source": [
"class Contact(BaseModel):\n",
"    name: Optional[str] = Field(default=None, examples=[\"Smith\"])\n",
"    address: str = Field(default=\"123 Main St\", examples=[\"456 Elm St\"])\n",
"    postal_code: str = Field(default=\"12345\", examples=[\"67890\"])\n",
"    city: str = Field(default=\"Anytown\", examples=[\"Othertown\"])\n",
"    country: Optional[str] = Field(default=None, examples=[\"Canada\"])\n",
"\n",
"\n",
"class ExtendedInvoice(BaseModel):\n",
"    bill_no: str = Field(\n",
"        examples=[\"A123\", \"5414\"]\n",
"    )  # provide some examples, but not the actual value of the test sample\n",
"    total: float = Field(\n",
"        default=10, examples=[20]\n",
"    )  # provide a default value and some examples\n",
"    garden_work_hours: int = Field(default=1, examples=[2])\n",
"    sender: Contact = Field(default=Contact(), examples=[Contact()])\n",
"    receiver: Contact = Field(default=Contact(), examples=[Contact()])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "5896662d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'garden_work_hours'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">28</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'sender'</span>: <span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'name'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Robert Schneider'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'address'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Rue du Lac 1268'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'postal_code'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'2501'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'city'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Biel'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'country'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Switzerland'</span>\n",
" <span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'receiver'</span>: <span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'name'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Pia Rutschmann'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'address'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Marktgasse 28'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'postal_code'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'9400'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'city'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Rorschach'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'country'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Switzerland'</span>\n",
" <span style=\"font-weight: bold\">}</span>\n",
" <span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75, \"garden_work_hours\": 28, \"sender\": {\"name\": \"Robert </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">Schneider\", \"address\": \"Rue du Lac 1268\", \"postal_code\": \"2501\", \"city\": \"Biel\", \"country\": \"Switzerland\"}, </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">\"receiver\": {\"name\": \"Pia Rutschmann\", \"address\": \"Marktgasse 28\", \"postal_code\": \"9400\", \"city\": \"Rorschach\", </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">\"country\": \"Switzerland\"}}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\n",
" \u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m,\n",
" \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m,\n",
" \u001b[32m'garden_work_hours'\u001b[0m: \u001b[1;36m28\u001b[0m,\n",
" \u001b[32m'sender'\u001b[0m: \u001b[1m{\u001b[0m\n",
" \u001b[32m'name'\u001b[0m: \u001b[32m'Robert Schneider'\u001b[0m,\n",
" \u001b[32m'address'\u001b[0m: \u001b[32m'Rue du Lac 1268'\u001b[0m,\n",
" \u001b[32m'postal_code'\u001b[0m: \u001b[32m'2501'\u001b[0m,\n",
" \u001b[32m'city'\u001b[0m: \u001b[32m'Biel'\u001b[0m,\n",
" \u001b[32m'country'\u001b[0m: \u001b[32m'Switzerland'\u001b[0m\n",
" \u001b[1m}\u001b[0m,\n",
" \u001b[32m'receiver'\u001b[0m: \u001b[1m{\u001b[0m\n",
" \u001b[32m'name'\u001b[0m: \u001b[32m'Pia Rutschmann'\u001b[0m,\n",
" \u001b[32m'address'\u001b[0m: \u001b[32m'Marktgasse 28'\u001b[0m,\n",
" \u001b[32m'postal_code'\u001b[0m: \u001b[32m'9400'\u001b[0m,\n",
" \u001b[32m'city'\u001b[0m: \u001b[32m'Rorschach'\u001b[0m,\n",
" \u001b[32m'country'\u001b[0m: \u001b[32m'Switzerland'\u001b[0m\n",
" \u001b[1m}\u001b[0m\n",
" \u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75, \"garden_work_hours\": 28, \"sender\": \u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"name\": \"Robert \u001b[0m\n",
"\u001b[32mSchneider\", \"address\": \"Rue du Lac 1268\", \"postal_code\": \"2501\", \"city\": \"Biel\", \"country\": \"Switzerland\"\u001b[0m\u001b[32m}\u001b[0m\u001b[32m, \u001b[0m\n",
"\u001b[32m\"receiver\": \u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"name\": \"Pia Rutschmann\", \"address\": \"Marktgasse 28\", \"postal_code\": \"9400\", \"city\": \"Rorschach\", \u001b[0m\n",
"\u001b[32m\"country\": \"Switzerland\"\u001b[0m\u001b[32m}\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template=ExtendedInvoice,\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "e873f65d",
"metadata": {},
"source": [
"### Validating and loading the extracted data"
]
},
{
"cell_type": "markdown",
"id": "080991f6",
"metadata": {},
"source": [
"The generated response data can be easily validated and loaded via Pydantic:"
]
},
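As a minimal sketch of what validation buys us, assuming Pydantic v2 (`model_validate` and `ValidationError` are Pydantic's own API; `Bill` is a simplified, hypothetical model used only for illustration and is not part of this notebook):

```python
from pydantic import BaseModel, ValidationError


class Bill(BaseModel):
    """Simplified, hypothetical model for illustration."""

    bill_no: str
    total: float


# A well-formed payload validates and loads into a typed object.
bill = Bill.model_validate({"bill_no": "3139", "total": 3949.75})
print(bill.total)

# A malformed payload raises ValidationError instead of passing through silently.
error_count = 0
try:
    Bill.model_validate({"bill_no": "3139", "total": "not a number"})
except ValidationError as exc:
    error_count = exc.error_count()
    print(f"{error_count} validation error(s)")
```

Catching `ValidationError` explicitly is one way to decide, page by page, whether an extracted record is usable or should be routed to manual review.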
{
"cell_type": "code",
"execution_count": 11,
"id": "a015bf60",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtendedInvoice</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">bill_no</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">total</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">garden_work_hours</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">28</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">sender</span>=<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">Contact</span><span style=\"font-weight: bold\">(</span>\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">name</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Robert Schneider'</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">address</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Rue du Lac 1268'</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">postal_code</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'2501'</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">city</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Biel'</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">country</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Switzerland'</span>\n",
|
||||
" <span style=\"font-weight: bold\">)</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">receiver</span>=<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">Contact</span><span style=\"font-weight: bold\">(</span>\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">name</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Pia Rutschmann'</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">address</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Marktgasse 28'</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">postal_code</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'9400'</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">city</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Rorschach'</span>,\n",
|
||||
" <span style=\"color: #808000; text-decoration-color: #808000\">country</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Switzerland'</span>\n",
|
||||
" <span style=\"font-weight: bold\">)</span>\n",
|
||||
"<span style=\"font-weight: bold\">)</span>\n",
|
||||
"</pre>\n"
|
||||
],
|
||||
"text/plain": [
|
||||
"\u001b[1;35mExtendedInvoice\u001b[0m\u001b[1m(\u001b[0m\n",
|
||||
" \u001b[33mbill_no\u001b[0m=\u001b[32m'3139'\u001b[0m,\n",
|
||||
" \u001b[33mtotal\u001b[0m=\u001b[1;36m3949\u001b[0m\u001b[1;36m.75\u001b[0m,\n",
|
||||
" \u001b[33mgarden_work_hours\u001b[0m=\u001b[1;36m28\u001b[0m,\n",
|
||||
" \u001b[33msender\u001b[0m=\u001b[1;35mContact\u001b[0m\u001b[1m(\u001b[0m\n",
|
||||
" \u001b[33mname\u001b[0m=\u001b[32m'Robert Schneider'\u001b[0m,\n",
|
||||
" \u001b[33maddress\u001b[0m=\u001b[32m'Rue du Lac 1268'\u001b[0m,\n",
|
||||
" \u001b[33mpostal_code\u001b[0m=\u001b[32m'2501'\u001b[0m,\n",
|
||||
" \u001b[33mcity\u001b[0m=\u001b[32m'Biel'\u001b[0m,\n",
|
||||
" \u001b[33mcountry\u001b[0m=\u001b[32m'Switzerland'\u001b[0m\n",
|
||||
" \u001b[1m)\u001b[0m,\n",
|
||||
" \u001b[33mreceiver\u001b[0m=\u001b[1;35mContact\u001b[0m\u001b[1m(\u001b[0m\n",
|
||||
" \u001b[33mname\u001b[0m=\u001b[32m'Pia Rutschmann'\u001b[0m,\n",
|
||||
" \u001b[33maddress\u001b[0m=\u001b[32m'Marktgasse 28'\u001b[0m,\n",
|
||||
" \u001b[33mpostal_code\u001b[0m=\u001b[32m'9400'\u001b[0m,\n",
|
||||
" \u001b[33mcity\u001b[0m=\u001b[32m'Rorschach'\u001b[0m,\n",
|
||||
" \u001b[33mcountry\u001b[0m=\u001b[32m'Switzerland'\u001b[0m\n",
|
||||
" \u001b[1m)\u001b[0m\n",
|
||||
"\u001b[1m)\u001b[0m\n"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"invoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data)\n",
|
||||
"print(invoice)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ae593926",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This takes us from completely unstructured data to a structured, developer-friendly representation:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "32844e40",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Invoice #<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3139</span> was sent by Robert Schneider to Pia Rutschmann at Rue du Lac <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1268</span>.\n",
|
||||
"</pre>\n"
|
||||
],
|
||||
"text/plain": [
|
||||
"Invoice #\u001b[1;36m3139\u001b[0m was sent by Robert Schneider to Pia Rutschmann at Rue du Lac \u001b[1;36m1268\u001b[0m.\n"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(\n",
|
||||
" f\"Invoice #{invoice.bill_no} was sent by {invoice.sender.name} \"\n",
|
||||
" f\"to {invoice.receiver.name} at {invoice.sender.address}.\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6c1dbe41",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": ".venv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.12.11"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
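The validation step in the notebook's final cells can be tried without running the extractor. The sketch below is a minimal, standalone illustration assuming Pydantic v2: the `Contact` and `ExtendedInvoice` field types are assumptions reconstructed from the printed output above (the notebook's actual definitions may use optional fields or defaults), and the `extracted_data` dict is typed in by hand from that output rather than produced by `extractor.extract`.

```python
from pydantic import BaseModel, ValidationError


# Assumed model shapes, inferred from the printed ExtendedInvoice output;
# the notebook's real definitions may differ (e.g. optional fields).
class Contact(BaseModel):
    name: str
    address: str
    postal_code: str
    city: str
    country: str


class ExtendedInvoice(BaseModel):
    bill_no: str
    total: float
    garden_work_hours: int
    sender: Contact
    receiver: Contact


# Hand-written stand-in for result.pages[0].extracted_data.
extracted_data = {
    "bill_no": "3139",
    "total": 3949.75,
    "garden_work_hours": 28,
    "sender": {
        "name": "Robert Schneider",
        "address": "Rue du Lac 1268",
        "postal_code": "2501",
        "city": "Biel",
        "country": "Switzerland",
    },
    "receiver": {
        "name": "Pia Rutschmann",
        "address": "Marktgasse 28",
        "postal_code": "9400",
        "city": "Rorschach",
        "country": "Switzerland",
    },
}

# model_validate checks types and builds nested Contact objects from the dicts.
invoice = ExtendedInvoice.model_validate(extracted_data)
print(f"Invoice #{invoice.bill_no}: {invoice.sender.name} -> {invoice.receiver.name}")

# An incomplete payload is rejected instead of silently producing a partial object.
try:
    ExtendedInvoice.model_validate({"bill_no": "3139"})
    incomplete_rejected = False
except ValidationError:
    incomplete_rejected = True
```

Validating before use means a malformed or partial model response surfaces as a `ValidationError` at the boundary, rather than as an attribute error later in application code.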
|
||||
1
docs/examples/index.md
vendored
@@ -4,6 +4,7 @@ Here some of our picks to get you started:
|
||||
|
||||
- 🔀 conversion examples ranging from [simple conversion to Markdown](./minimal.py) and export of [figures](./export_figures.py) & [tables](./export_tables.py), to [VLM](./minimal_vlm_pipeline.py) and [audio](./minimal_asr_pipeline.py) pipelines
|
||||
- 💬 various RAG examples, e.g. based on [LangChain](./rag_langchain.ipynb), [LlamaIndex](./rag_llamaindex.ipynb), or [Haystack](./rag_haystack.ipynb), including [visual grounding](./visual_grounding.ipynb), and using different vector stores like [Milvus](./rag_milvus.ipynb), [Weaviate](./rag_weaviate.ipynb), or [Qdrant](./retrieval_qdrant.ipynb)
|
||||
- 📤 [{==\[:fontawesome-solid-flask:{ title="beta feature" } beta\]==} structured data extraction](./extraction.ipynb)
|
||||
- examples for ✍️ [serialization](./serialization.ipynb) and ✂️ [chunking](./hybrid_chunking.ipynb), including [user-defined customizations](./advanced_chunking_and_serialization.ipynb)
|
||||
- 🖼️ [picture annotations](./pictures_description.ipynb) and [enrichments](./enrich_doclingdocument.py)
|
||||
|
||||
|
||||
@@ -37,6 +37,7 @@ theme:
|
||||
- content.tabs.link
|
||||
- content.code.annotate
|
||||
- content.code.copy
|
||||
- content.tooltips
|
||||
- announce.dismiss
|
||||
- navigation.footer
|
||||
- navigation.tabs
|
||||
@@ -99,8 +100,8 @@ nav:
|
||||
- examples/serialization.ipynb
|
||||
- examples/hybrid_chunking.ipynb
|
||||
- examples/advanced_chunking_and_serialization.ipynb
|
||||
- ✂️ Data Preparation and Embedding Pipeline:
|
||||
- examples/dpk-ingest-chunck-tokenize.ipynb
|
||||
- 📤 Information extraction:
|
||||
- examples/extraction.ipynb
|
||||
- 🤖 RAG with AI dev frameworks:
|
||||
- examples/rag_haystack.ipynb
|
||||
- examples/rag_langchain.ipynb
|
||||
@@ -114,6 +115,7 @@ nav:
|
||||
- "Formula enrichment": examples/develop_formula_understanding.py
|
||||
- "Enrich a DoclingDocument": examples/enrich_doclingdocument.py
|
||||
- 🗂️ More examples:
|
||||
- examples/dpk-ingest-chunk-tokenize.ipynb
|
||||
- examples/rag_milvus.ipynb
|
||||
- examples/rag_weaviate.ipynb
|
||||
- RAG with Granite [↗]: https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Docling_RAG.ipynb
|
||||
@@ -155,6 +157,7 @@ nav:
|
||||
- CLI reference: reference/cli.md
|
||||
|
||||
markdown_extensions:
|
||||
- pymdownx.critic
|
||||
- pymdownx.superfences
|
||||
- pymdownx.tabbed:
|
||||
alternate_style: true
|
||||
|
||||