docs: add information extraction example (#2199)

* docs: add information extraction example

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update README

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* minor typo

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update README

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Panos Vagenas
2025-09-05 11:27:09 +02:00
committed by GitHub
parent b3d7542061
commit a9f41b088e
5 changed files with 689 additions and 6 deletions

@@ -29,17 +29,20 @@ Docling simplifies document processing, parsing diverse formats — including ad
 ## Features
 * 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
 * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
 * ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
 * 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
-* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
+* 🎙️ Audio support with Automatic Speech Recognition (ASR) models
 * 💻 Simple and convenient CLI
+### What's new
+* 📤 Structured [information extraction][extraction] \[🧪 beta\]
 ### Coming soon
 * 📝 Metadata extraction, including title, authors, references & language
@@ -150,3 +153,4 @@ The project was started by the AI for knowledge team at IBM Research Zurich.
 [supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
 [docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
 [integrations]: https://docling-project.github.io/docling/integrations/
+[extraction]: https://docling-project.github.io/docling/examples/extraction/

@@ -5,7 +5,7 @@
 "id": "3f312845",
 "metadata": {},
 "source": [
-"# 🛡️ Chunking and tokenizing HTML documents using Data Prep Kit and the Docling Transforms\n",
+"# Chunking & tokenization with Data Prep Kit\n",
 "\n",
 "This notebook demonstrates how to build a sequence of <a href=https://github.com/data-prep-kit/data-prep-kit> <b>DPK transforms</b> </a> for ingesting HTML documents using Docling2Parquet transforms and chunking them using Doc_Chunk transform. Both transforms are based on the <a href=https://docling-project.github.io/docling/> Docling library</a>. \n",
 "\n",

docs/examples/extraction.ipynb vendored Normal file
@@ -0,0 +1,675 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "15674164",
"metadata": {},
"source": [
"# Information extraction"
]
},
{
"cell_type": "markdown",
"id": "8d796485",
"metadata": {},
"source": [
"> 👉 **NOTE**: The extraction API is currently <i>in beta</i> and may change without prior notice."
]
},
{
"cell_type": "markdown",
"id": "932f12cd",
"metadata": {},
"source": [
"Docling can extract information, i.e. structured data, from unstructured documents.\n",
"\n",
"The user provides the desired data schema, also known as the *template*, as a string, a dictionary, or a Pydantic model (class or instance), and Docling returns\n",
"the extracted data in a standardized format, organized by page.\n",
"\n",
"Check out the subsections below for different usage scenarios."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f97abf2e",
"metadata": {},
"outputs": [],
"source": [
"from IPython import display\n",
"from pydantic import BaseModel, Field\n",
"from rich import print"
]
},
{
"cell_type": "markdown",
"id": "cda07006",
"metadata": {},
"source": [
"In this notebook, we will work with an example input image — let's quickly inspect it:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "15846b44",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<img src='https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg' height='1000'>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_path = (\n",
"    \"https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg\"\n",
")\n",
"display.HTML(f\"<img src='{file_path}' height='1000'>\")"
]
},
{
"cell_type": "markdown",
"id": "dbbc173c",
"metadata": {},
"source": [
"## Defining the extractor"
]
},
{
"cell_type": "markdown",
"id": "e9871c8d",
"metadata": {},
"source": [
"Let's first define our extractor:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7a8c6ff0",
"metadata": {},
"outputs": [],
"source": [
"from docling.datamodel.base_models import InputFormat\n",
"from docling.document_extractor import DocumentExtractor\n",
"\n",
"extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])"
]
},
{
"cell_type": "markdown",
"id": "e2b1933e",
"metadata": {},
"source": [
"Next, we look at different ways to define the data template."
]
},
{
"cell_type": "markdown",
"id": "20e62dfd",
"metadata": {},
"source": [
"## Using a string template"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4c5119b0",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/pva/work/github.com/DS4SD/docling/docling/document_extractor.py:143: UserWarning: The extract API is currently experimental and may change without prior notice.\n",
"Only PDF and image formats are supported.\n",
" return next(all_res)\n",
"You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.\n",
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span><span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m, \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template='{\"bill_no\": \"string\", \"total\": \"float\"}',\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "0da85c9c",
"metadata": {},
"source": [
"## Using a dict template"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e0df82f6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span><span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m, \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template={\n",
"        \"bill_no\": \"string\",\n",
"        \"total\": \"float\",\n",
"    },\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "925c1804",
"metadata": {},
"source": [
"## Using a Pydantic model template"
]
},
{
"cell_type": "markdown",
"id": "01aee19d",
"metadata": {},
"source": [
"First, we define the Pydantic model we want to use:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "69facb7b",
"metadata": {},
"outputs": [],
"source": [
"from typing import Optional\n",
"\n",
"\n",
"class Invoice(BaseModel):\n",
"    bill_no: str = Field(\n",
"        examples=[\"A123\", \"5414\"]\n",
"    )  # provide some examples, but no default value\n",
"    total: float = Field(\n",
"        default=10, examples=[20]\n",
"    )  # provide some examples and a default value\n",
"    tax_id: Optional[str] = Field(default=None, examples=[\"1234567890\"])"
]
},
{
"cell_type": "markdown",
"id": "fbcbce95",
"metadata": {},
"source": [
"The class itself can then be used directly as the template:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "81db63b1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'tax_id'</span>: <span style=\"color: #800080; text-decoration-color: #800080; font-style: italic\">None</span><span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": null}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m, \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m, \u001b[32m'tax_id'\u001b[0m: \u001b[3;35mNone\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": null\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template=Invoice,\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "2bd8736b",
"metadata": {},
"source": [
"Alternatively, a Pydantic model instance can be passed as the template, which allows us to override the default values.\n",
"\n",
"This is useful when we have context that is more relevant than the\n",
"default values predefined in the model definition.\n",
"\n",
"E.g. in the example below:\n",
"- `bill_no` and `total` are set from the values extracted from the document,\n",
"- there was no `tax_id` to extract, so the updated default we provided was applied"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b531a20d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span><span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span>, <span style=\"color: #008000; text-decoration-color: #008000\">'tax_id'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'42'</span><span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": \"42\"}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m, \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m, \u001b[32m'tax_id'\u001b[0m: \u001b[32m'42'\u001b[0m\u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": \"42\"\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template=Invoice(\n",
"        bill_no=\"41\",\n",
"        total=100,\n",
"        tax_id=\"42\",\n",
"    ),\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "dc38e143",
"metadata": {},
"source": [
"### Advanced Pydantic model"
]
},
{
"cell_type": "markdown",
"id": "5a1ee898",
"metadata": {},
"source": [
"Besides a flat template, we can in principle use any Pydantic model, which lets us reuse model definitions and capture\n",
"hierarchies:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "dca8289a",
"metadata": {},
"outputs": [],
"source": [
"class Contact(BaseModel):\n",
"    name: Optional[str] = Field(default=None, examples=[\"Smith\"])\n",
"    address: str = Field(default=\"123 Main St\", examples=[\"456 Elm St\"])\n",
"    postal_code: str = Field(default=\"12345\", examples=[\"67890\"])\n",
"    city: str = Field(default=\"Anytown\", examples=[\"Othertown\"])\n",
"    country: Optional[str] = Field(default=None, examples=[\"Canada\"])\n",
"\n",
"\n",
"class ExtendedInvoice(BaseModel):\n",
"    bill_no: str = Field(\n",
"        examples=[\"A123\", \"5414\"]\n",
"    )  # provide some examples, but not the actual value of the test sample\n",
"    total: float = Field(\n",
"        default=10, examples=[20]\n",
"    )  # provide a default value and some examples\n",
"    garden_work_hours: int = Field(default=1, examples=[2])\n",
"    sender: Contact = Field(default=Contact(), examples=[Contact()])\n",
"    receiver: Contact = Field(default=Contact(), examples=[Contact()])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "5896662d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
]
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[</span>\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtractedPageData</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">page_no</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">extracted_data</span>=<span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'bill_no'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'total'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'garden_work_hours'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">28</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'sender'</span>: <span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'name'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Robert Schneider'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'address'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Rue du Lac 1268'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'postal_code'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'2501'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'city'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Biel'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'country'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Switzerland'</span>\n",
" <span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'receiver'</span>: <span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'name'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Pia Rutschmann'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'address'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Marktgasse 28'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'postal_code'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'9400'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'city'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Rorschach'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'country'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'Switzerland'</span>\n",
" <span style=\"font-weight: bold\">}</span>\n",
" <span style=\"font-weight: bold\">}</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">raw_text</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'{\"bill_no\": \"3139\", \"total\": 3949.75, \"garden_work_hours\": 28, \"sender\": {\"name\": \"Robert </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">Schneider\", \"address\": \"Rue du Lac 1268\", \"postal_code\": \"2501\", \"city\": \"Biel\", \"country\": \"Switzerland\"}, </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">\"receiver\": {\"name\": \"Pia Rutschmann\", \"address\": \"Marktgasse 28\", \"postal_code\": \"9400\", \"city\": \"Rorschach\", </span>\n",
"<span style=\"color: #008000; text-decoration-color: #008000\">\"country\": \"Switzerland\"}}'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">errors</span>=<span style=\"font-weight: bold\">[]</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\n",
" \u001b[1;35mExtractedPageData\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mpage_no\u001b[0m=\u001b[1;36m1\u001b[0m,\n",
" \u001b[33mextracted_data\u001b[0m=\u001b[1m{\u001b[0m\n",
" \u001b[32m'bill_no'\u001b[0m: \u001b[32m'3139'\u001b[0m,\n",
" \u001b[32m'total'\u001b[0m: \u001b[1;36m3949.75\u001b[0m,\n",
" \u001b[32m'garden_work_hours'\u001b[0m: \u001b[1;36m28\u001b[0m,\n",
" \u001b[32m'sender'\u001b[0m: \u001b[1m{\u001b[0m\n",
" \u001b[32m'name'\u001b[0m: \u001b[32m'Robert Schneider'\u001b[0m,\n",
" \u001b[32m'address'\u001b[0m: \u001b[32m'Rue du Lac 1268'\u001b[0m,\n",
" \u001b[32m'postal_code'\u001b[0m: \u001b[32m'2501'\u001b[0m,\n",
" \u001b[32m'city'\u001b[0m: \u001b[32m'Biel'\u001b[0m,\n",
" \u001b[32m'country'\u001b[0m: \u001b[32m'Switzerland'\u001b[0m\n",
" \u001b[1m}\u001b[0m,\n",
" \u001b[32m'receiver'\u001b[0m: \u001b[1m{\u001b[0m\n",
" \u001b[32m'name'\u001b[0m: \u001b[32m'Pia Rutschmann'\u001b[0m,\n",
" \u001b[32m'address'\u001b[0m: \u001b[32m'Marktgasse 28'\u001b[0m,\n",
" \u001b[32m'postal_code'\u001b[0m: \u001b[32m'9400'\u001b[0m,\n",
" \u001b[32m'city'\u001b[0m: \u001b[32m'Rorschach'\u001b[0m,\n",
" \u001b[32m'country'\u001b[0m: \u001b[32m'Switzerland'\u001b[0m\n",
" \u001b[1m}\u001b[0m\n",
" \u001b[1m}\u001b[0m,\n",
" \u001b[33mraw_text\u001b[0m=\u001b[32m'\u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"bill_no\": \"3139\", \"total\": 3949.75, \"garden_work_hours\": 28, \"sender\": \u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"name\": \"Robert \u001b[0m\n",
"\u001b[32mSchneider\", \"address\": \"Rue du Lac 1268\", \"postal_code\": \"2501\", \"city\": \"Biel\", \"country\": \"Switzerland\"\u001b[0m\u001b[32m}\u001b[0m\u001b[32m, \u001b[0m\n",
"\u001b[32m\"receiver\": \u001b[0m\u001b[32m{\u001b[0m\u001b[32m\"name\": \"Pia Rutschmann\", \"address\": \"Marktgasse 28\", \"postal_code\": \"9400\", \"city\": \"Rorschach\", \u001b[0m\n",
"\u001b[32m\"country\": \"Switzerland\"\u001b[0m\u001b[32m}\u001b[0m\u001b[32m}\u001b[0m\u001b[32m'\u001b[0m,\n",
" \u001b[33merrors\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = extractor.extract(\n",
"    source=file_path,\n",
"    template=ExtendedInvoice,\n",
")\n",
"print(result.pages)"
]
},
{
"cell_type": "markdown",
"id": "e873f65d",
"metadata": {},
"source": [
"### Validating and loading the extracted data"
]
},
{
"cell_type": "markdown",
"id": "080991f6",
"metadata": {},
"source": [
"The generated response data can be easily validated and loaded via Pydantic:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a015bf60",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">ExtendedInvoice</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">bill_no</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'3139'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">total</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3949.75</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">garden_work_hours</span>=<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">28</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">sender</span>=<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">Contact</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">name</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Robert Schneider'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">address</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Rue du Lac 1268'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">postal_code</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'2501'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">city</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Biel'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">country</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Switzerland'</span>\n",
" <span style=\"font-weight: bold\">)</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">receiver</span>=<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">Contact</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">name</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Pia Rutschmann'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">address</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Marktgasse 28'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">postal_code</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'9400'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">city</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Rorschach'</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">country</span>=<span style=\"color: #008000; text-decoration-color: #008000\">'Switzerland'</span>\n",
" <span style=\"font-weight: bold\">)</span>\n",
"<span style=\"font-weight: bold\">)</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1;35mExtendedInvoice\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mbill_no\u001b[0m=\u001b[32m'3139'\u001b[0m,\n",
" \u001b[33mtotal\u001b[0m=\u001b[1;36m3949\u001b[0m\u001b[1;36m.75\u001b[0m,\n",
" \u001b[33mgarden_work_hours\u001b[0m=\u001b[1;36m28\u001b[0m,\n",
" \u001b[33msender\u001b[0m=\u001b[1;35mContact\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mname\u001b[0m=\u001b[32m'Robert Schneider'\u001b[0m,\n",
" \u001b[33maddress\u001b[0m=\u001b[32m'Rue du Lac 1268'\u001b[0m,\n",
" \u001b[33mpostal_code\u001b[0m=\u001b[32m'2501'\u001b[0m,\n",
" \u001b[33mcity\u001b[0m=\u001b[32m'Biel'\u001b[0m,\n",
" \u001b[33mcountry\u001b[0m=\u001b[32m'Switzerland'\u001b[0m\n",
" \u001b[1m)\u001b[0m,\n",
" \u001b[33mreceiver\u001b[0m=\u001b[1;35mContact\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mname\u001b[0m=\u001b[32m'Pia Rutschmann'\u001b[0m,\n",
" \u001b[33maddress\u001b[0m=\u001b[32m'Marktgasse 28'\u001b[0m,\n",
" \u001b[33mpostal_code\u001b[0m=\u001b[32m'9400'\u001b[0m,\n",
" \u001b[33mcity\u001b[0m=\u001b[32m'Rorschach'\u001b[0m,\n",
" \u001b[33mcountry\u001b[0m=\u001b[32m'Switzerland'\u001b[0m\n",
" \u001b[1m)\u001b[0m\n",
"\u001b[1m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"invoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data)\n",
"print(invoice)"
]
},
{
"cell_type": "markdown",
"id": "ae593926",
"metadata": {},
"source": [
"This way, we can go from completely unstructured data to a structured, developer-friendly representation:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "32844e40",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Invoice #<span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3139</span> was sent by Robert Schneider to Pia Rutschmann at Rue du Lac <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1268</span>.\n",
"</pre>\n"
],
"text/plain": [
"Invoice #\u001b[1;36m3139\u001b[0m was sent by Robert Schneider to Pia Rutschmann at Rue du Lac \u001b[1;36m1268\u001b[0m.\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\n",
" f\"Invoice #{invoice.bill_no} was sent by {invoice.sender.name} \"\n",
" f\"to {invoice.receiver.name} at {invoice.sender.address}.\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c1dbe41",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -4,6 +4,7 @@ Here some of our picks to get you started:
 - 🔀 conversion examples ranging from [simple conversion to Markdown](./minimal.py) and export of [figures](./export_figures.py) & [tables](./export_tables.py), to [VLM](./minimal_vlm_pipeline.py) and [audio](./minimal_asr_pipeline.py) pipelines
 - 💬 various RAG examples, e.g. based on [LangChain](./rag_langchain.ipynb), [LlamaIndex](./rag_llamaindex.ipynb), or [Haystack](./rag_haystack.ipynb), including [visual grounding](./visual_grounding.ipynb), and using different vector stores like [Milvus](./rag_milvus.ipynb), [Weaviate](./rag_weaviate.ipynb), or [Qdrant](./retrieval_qdrant.ipynb)
+- 📤 [{==\[:fontawesome-solid-flask:{ title="beta feature" } beta\]==} structured data extraction](./extraction.ipynb)
 - examples for ✍️ [serialization](./serialization.ipynb) and ✂️ [chunking](./hybrid_chunking.ipynb), including [user-defined customizations](./advanced_chunking_and_serialization.ipynb)
 - 🖼️ [picture annotations](./pictures_description.ipynb) and [enrichments](./enrich_doclingdocument.py)

@@ -37,6 +37,7 @@ theme:
 - content.tabs.link
 - content.code.annotate
 - content.code.copy
+- content.tooltips
 - announce.dismiss
 - navigation.footer
 - navigation.tabs
@@ -99,8 +100,8 @@ nav:
 - examples/serialization.ipynb
 - examples/hybrid_chunking.ipynb
 - examples/advanced_chunking_and_serialization.ipynb
-- ✂️ Data Preparation and Embedding Pipeline:
-- examples/dpk-ingest-chunck-tokenize.ipynb
+- 📤 Information extraction:
+- examples/extraction.ipynb
 - 🤖 RAG with AI dev frameworks:
 - examples/rag_haystack.ipynb
 - examples/rag_langchain.ipynb
@@ -114,6 +115,7 @@ nav:
 - "Formula enrichment": examples/develop_formula_understanding.py
 - "Enrich a DoclingDocument": examples/enrich_doclingdocument.py
 - 🗂️ More examples:
+- examples/dpk-ingest-chunk-tokenize.ipynb
 - examples/rag_milvus.ipynb
 - examples/rag_weaviate.ipynb
 - RAG with Granite [↗]: https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Docling_RAG.ipynb
@@ -155,6 +157,7 @@ nav:
 - CLI reference: reference/cli.md
 markdown_extensions:
+- pymdownx.critic
 - pymdownx.superfences
 - pymdownx.tabbed:
 alternate_style: true