Mirror of https://github.com/DS4SD/docling.git (synced 2025-08-02 15:32:30 +00:00)
Assign content_layer for page_headers and page_footers
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Commit: 6fa5bfd115
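For context, the commit marks page headers and page footers as page "furniture" rather than body content. A minimal sketch of the idea, assuming the ContentLayer enum and the PAGE_HEADER/PAGE_FOOTER labels exposed by docling_core.types.doc (the actual call sites inside the docling pipeline may differ):

```python
from docling_core.types.doc import ContentLayer, DocItemLabel

# Labels treated as furniture rather than body content
# (assumption for illustration; the commit wires this up inside the pipeline).
FURNITURE_LABELS = {DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER}


def assign_content_layer(item) -> None:
    """Route repeating page decorations out of the main body content."""
    if item.label in FURNITURE_LABELS:
        item.content_layer = ContentLayer.FURNITURE
    else:
        item.content_layer = ContentLayer.BODY
```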
.github/actions/setup-poetry/action.yml (vendored): 2 changes
@@ -8,7 +8,7 @@ runs:
   using: 'composite'
   steps:
     - name: Install poetry
-      run: pipx install poetry==1.8.3
+      run: pipx install poetry==1.8.5
       shell: bash
     - uses: actions/setup-python@v5
       with:
.github/workflows/checks.yml (vendored): 4 changes
@@ -6,11 +6,11 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ['3.9', '3.10', '3.11', '3.12']
+        python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
     steps:
       - uses: actions/checkout@v4
       - name: Install tesseract
-        run: sudo apt-get update && sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa libleptonica-dev libtesseract-dev pkg-config
+        run: sudo apt-get update && sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa tesseract-ocr-script-latn libleptonica-dev libtesseract-dev pkg-config
       - name: Set TESSDATA_PREFIX
         run: |
           echo "TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)" >> "$GITHUB_ENV"
.github/workflows/docs.yml (vendored): 5 changes
@@ -14,7 +14,10 @@ jobs:
       - uses: ./.github/actions/setup-poetry
       - name: Build docs
         run: poetry run mkdocs build --verbose --clean
+      - name: Make docs LLM ready
+        if: inputs.deploy
+        uses: demodrive-ai/llms-txt-action@ad720693843126e6a73910a667d0eba37c1dea4b
       - name: Build and push docs
         if: inputs.deploy
-        run: poetry run mkdocs gh-deploy --force
+        run: poetry run mkdocs gh-deploy --force --dirty
CHANGELOG.md: 46 changes
@@ -1,3 +1,49 @@
+## [v2.17.0](https://github.com/DS4SD/docling/releases/tag/v2.17.0) - 2025-01-28
+
+### Feature
+
+* **CLI:** Expose code and formula models in the CLI ([#820](https://github.com/DS4SD/docling/issues/820)) ([`6882e6c`](https://github.com/DS4SD/docling/commit/6882e6c38df30e4d4a1b83e01b13900ca7ea001f))
+* Add platform info to CLI version printout ([#816](https://github.com/DS4SD/docling/issues/816)) ([`95b293a`](https://github.com/DS4SD/docling/commit/95b293a72356f94c7076e3649be970c8a51121a3))
+* **ocr:** Expose `rec_keys_path` in RapidOcrOptions to support custom dictionaries ([#786](https://github.com/DS4SD/docling/issues/786)) ([`5332755`](https://github.com/DS4SD/docling/commit/53327552e83ced079ae50d8067ba7a8ce80cd9ad))
+* Introduce automatic language detection in TesseractOcrCliModel ([#800](https://github.com/DS4SD/docling/issues/800)) ([`3be2fb5`](https://github.com/DS4SD/docling/commit/3be2fb581fe5a2ebd5cec9c86bb22eb1dec6fd0f))
+
+### Fix
+
+* Fix single newline handling in MD backend ([#824](https://github.com/DS4SD/docling/issues/824)) ([`5aed9f8`](https://github.com/DS4SD/docling/commit/5aed9f8aeba1624ba1a721e2ed3ba4aceaa7a482))
+* Use file extension if filetype fails with PDF ([#827](https://github.com/DS4SD/docling/issues/827)) ([`adf6353`](https://github.com/DS4SD/docling/commit/adf635348365f82daa64e3f879076a7baf71edc0))
+* Parse html with omitted body tag ([#818](https://github.com/DS4SD/docling/issues/818)) ([`a112d7a`](https://github.com/DS4SD/docling/commit/a112d7a03512e8a00842a100416426254d6ecfc0))
+
+### Documentation
+
+* Document Docling JSON parsing ([#819](https://github.com/DS4SD/docling/issues/819)) ([`6875913`](https://github.com/DS4SD/docling/commit/6875913e34abacb8d71b5d31543adbf7b5bd5e92))
+* Add SSL verification error mitigation ([#821](https://github.com/DS4SD/docling/issues/821)) ([`5139b48`](https://github.com/DS4SD/docling/commit/5139b48e4e62bb061d956c132958ec2e6d88e40a))
+* **backend XML:** Do not delete temp file in notebook ([#817](https://github.com/DS4SD/docling/issues/817)) ([`4d41db3`](https://github.com/DS4SD/docling/commit/4d41db3f7abb86c8c65386bf94e7eb0bf22bb82b))
+* Typo ([#814](https://github.com/DS4SD/docling/issues/814)) ([`8a4ec77`](https://github.com/DS4SD/docling/commit/8a4ec77576b8a9fd60d0047939665d00cf93b4dd))
+* Added markdown headings to enable TOC in github pages ([#808](https://github.com/DS4SD/docling/issues/808)) ([`b885b2f`](https://github.com/DS4SD/docling/commit/b885b2fa3c2519c399ed4b9a3dd4c2f6f62235d1))
+* Description of supported formats and backends ([#788](https://github.com/DS4SD/docling/issues/788)) ([`c2ae1cc`](https://github.com/DS4SD/docling/commit/c2ae1cc4cab0f9e693c7ca460fe8afa5b515ee94))
+
+## [v2.16.0](https://github.com/DS4SD/docling/releases/tag/v2.16.0) - 2025-01-24
+
+### Feature
+
+* New document picture classifier ([#805](https://github.com/DS4SD/docling/issues/805)) ([`16a218d`](https://github.com/DS4SD/docling/commit/16a218d871c48fd9cc636b77f7b597dc40cbeeec))
+* Add Docling JSON ingestion ([#783](https://github.com/DS4SD/docling/issues/783)) ([`88a0e66`](https://github.com/DS4SD/docling/commit/88a0e66adc19238f57a942b0504926cdaeacd8cc))
+* Code and equation model for PDF and code blocks in markdown ([#752](https://github.com/DS4SD/docling/issues/752)) ([`3213b24`](https://github.com/DS4SD/docling/commit/3213b247ad6870ff984271f09f7720be68d9479b))
+* Add "auto" language for TesseractOcr ([#759](https://github.com/DS4SD/docling/issues/759)) ([`8543c22`](https://github.com/DS4SD/docling/commit/8543c22687fee40459d393bf4adcfc059712de02))
+
+### Fix
+
+* Added extraction of byte-images in excel ([#804](https://github.com/DS4SD/docling/issues/804)) ([`a458e29`](https://github.com/DS4SD/docling/commit/a458e298ca64da2c6df29d953e95645525817bed))
+* Update docling-parse-v2 backend version with new parsing fixes ([#769](https://github.com/DS4SD/docling/issues/769)) ([`670a08b`](https://github.com/DS4SD/docling/commit/670a08bdedda847ff3b6942bcaa1a2adef79afe2))
+
+### Documentation
+
+* Fix minor typos ([#801](https://github.com/DS4SD/docling/issues/801)) ([`c58f75d`](https://github.com/DS4SD/docling/commit/c58f75d0f75040e32820cc2915ec00755211c02f))
+* Add Azure RAG example ([#675](https://github.com/DS4SD/docling/issues/675)) ([`9020a93`](https://github.com/DS4SD/docling/commit/9020a934be35b0798c972eb77a22fb62ce654ca5))
+* Fix links between docs pages ([#697](https://github.com/DS4SD/docling/issues/697)) ([`c49b352`](https://github.com/DS4SD/docling/commit/c49b3526fb7b72e8007f785b1fcfdf58c2457756))
+* Fix correct Accelerator pipeline options in docs/examples/custom_convert.py ([#733](https://github.com/DS4SD/docling/issues/733)) ([`7686083`](https://github.com/DS4SD/docling/commit/768608351d40376c3504546f52e967195536b3d5))
+* Example to translate documents ([#739](https://github.com/DS4SD/docling/issues/739)) ([`f7e1cbf`](https://github.com/DS4SD/docling/commit/f7e1cbf629ae5f3e279296e72f656b7a453ab7a3))
+
 ## [v2.15.1](https://github.com/DS4SD/docling/releases/tag/v2.15.1) - 2025-01-10
 
 ### Fix
README.md: 24 changes
@@ -22,23 +22,25 @@
 [](https://opensource.org/licenses/MIT)
 [](https://pepy.tech/projects/docling)
 
-Docling parses documents and exports them to the desired format with ease and speed.
+Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 
 ## Features
 
-* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
-* 📑 Advanced PDF document understanding including page layout, reading order & table structures
-* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
-* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
-* 🔍 OCR support for scanned PDFs
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
+* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
+* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
+* 🔒 Local execution capabilities for sensitive data and air-gapped environments
+* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
+* 🔍 Extensive OCR support for scanned PDFs and images
 * 💻 Simple and convenient CLI
 
-Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty examples and unlock the full power of Docling!
-
 ### Coming soon
 
-* ♾️ Equation & code extraction
 * 📝 Metadata extraction, including title, authors, references & language
+* 📝 Inclusion of Visual Language Models ([SmolDocling](https://huggingface.co/blog/smolervlm#smoldocling))
+* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
+* 📝 Complex chemistry understanding (Molecular structures)
 
 ## Installation
 
@@ -120,3 +122,7 @@ For individual model usage, please refer to the model licenses found in the orig
 ## IBM ❤️ Open Source AI
 
 Docling has been brought to you by IBM.
+
+[supported_formats]: https://ds4sd.github.io/docling/supported_formats/
+[docling_document]: https://ds4sd.github.io/docling/concepts/docling_document/
+[integrations]: https://ds4sd.github.io/docling/integrations/
docling/backend/abstract_backend.py

@@ -27,7 +27,6 @@ class AbstractDocumentBackend(ABC):
     def supports_pagination(cls) -> bool:
         pass
 
-    @abstractmethod
     def unload(self):
         if isinstance(self.path_or_stream, BytesIO):
             self.path_or_stream.close()
docling/backend/asciidoc_backend.py

@@ -24,7 +24,6 @@ _log = logging.getLogger(__name__)
 
-
 class AsciiDocBackend(DeclarativeDocumentBackend):
 
     def __init__(self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]):
         super().__init__(in_doc, path_or_stream)
 
docling/backend/docling_parse_backend.py

@@ -163,7 +163,7 @@ class DoclingParsePageBackend(PdfPageBackend):
                 l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
             )
         else:
-            padbox = cropbox.to_bottom_left_origin(page_size.height)
+            padbox = cropbox.to_bottom_left_origin(page_size.height).model_copy()
             padbox.r = page_size.width - padbox.r
             padbox.t = page_size.height - padbox.t
 
docling/backend/docling_parse_v2_backend.py

@@ -178,7 +178,7 @@ class DoclingParseV2PageBackend(PdfPageBackend):
                 l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
             )
         else:
-            padbox = cropbox.to_bottom_left_origin(page_size.height)
+            padbox = cropbox.to_bottom_left_origin(page_size.height).model_copy()
             padbox.r = page_size.width - padbox.r
             padbox.t = page_size.height - padbox.t
 
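The `.model_copy()` added in both parse backends above most likely prevents the in-place edits of `padbox.r` and `padbox.t` from writing through to the object returned by `to_bottom_left_origin()`, which can be the cached cropbox itself. A small sketch of the pattern with a plain pydantic v2 model (the Box class and the page dimensions here are invented for illustration):

```python
from pydantic import BaseModel


class Box(BaseModel):
    l: float = 0
    r: float = 0
    t: float = 0
    b: float = 0


cropbox = Box(l=0, r=612, t=792, b=0)

# model_copy() yields an independent instance, so the in-place edits below
# no longer corrupt the original box.
padbox = cropbox.model_copy()
padbox.r = 612 - padbox.r
padbox.t = 792 - padbox.t

assert cropbox.r == 612 and cropbox.t == 792  # the original stays untouched
```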
docling/backend/html_backend.py

@@ -1,9 +1,9 @@
 import logging
 from io import BytesIO
 from pathlib import Path
-from typing import Set, Union
+from typing import Optional, Set, Union
 
-from bs4 import BeautifulSoup
+from bs4 import BeautifulSoup, Tag
 from docling_core.types.doc import (
     DocItemLabel,
     DoclingDocument,
@@ -24,7 +24,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
     def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
         super().__init__(in_doc, path_or_stream)
         _log.debug("About to init HTML backend...")
-        self.soup = None
+        self.soup: Optional[Tag] = None
         # HTML file:
         self.path_or_stream = path_or_stream
         # Initialise the parents for the hierarchy
@@ -78,17 +78,18 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
 
         if self.is_valid():
             assert self.soup is not None
+            content = self.soup.body or self.soup
             # Replace <br> tags with newline characters
-            for br in self.soup.body.find_all("br"):
+            for br in content.find_all("br"):
                 br.replace_with("\n")
-            doc = self.walk(self.soup.body, doc)
+            doc = self.walk(content, doc)
         else:
             raise RuntimeError(
                 f"Cannot convert doc with {self.document_hash} because the backend failed to init."
             )
         return doc
 
-    def walk(self, element, doc):
+    def walk(self, element: Tag, doc: DoclingDocument):
         try:
             # Iterate over elements in the body of the document
             for idx, element in enumerate(element.children):
@@ -105,7 +106,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
 
         return doc
 
-    def analyse_element(self, element, idx, doc):
+    def analyse_element(self, element: Tag, idx: int, doc: DoclingDocument):
         """
         if element.name!=None:
             _log.debug("\t"*self.level, idx, "\t", f"{element.name} ({self.level})")
@@ -135,7 +136,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
         else:
             self.walk(element, doc)
 
-    def get_direct_text(self, item):
+    def get_direct_text(self, item: Tag):
         """Get the direct text of the <li> element (ignoring nested lists)."""
         text = item.find(string=True, recursive=False)
         if isinstance(text, str):
@@ -144,7 +145,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
         return ""
 
     # Function to recursively extract text from all child nodes
-    def extract_text_recursively(self, item):
+    def extract_text_recursively(self, item: Tag):
         result = []
 
         if isinstance(item, str):
@@ -165,7 +166,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
 
         return "".join(result) + " "
 
-    def handle_header(self, element, idx, doc):
+    def handle_header(self, element: Tag, idx: int, doc: DoclingDocument):
         """Handles header tags (h1, h2, etc.)."""
         hlevel = int(element.name.replace("h", ""))
         slevel = hlevel - 1
@@ -207,7 +208,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
                 level=hlevel,
             )
 
-    def handle_code(self, element, idx, doc):
+    def handle_code(self, element: Tag, idx: int, doc: DoclingDocument):
         """Handles monospace code snippets (pre)."""
         if element.text is None:
             return
@@ -215,9 +216,9 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
         label = DocItemLabel.CODE
         if len(text) == 0:
             return
-        doc.add_text(parent=self.parents[self.level], label=label, text=text)
+        doc.add_code(parent=self.parents[self.level], text=text)
 
-    def handle_paragraph(self, element, idx, doc):
+    def handle_paragraph(self, element: Tag, idx: int, doc: DoclingDocument):
         """Handles paragraph tags (p)."""
         if element.text is None:
             return
@@ -227,7 +228,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
             return
         doc.add_text(parent=self.parents[self.level], label=label, text=text)
 
-    def handle_list(self, element, idx, doc):
+    def handle_list(self, element: Tag, idx: int, doc: DoclingDocument):
         """Handles list tags (ul, ol) and their list items."""
 
         if element.name == "ul":
@@ -249,7 +250,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
             self.parents[self.level + 1] = None
             self.level -= 1
 
-    def handle_listitem(self, element, idx, doc):
+    def handle_listitem(self, element: Tag, idx: int, doc: DoclingDocument):
         """Handles listitem tags (li)."""
         nested_lists = element.find(["ul", "ol"])
 
@@ -303,7 +304,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
         else:
             _log.warn("list-item has no text: ", element)
 
-    def handle_table(self, element, idx, doc):
+    def handle_table(self, element: Tag, idx: int, doc: DoclingDocument):
         """Handles table tags."""
 
         nested_tables = element.find("table")
@@ -376,7 +377,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
 
         doc.add_table(data=data, parent=self.parents[self.level])
 
-    def get_list_text(self, list_element, level=0):
+    def get_list_text(self, list_element: Tag, level=0):
         """Recursively extract text from <ul> or <ol> with proper indentation."""
         result = []
         bullet_char = "*"  # Default bullet character for unordered lists
@@ -402,7 +403,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
 
         return result
 
-    def extract_table_cell_text(self, cell):
+    def extract_table_cell_text(self, cell: Tag):
         """Extract text from a table cell, including lists with indents."""
         contains_lists = cell.find(["ul", "ol"])
         if contains_lists is None:
@@ -413,7 +414,7 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
             )
             return cell.text
 
-    def handle_figure(self, element, idx, doc):
+    def handle_figure(self, element: Tag, idx: int, doc: DoclingDocument):
         """Handles image tags (img)."""
 
         # Extract the image URI from the <img> tag
@@ -436,6 +437,6 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
             caption=fig_caption,
         )
 
-    def handle_image(self, element, idx, doc):
+    def handle_image(self, element: Tag, idx, doc: DoclingDocument):
         """Handles image tags (img)."""
         doc.add_picture(parent=self.parents[self.level], caption=None)
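The `content = self.soup.body or self.soup` fallback introduced above is what makes HTML fragments without an explicit <body> tag convertible (compare the changelog entry "Parse html with omitted body tag"): with Python's built-in html.parser, BeautifulSoup does not synthesize a <body>, so `soup.body` is None for such input. A quick illustration (the fragment is invented):

```python
from bs4 import BeautifulSoup

fragment = "<h1>Title</h1><p>No html or body tags in this snippet.</p>"
soup = BeautifulSoup(fragment, "html.parser")

print(soup.body)             # None: html.parser does not add a <body> element
content = soup.body or soup  # fall back to the document root, as the backend now does
print(content.find_all("p")[0].text)
```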
docling/backend/json/__init__.py (new file, empty)
docling/backend/json/docling_json_backend.py (new file): 58 lines
@@ -0,0 +1,58 @@
+from io import BytesIO
+from pathlib import Path
+from typing import Union
+
+from docling_core.types.doc import DoclingDocument
+from typing_extensions import override
+
+from docling.backend.abstract_backend import DeclarativeDocumentBackend
+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.document import InputDocument
+
+
+class DoclingJSONBackend(DeclarativeDocumentBackend):
+    @override
+    def __init__(
+        self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]
+    ) -> None:
+        super().__init__(in_doc, path_or_stream)
+
+        # given we need to store any actual conversion exception for raising it from
+        # convert(), this captures the successful result or the actual error in a
+        # mutually exclusive way:
+        self._doc_or_err = self._get_doc_or_err()
+
+    @override
+    def is_valid(self) -> bool:
+        return isinstance(self._doc_or_err, DoclingDocument)
+
+    @classmethod
+    @override
+    def supports_pagination(cls) -> bool:
+        return False
+
+    @classmethod
+    @override
+    def supported_formats(cls) -> set[InputFormat]:
+        return {InputFormat.JSON_DOCLING}
+
+    def _get_doc_or_err(self) -> Union[DoclingDocument, Exception]:
+        try:
+            json_data: Union[str, bytes]
+            if isinstance(self.path_or_stream, Path):
+                with open(self.path_or_stream, encoding="utf-8") as f:
+                    json_data = f.read()
+            elif isinstance(self.path_or_stream, BytesIO):
+                json_data = self.path_or_stream.getvalue()
+            else:
+                raise RuntimeError(f"Unexpected: {type(self.path_or_stream)=}")
+            return DoclingDocument.model_validate_json(json_data=json_data)
+        except Exception as e:
+            return e
+
+    @override
+    def convert(self) -> DoclingDocument:
+        if isinstance(self._doc_or_err, DoclingDocument):
+            return self._doc_or_err
+        else:
+            raise self._doc_or_err
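With the new backend registered for InputFormat.JSON_DOCLING, a DoclingDocument that was previously serialized to JSON can be re-ingested through the regular converter. A minimal usage sketch, assuming a file exported earlier (the path is hypothetical):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.docling.json")  # hypothetical previously exported file

doc = result.document  # DoclingDocument rebuilt from the stored JSON
print(doc.export_to_markdown()[:200])
```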
docling/backend/md_backend.py

@@ -3,32 +3,40 @@ import re
 import warnings
 from io import BytesIO
 from pathlib import Path
-from typing import Set, Union
+from typing import List, Optional, Set, Union
 
 import marko
+import marko.element
 import marko.ext
 import marko.ext.gfm
 import marko.inline
 from docling_core.types.doc import (
+    DocItem,
     DocItemLabel,
     DoclingDocument,
     DocumentOrigin,
     GroupLabel,
+    NodeItem,
     TableCell,
     TableData,
+    TextItem,
 )
 from marko import Markdown
 
 from docling.backend.abstract_backend import DeclarativeDocumentBackend
+from docling.backend.html_backend import HTMLDocumentBackend
 from docling.datamodel.base_models import InputFormat
 from docling.datamodel.document import InputDocument
 
 _log = logging.getLogger(__name__)
 
+_MARKER_BODY = "DOCLING_DOC_MD_HTML_EXPORT"
+_START_MARKER = f"#_#_{_MARKER_BODY}_START_#_#"
+_STOP_MARKER = f"#_#_{_MARKER_BODY}_STOP_#_#"
+
 
 class MarkdownDocumentBackend(DeclarativeDocumentBackend):
-    def shorten_underscore_sequences(self, markdown_text, max_length=10):
+    def shorten_underscore_sequences(self, markdown_text: str, max_length: int = 10):
         # This regex will match any sequence of underscores
         pattern = r"_+"
 
@@ -63,7 +71,8 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
 
         self.in_table = False
         self.md_table_buffer: list[str] = []
-        self.inline_text_buffer = ""
+        self.inline_texts: list[str] = []
+        self._html_blocks: int = 0
 
         try:
             if isinstance(self.path_or_stream, BytesIO):
@@ -90,13 +99,13 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             ) from e
         return
 
-    def close_table(self, doc=None):
+    def close_table(self, doc: DoclingDocument):
         if self.in_table:
             _log.debug("=== TABLE START ===")
             for md_table_row in self.md_table_buffer:
                 _log.debug(md_table_row)
             _log.debug("=== TABLE END ===")
-            tcells = []
+            tcells: List[TableCell] = []
             result_table = []
             for n, md_table_row in enumerate(self.md_table_buffer):
                 data = []
@@ -137,33 +146,42 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             self.in_table = False
             self.md_table_buffer = []  # clean table markdown buffer
             # Initialize Docling TableData
-            data = TableData(num_rows=num_rows, num_cols=num_cols, table_cells=tcells)
+            table_data = TableData(
+                num_rows=num_rows, num_cols=num_cols, table_cells=tcells
+            )
             # Populate
             for tcell in tcells:
-                data.table_cells.append(tcell)
+                table_data.table_cells.append(tcell)
             if len(tcells) > 0:
-                doc.add_table(data=data)
+                doc.add_table(data=table_data)
         return
 
-    def process_inline_text(self, parent_element, doc=None):
-        # self.inline_text_buffer += str(text_in)
-        txt = self.inline_text_buffer.strip()
+    def process_inline_text(
+        self, parent_element: Optional[NodeItem], doc: DoclingDocument
+    ):
+        txt = " ".join(self.inline_texts)
         if len(txt) > 0:
             doc.add_text(
                 label=DocItemLabel.PARAGRAPH,
                 parent=parent_element,
                 text=txt,
             )
-        self.inline_text_buffer = ""
+        self.inline_texts = []
 
-    def iterate_elements(self, element, depth=0, doc=None, parent_element=None):
+    def iterate_elements(
+        self,
+        element: marko.element.Element,
+        depth: int,
+        doc: DoclingDocument,
+        parent_element: Optional[NodeItem] = None,
+    ):
         # Iterates over all elements in the AST
         # Check for different element types and process relevant details
-        if isinstance(element, marko.block.Heading):
+        if isinstance(element, marko.block.Heading) and len(element.children) > 0:
             self.close_table(doc)
             self.process_inline_text(parent_element, doc)
             _log.debug(
-                f" - Heading level {element.level}, content: {element.children[0].children}"
+                f" - Heading level {element.level}, content: {element.children[0].children}"  # type: ignore
             )
             if element.level == 1:
                 doc_label = DocItemLabel.TITLE
@@ -172,10 +190,10 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
 
             # Header could have arbitrary inclusion of bold, italic or emphasis,
             # hence we need to traverse the tree to get full text of a header
-            strings = []
+            strings: List[str] = []
 
             # Define a recursive function to traverse the tree
-            def traverse(node):
+            def traverse(node: marko.block.BlockElement):
                 # Check if the node has a "children" attribute
                 if hasattr(node, "children"):
                     # If "children" is a list, continue traversal
@@ -194,24 +212,33 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             )
 
         elif isinstance(element, marko.block.List):
+            has_non_empty_list_items = False
+            for child in element.children:
+                if isinstance(child, marko.block.ListItem) and len(child.children) > 0:
+                    has_non_empty_list_items = True
+                    break
+
             self.close_table(doc)
             self.process_inline_text(parent_element, doc)
             _log.debug(f" - List {'ordered' if element.ordered else 'unordered'}")
-            list_label = GroupLabel.LIST
-            if element.ordered:
-                list_label = GroupLabel.ORDERED_LIST
-            parent_element = doc.add_group(
-                label=list_label, name=f"list", parent=parent_element
-            )
+            if has_non_empty_list_items:
+                label = GroupLabel.ORDERED_LIST if element.ordered else GroupLabel.LIST
+                parent_element = doc.add_group(
+                    label=label, name=f"list", parent=parent_element
+                )
 
-        elif isinstance(element, marko.block.ListItem):
+        elif isinstance(element, marko.block.ListItem) and len(element.children) > 0:
             self.close_table(doc)
             self.process_inline_text(parent_element, doc)
             _log.debug(" - List item")
 
-            snippet_text = str(element.children[0].children[0].children)
+            snippet_text = str(element.children[0].children[0].children)  # type: ignore
             is_numbered = False
-            if parent_element.label == GroupLabel.ORDERED_LIST:
+            if (
+                parent_element is not None
+                and isinstance(parent_element, DocItem)
+                and parent_element.label == GroupLabel.ORDERED_LIST
+            ):
                 is_numbered = True
             doc.add_list_item(
                 enumerated=is_numbered, parent=parent_element, text=snippet_text
@@ -221,89 +248,91 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             self.close_table(doc)
             self.process_inline_text(parent_element, doc)
             _log.debug(f" - Image with alt: {element.title}, url: {element.dest}")
-            doc.add_picture(parent=parent_element, caption=element.title)
 
-        elif isinstance(element, marko.block.Paragraph):
+            fig_caption: Optional[TextItem] = None
+            if element.title is not None and element.title != "":
+                fig_caption = doc.add_text(
+                    label=DocItemLabel.CAPTION, text=element.title
+                )
+
+            doc.add_picture(parent=parent_element, caption=fig_caption)
+
+        elif isinstance(element, marko.block.Paragraph) and len(element.children) > 0:
             self.process_inline_text(parent_element, doc)
 
         elif isinstance(element, marko.inline.RawText):
             _log.debug(f" - Paragraph (raw text): {element.children}")
-            snippet_text = str(element.children).strip()
+            snippet_text = element.children.strip()
             # Detect start of the table:
             if "|" in snippet_text:
                 # most likely part of the markdown table
                 self.in_table = True
                 if len(self.md_table_buffer) > 0:
-                    self.md_table_buffer[len(self.md_table_buffer) - 1] += str(
-                        snippet_text
-                    )
+                    self.md_table_buffer[len(self.md_table_buffer) - 1] += snippet_text
                 else:
                     self.md_table_buffer.append(snippet_text)
             else:
                 self.close_table(doc)
                 self.in_table = False
                 # most likely just inline text
-                self.inline_text_buffer += str(
-                    element.children
-                )  # do not strip an inline text, as it may contain important spaces
+                self.inline_texts.append(str(element.children))
 
         elif isinstance(element, marko.inline.CodeSpan):
             self.close_table(doc)
             self.process_inline_text(parent_element, doc)
             _log.debug(f" - Code Span: {element.children}")
             snippet_text = str(element.children).strip()
-            doc.add_text(
-                label=DocItemLabel.CODE, parent=parent_element, text=snippet_text
-            )
+            doc.add_code(parent=parent_element, text=snippet_text)
 
-        elif isinstance(element, marko.block.CodeBlock):
+        elif (
+            isinstance(element, (marko.block.CodeBlock, marko.block.FencedCode))
+            and len(element.children) > 0
+            and isinstance((first_child := element.children[0]), marko.inline.RawText)
+            and len(snippet_text := (first_child.children.strip())) > 0
+        ):
             self.close_table(doc)
             self.process_inline_text(parent_element, doc)
             _log.debug(f" - Code Block: {element.children}")
-            snippet_text = str(element.children[0].children).strip()
-            doc.add_text(
-                label=DocItemLabel.CODE, parent=parent_element, text=snippet_text
-            )
-
-        elif isinstance(element, marko.block.FencedCode):
-            self.close_table(doc)
-            self.process_inline_text(parent_element, doc)
-            _log.debug(f" - Code Block: {element.children}")
-            snippet_text = str(element.children[0].children).strip()
-            doc.add_text(
-                label=DocItemLabel.CODE, parent=parent_element, text=snippet_text
-            )
+            doc.add_code(parent=parent_element, text=snippet_text)
 
         elif isinstance(element, marko.inline.LineBreak):
-            self.process_inline_text(parent_element, doc)
             if self.in_table:
                 _log.debug("Line break in a table")
                 self.md_table_buffer.append("")
 
         elif isinstance(element, marko.block.HTMLBlock):
+            self._html_blocks += 1
             self.process_inline_text(parent_element, doc)
             self.close_table(doc)
             _log.debug("HTML Block: {}".format(element))
             if (
-                len(element.children) > 0
+                len(element.body) > 0
             ):  # If Marko doesn't return any content for HTML block, skip it
-                snippet_text = str(element.children).strip()
-                doc.add_text(
-                    label=DocItemLabel.CODE, parent=parent_element, text=snippet_text
-                )
+                html_block = element.body.strip()
+
+                # wrap in markers to enable post-processing in convert()
+                text_to_add = f"{_START_MARKER}{html_block}{_STOP_MARKER}"
+                doc.add_code(parent=parent_element, text=text_to_add)
         else:
             if not isinstance(element, str):
                 self.close_table(doc)
                 _log.debug("Some other element: {}".format(element))
 
+        processed_block_types = (
+            marko.block.ListItem,
+            marko.block.Heading,
+            marko.block.CodeBlock,
+            marko.block.FencedCode,
+            # marko.block.Paragraph,
+            marko.inline.RawText,
+        )
+
         # Iterate through the element's children (if any)
-        if not isinstance(element, marko.block.ListItem):
-            if not isinstance(element, marko.block.Heading):
-                if not isinstance(element, marko.block.FencedCode):
-                    # if not isinstance(element, marko.block.Paragraph):
-                    if hasattr(element, "children"):
-                        for child in element.children:
-                            self.iterate_elements(child, depth + 1, doc, parent_element)
+        if hasattr(element, "children") and not isinstance(
+            element, processed_block_types
+        ):
+            for child in element.children:
+                self.iterate_elements(child, depth + 1, doc, parent_element)
 
     def is_valid(self) -> bool:
         return self.valid
@@ -339,6 +368,42 @@ class MarkdownDocumentBackend(DeclarativeDocumentBackend):
             # Start iterating from the root of the AST
             self.iterate_elements(parsed_ast, 0, doc, None)
             self.process_inline_text(None, doc)  # handle last hanging inline text
+
+            # if HTML blocks were detected, export to HTML and delegate to HTML backend
+            if self._html_blocks > 0:
+
+                # export to HTML
+                html_backend_cls = HTMLDocumentBackend
+                html_str = doc.export_to_html()
+
+                def _restore_original_html(txt, regex):
+                    _txt, count = re.subn(regex, "", txt)
+                    if count != self._html_blocks:
+                        raise RuntimeError(
+                            "An internal error has occurred during Markdown conversion."
+                        )
+                    return _txt
+
+                # restore original HTML by removing previouly added markers
+                for regex in [
+                    rf"<pre>\s*<code>\s*{_START_MARKER}",
+                    rf"{_STOP_MARKER}\s*</code>\s*</pre>",
+                ]:
+                    html_str = _restore_original_html(txt=html_str, regex=regex)
+                self._html_blocks = 0
+
+                # delegate to HTML backend
+                stream = BytesIO(bytes(html_str, encoding="utf-8"))
+                in_doc = InputDocument(
+                    path_or_stream=stream,
+                    format=InputFormat.HTML,
+                    backend=html_backend_cls,
+                    filename=self.file.name,
+                )
+                html_backend_obj = html_backend_cls(
+                    in_doc=in_doc, path_or_stream=stream
+                )
+                doc = html_backend_obj.convert()
         else:
             raise RuntimeError(
                 f"Cannot convert md with {self.document_hash} because the backend failed to init."
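The marker mechanism added above is worth spelling out: every raw HTML block found in the Markdown is stored as a code item wrapped in _START_MARKER/_STOP_MARKER, the partially built document is exported to HTML, the two regexes strip the <pre><code> wrappers that the export placed around those items (restoring the embedded HTML verbatim), and the resulting HTML string is handed to HTMLDocumentBackend for a second parsing pass. A stripped-down sketch of just the marker round trip (marker strings taken from the diff, the surrounding HTML is invented):

```python
import re

_MARKER_BODY = "DOCLING_DOC_MD_HTML_EXPORT"
_START_MARKER = f"#_#_{_MARKER_BODY}_START_#_#"
_STOP_MARKER = f"#_#_{_MARKER_BODY}_STOP_#_#"

# Roughly what the intermediate HTML export of such a code item looks like:
exported = (
    "<h1>Doc</h1>"
    f"<pre><code>{_START_MARKER}<table><tr><td>1</td></tr></table>{_STOP_MARKER}</code></pre>"
)

# Removing the wrappers restores the original embedded HTML for the second pass.
for regex in [
    rf"<pre>\s*<code>\s*{_START_MARKER}",
    rf"{_STOP_MARKER}\s*</code>\s*</pre>",
]:
    exported, count = re.subn(regex, "", exported)
    assert count == 1

print(exported)  # <h1>Doc</h1><table><tr><td>1</td></tr></table>
```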
docling/backend/msexcel_backend.py

@@ -26,6 +26,7 @@ _log = logging.getLogger(__name__)
 
 from typing import Any, List
 
+from PIL import Image as PILImage
 from pydantic import BaseModel
 
 
@@ -44,7 +45,6 @@ class ExcelTable(BaseModel):
 
-
 class MsExcelDocumentBackend(DeclarativeDocumentBackend):
 
     def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
         super().__init__(in_doc, path_or_stream)
 
@@ -326,49 +326,61 @@ class MsExcelDocumentBackend(DeclarativeDocumentBackend):
         self, doc: DoclingDocument, sheet: Worksheet
     ) -> DoclingDocument:
 
-        # FIXME: mypy does not agree with _images ...
+        # Iterate over byte images in the sheet
+        for idx, image in enumerate(sheet._images):  # type: ignore
+
+            try:
+                pil_image = PILImage.open(image.ref)
+
+                doc.add_picture(
+                    parent=self.parents[0],
+                    image=ImageRef.from_pil(image=pil_image, dpi=72),
+                    caption=None,
+                )
+            except:
+                _log.error("could not extract the image from excel sheets")
+
         """
-        # Iterate over images in the sheet
-        for idx, image in enumerate(sheet._images):  # Access embedded images
-
-            image_bytes = BytesIO(image.ref.blob)
-            pil_image = Image.open(image_bytes)
-
-            doc.add_picture(
-                parent=self.parents[0],
-                image=ImageRef.from_pil(image=pil_image, dpi=72),
-                caption=None,
-            )
-        """
-
-        # FIXME: mypy does not agree with _charts ...
-        """
-        for idx, chart in enumerate(sheet._charts):  # Access embedded charts
-            chart_path = f"chart_{idx + 1}.png"
-            _log.info(
-                f"Chart found, but dynamic rendering is required for: {chart_path}"
-            )
-
-            _log.info(f"Chart {idx + 1}:")
-
-            # Chart type
-            _log.info(f"Type: {type(chart).__name__}")
-
-            # Title
-            if chart.title:
-                _log.info(f"Title: {chart.title}")
-            else:
-                _log.info("No title")
-
-            # Data series
-            for series in chart.series:
-                _log.info(" => series ...")
-                _log.info(f"Data Series: {series.title}")
-                _log.info(f"Values: {series.values}")
-                _log.info(f"Categories: {series.categories}")
-
-            # Position
-            # _log.info(f"Anchor Cell: {chart.anchor}")
+        for idx, chart in enumerate(sheet._charts):  # type: ignore
+            try:
+                chart_path = f"chart_{idx + 1}.png"
+                _log.info(
+                    f"Chart found, but dynamic rendering is required for: {chart_path}"
+                )
+
+                _log.info(f"Chart {idx + 1}:")
+
+                # Chart type
+                # _log.info(f"Type: {type(chart).__name__}")
+                print(f"Type: {type(chart).__name__}")
+
+                # Extract series data
+                for series_idx, series in enumerate(chart.series):
+                    #_log.info(f"Series {series_idx + 1}:")
+                    print(f"Series {series_idx + 1} type: {type(series).__name__}")
+                    #print(f"x-values: {series.xVal}")
+                    #print(f"y-values: {series.yVal}")
+
+                    print(f"xval type: {type(series.xVal).__name__}")
+
+                    xvals = []
+                    for _ in series.xVal.numLit.pt:
+                        print(f"xval type: {type(_).__name__}")
+                        if hasattr(_, 'v'):
+                            xvals.append(_.v)
+
+                    print(f"x-values: {xvals}")
+
+                    yvals = []
+                    for _ in series.yVal:
+                        if hasattr(_, 'v'):
+                            yvals.append(_.v)
+
+                    print(f"y-values: {yvals}")
+            except Exception as exc:
+                print(exc)
+                continue
         """
 
         return doc
docling/backend/mspowerpoint_backend.py

@@ -98,21 +98,28 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
 
         return doc
 
-    def generate_prov(self, shape, slide_ind, text=""):
-        left = shape.left
-        top = shape.top
-        width = shape.width
-        height = shape.height
+    def generate_prov(
+        self, shape, slide_ind, text="", slide_size=Size(width=1, height=1)
+    ):
+        if shape.left:
+            left = shape.left
+            top = shape.top
+            width = shape.width
+            height = shape.height
+        else:
+            left = 0
+            top = 0
+            width = slide_size.width
+            height = slide_size.height
         shape_bbox = [left, top, left + width, top + height]
         shape_bbox = BoundingBox.from_tuple(shape_bbox, origin=CoordOrigin.BOTTOMLEFT)
-        # prov = [{"bbox": shape_bbox, "page": parent_slide, "span": [0, len(text)]}]
         prov = ProvenanceItem(
             page_no=slide_ind + 1, charspan=[0, len(text)], bbox=shape_bbox
         )
 
         return prov
 
-    def handle_text_elements(self, shape, parent_slide, slide_ind, doc):
+    def handle_text_elements(self, shape, parent_slide, slide_ind, doc, slide_size):
         is_a_list = False
         is_list_group_created = False
         enum_list_item_value = 0
@@ -121,7 +128,7 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
         list_text = ""
         list_label = GroupLabel.LIST
         doc_label = DocItemLabel.LIST_ITEM
-        prov = self.generate_prov(shape, slide_ind, shape.text.strip())
+        prov = self.generate_prov(shape, slide_ind, shape.text.strip(), slide_size)
 
         # Identify if shape contains lists
         for paragraph in shape.text_frame.paragraphs:
@@ -270,18 +277,17 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
         )
         return
 
-    def handle_pictures(self, shape, parent_slide, slide_ind, doc):
-        # Get the image bytes
-        image = shape.image
-        image_bytes = image.blob
-        im_dpi, _ = image.dpi
-
+    def handle_pictures(self, shape, parent_slide, slide_ind, doc, slide_size):
         # Open it with PIL
         try:
+            # Get the image bytes
+            image = shape.image
+            image_bytes = image.blob
+            im_dpi, _ = image.dpi
             pil_image = Image.open(BytesIO(image_bytes))
 
             # shape has picture
-            prov = self.generate_prov(shape, slide_ind, "")
+            prov = self.generate_prov(shape, slide_ind, "", slide_size)
             doc.add_picture(
                 parent=parent_slide,
                 image=ImageRef.from_pil(image=pil_image, dpi=im_dpi),
@@ -292,13 +298,13 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
             _log.warning(f"Warning: image cannot be loaded by Pillow: {e}")
         return
 
-    def handle_tables(self, shape, parent_slide, slide_ind, doc):
+    def handle_tables(self, shape, parent_slide, slide_ind, doc, slide_size):
         # Handling tables, images, charts
         if shape.has_table:
             table = shape.table
             table_xml = shape._element
 
-            prov = self.generate_prov(shape, slide_ind, "")
+            prov = self.generate_prov(shape, slide_ind, "", slide_size)
 
             num_cols = 0
             num_rows = len(table.rows)
@@ -375,17 +381,19 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
             name=f"slide-{slide_ind}", label=GroupLabel.CHAPTER, parent=parents[0]
         )
 
-        size = Size(width=slide_width, height=slide_height)
-        parent_page = doc.add_page(page_no=slide_ind + 1, size=size)
+        slide_size = Size(width=slide_width, height=slide_height)
+        parent_page = doc.add_page(page_no=slide_ind + 1, size=slide_size)
 
-        def handle_shapes(shape, parent_slide, slide_ind, doc):
-            handle_groups(shape, parent_slide, slide_ind, doc)
+        def handle_shapes(shape, parent_slide, slide_ind, doc, slide_size):
+            handle_groups(shape, parent_slide, slide_ind, doc, slide_size)
             if shape.has_table:
                 # Handle Tables
-                self.handle_tables(shape, parent_slide, slide_ind, doc)
+                self.handle_tables(shape, parent_slide, slide_ind, doc, slide_size)
             if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                 # Handle Pictures
-                self.handle_pictures(shape, parent_slide, slide_ind, doc)
+                self.handle_pictures(
+                    shape, parent_slide, slide_ind, doc, slide_size
+                )
             # If shape doesn't have any text, move on to the next shape
             if not hasattr(shape, "text"):
                 return
@@ -397,16 +405,20 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
                 _log.warning("Warning: shape has text but not text_frame")
                 return
             # Handle other text elements, including lists (bullet lists, numbered lists)
-            self.handle_text_elements(shape, parent_slide, slide_ind, doc)
+            self.handle_text_elements(
+                shape, parent_slide, slide_ind, doc, slide_size
+            )
             return
 
-        def handle_groups(shape, parent_slide, slide_ind, doc):
+        def handle_groups(shape, parent_slide, slide_ind, doc, slide_size):
             if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
                 for groupedshape in shape.shapes:
-                    handle_shapes(groupedshape, parent_slide, slide_ind, doc)
+                    handle_shapes(
+                        groupedshape, parent_slide, slide_ind, doc, slide_size
+                    )
 
         # Loop through each shape in the slide
         for shape in slide.shapes:
-            handle_shapes(shape, parent_slide, slide_ind, doc)
+            handle_shapes(shape, parent_slide, slide_ind, doc, slide_size)
 
         return doc
@@ -2,21 +2,28 @@ import logging
import re
from io import BytesIO
from pathlib import Path
- from typing import Set, Union
+ from typing import Any, Optional, Union

- import docx
from docling_core.types.doc import (
    DocItemLabel,
    DoclingDocument,
    DocumentOrigin,
    GroupLabel,
    ImageRef,
+     NodeItem,
    TableCell,
    TableData,
)
+ from docx import Document
+ from docx.document import Document as DocxDocument
+ from docx.oxml.table import CT_Tc
+ from docx.oxml.xmlchemy import BaseOxmlElement
+ from docx.table import Table, _Cell
+ from docx.text.paragraph import Paragraph
from lxml import etree
from lxml.etree import XPath
from PIL import Image, UnidentifiedImageError
+ from typing_extensions import override

from docling.backend.abstract_backend import DeclarativeDocumentBackend
from docling.datamodel.base_models import InputFormat
@ -26,8 +33,10 @@ _log = logging.getLogger(__name__)
|
|||||||
|
|
||||||
|
|
||||||
class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
||||||
|
@override
|
||||||
def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
|
def __init__(
|
||||||
|
self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]
|
||||||
|
) -> None:
|
||||||
super().__init__(in_doc, path_or_stream)
|
super().__init__(in_doc, path_or_stream)
|
||||||
self.XML_KEY = (
|
self.XML_KEY = (
|
||||||
"{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val"
|
"{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val"
|
||||||
@ -37,19 +46,19 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
}
|
}
|
||||||
# self.initialise(path_or_stream)
|
# self.initialise(path_or_stream)
|
||||||
# Word file:
|
# Word file:
|
||||||
self.path_or_stream = path_or_stream
|
self.path_or_stream: Union[BytesIO, Path] = path_or_stream
|
||||||
self.valid = False
|
self.valid: bool = False
|
||||||
# Initialise the parents for the hierarchy
|
# Initialise the parents for the hierarchy
|
||||||
self.max_levels = 10
|
self.max_levels: int = 10
|
||||||
self.level_at_new_list = None
|
self.level_at_new_list: Optional[int] = None
|
||||||
self.parents = {} # type: ignore
|
self.parents: dict[int, Optional[NodeItem]] = {}
|
||||||
for i in range(-1, self.max_levels):
|
for i in range(-1, self.max_levels):
|
||||||
self.parents[i] = None
|
self.parents[i] = None
|
||||||
|
|
||||||
self.level = 0
|
self.level = 0
|
||||||
self.listIter = 0
|
self.listIter = 0
|
||||||
|
|
||||||
self.history = {
|
self.history: dict[str, Any] = {
|
||||||
"names": [None],
|
"names": [None],
|
||||||
"levels": [None],
|
"levels": [None],
|
||||||
"numids": [None],
|
"numids": [None],
|
||||||
@ -59,9 +68,9 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
self.docx_obj = None
|
self.docx_obj = None
|
||||||
try:
|
try:
|
||||||
if isinstance(self.path_or_stream, BytesIO):
|
if isinstance(self.path_or_stream, BytesIO):
|
||||||
self.docx_obj = docx.Document(self.path_or_stream)
|
self.docx_obj = Document(self.path_or_stream)
|
||||||
elif isinstance(self.path_or_stream, Path):
|
elif isinstance(self.path_or_stream, Path):
|
||||||
self.docx_obj = docx.Document(str(self.path_or_stream))
|
self.docx_obj = Document(str(self.path_or_stream))
|
||||||
|
|
||||||
self.valid = True
|
self.valid = True
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
@ -69,13 +78,16 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
f"MsPowerpointDocumentBackend could not load document with hash {self.document_hash}"
|
f"MsPowerpointDocumentBackend could not load document with hash {self.document_hash}"
|
||||||
) from e
|
) from e
|
||||||
|
|
||||||
|
@override
|
||||||
def is_valid(self) -> bool:
|
def is_valid(self) -> bool:
|
||||||
return self.valid
|
return self.valid
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
|
@override
|
||||||
def supports_pagination(cls) -> bool:
|
def supports_pagination(cls) -> bool:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
@override
|
||||||
def unload(self):
|
def unload(self):
|
||||||
if isinstance(self.path_or_stream, BytesIO):
|
if isinstance(self.path_or_stream, BytesIO):
|
||||||
self.path_or_stream.close()
|
self.path_or_stream.close()
|
||||||
@ -83,11 +95,17 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
self.path_or_stream = None
|
self.path_or_stream = None
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def supported_formats(cls) -> Set[InputFormat]:
|
@override
|
||||||
|
def supported_formats(cls) -> set[InputFormat]:
|
||||||
return {InputFormat.DOCX}
|
return {InputFormat.DOCX}
|
||||||
|
|
||||||
|
@override
|
||||||
def convert(self) -> DoclingDocument:
|
def convert(self) -> DoclingDocument:
|
||||||
# Parses the DOCX into a structured document model.
|
"""Parses the DOCX into a structured document model.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
The parsed document.
|
||||||
|
"""
|
||||||
|
|
||||||
origin = DocumentOrigin(
|
origin = DocumentOrigin(
|
||||||
filename=self.file.name or "file",
|
filename=self.file.name or "file",
|
||||||
@ -105,23 +123,29 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
f"Cannot convert doc with {self.document_hash} because the backend failed to init."
|
f"Cannot convert doc with {self.document_hash} because the backend failed to init."
|
||||||
)
|
)
|
||||||
|
|
||||||
def update_history(self, name, level, numid, ilevel):
|
def update_history(
|
||||||
|
self,
|
||||||
|
name: str,
|
||||||
|
level: Optional[int],
|
||||||
|
numid: Optional[int],
|
||||||
|
ilevel: Optional[int],
|
||||||
|
):
|
||||||
self.history["names"].append(name)
|
self.history["names"].append(name)
|
||||||
self.history["levels"].append(level)
|
self.history["levels"].append(level)
|
||||||
|
|
||||||
self.history["numids"].append(numid)
|
self.history["numids"].append(numid)
|
||||||
self.history["indents"].append(ilevel)
|
self.history["indents"].append(ilevel)
|
||||||
|
|
||||||
def prev_name(self):
|
def prev_name(self) -> Optional[str]:
|
||||||
return self.history["names"][-1]
|
return self.history["names"][-1]
|
||||||
|
|
||||||
def prev_level(self):
|
def prev_level(self) -> Optional[int]:
|
||||||
return self.history["levels"][-1]
|
return self.history["levels"][-1]
|
||||||
|
|
||||||
def prev_numid(self):
|
def prev_numid(self) -> Optional[int]:
|
||||||
return self.history["numids"][-1]
|
return self.history["numids"][-1]
|
||||||
|
|
||||||
def prev_indent(self):
|
def prev_indent(self) -> Optional[int]:
|
||||||
return self.history["indents"][-1]
|
return self.history["indents"][-1]
|
||||||
|
|
||||||
def get_level(self) -> int:
|
def get_level(self) -> int:
|
||||||
@ -131,13 +155,19 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
return k
|
return k
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
def walk_linear(self, body, docx_obj, doc) -> DoclingDocument:
|
def walk_linear(
|
||||||
|
self,
|
||||||
|
body: BaseOxmlElement,
|
||||||
|
docx_obj: DocxDocument,
|
||||||
|
doc: DoclingDocument,
|
||||||
|
) -> DoclingDocument:
|
||||||
for element in body:
|
for element in body:
|
||||||
tag_name = etree.QName(element).localname
|
tag_name = etree.QName(element).localname
|
||||||
# Check for Inline Images (blip elements)
|
# Check for Inline Images (blip elements)
|
||||||
namespaces = {
|
namespaces = {
|
||||||
"a": "http://schemas.openxmlformats.org/drawingml/2006/main",
|
"a": "http://schemas.openxmlformats.org/drawingml/2006/main",
|
||||||
"r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
|
"r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
|
||||||
|
"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
|
||||||
}
|
}
|
||||||
xpath_expr = XPath(".//a:blip", namespaces=namespaces)
|
xpath_expr = XPath(".//a:blip", namespaces=namespaces)
|
||||||
drawing_blip = xpath_expr(element)
|
drawing_blip = xpath_expr(element)
|
||||||
@ -150,7 +180,15 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
_log.debug("could not parse a table, broken docx table")
|
_log.debug("could not parse a table, broken docx table")
|
||||||
|
|
||||||
elif drawing_blip:
|
elif drawing_blip:
|
||||||
self.handle_pictures(element, docx_obj, drawing_blip, doc)
|
self.handle_pictures(docx_obj, drawing_blip, doc)
|
||||||
|
# Check for the sdt containers, like table of contents
|
||||||
|
elif tag_name in ["sdt"]:
|
||||||
|
sdt_content = element.find(".//w:sdtContent", namespaces=namespaces)
|
||||||
|
if sdt_content is not None:
|
||||||
|
# Iterate paragraphs, runs, or text inside <w:sdtContent>.
|
||||||
|
paragraphs = sdt_content.findall(".//w:p", namespaces=namespaces)
|
||||||
|
for p in paragraphs:
|
||||||
|
self.handle_text_elements(p, docx_obj, doc)
|
||||||
# Check for Text
|
# Check for Text
|
||||||
elif tag_name in ["p"]:
|
elif tag_name in ["p"]:
|
||||||
# "tcPr", "sectPr"
|
# "tcPr", "sectPr"
|
||||||
@ -159,7 +197,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
_log.debug(f"Ignoring element in DOCX with tag: {tag_name}")
|
_log.debug(f"Ignoring element in DOCX with tag: {tag_name}")
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
def str_to_int(self, s, default=0):
|
def str_to_int(self, s: Optional[str], default: Optional[int] = 0) -> Optional[int]:
|
||||||
if s is None:
|
if s is None:
|
||||||
return None
|
return None
|
||||||
try:
|
try:
|
||||||
@ -167,7 +205,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
except ValueError:
|
except ValueError:
|
||||||
return default
|
return default
|
||||||
|
|
||||||
def split_text_and_number(self, input_string):
|
def split_text_and_number(self, input_string: str) -> list[str]:
|
||||||
match = re.match(r"(\D+)(\d+)$|^(\d+)(\D+)", input_string)
|
match = re.match(r"(\D+)(\d+)$|^(\d+)(\D+)", input_string)
|
||||||
if match:
|
if match:
|
||||||
parts = list(filter(None, match.groups()))
|
parts = list(filter(None, match.groups()))
|
||||||
@ -175,7 +213,9 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
else:
|
else:
|
||||||
return [input_string]
|
return [input_string]
|
||||||
|
|
||||||
def get_numId_and_ilvl(self, paragraph):
|
def get_numId_and_ilvl(
|
||||||
|
self, paragraph: Paragraph
|
||||||
|
) -> tuple[Optional[int], Optional[int]]:
|
||||||
# Access the XML element of the paragraph
|
# Access the XML element of the paragraph
|
||||||
numPr = paragraph._element.find(
|
numPr = paragraph._element.find(
|
||||||
".//w:numPr", namespaces=paragraph._element.nsmap
|
".//w:numPr", namespaces=paragraph._element.nsmap
|
||||||
@ -188,13 +228,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
numId = numId_elem.get(self.XML_KEY) if numId_elem is not None else None
|
numId = numId_elem.get(self.XML_KEY) if numId_elem is not None else None
|
||||||
ilvl = ilvl_elem.get(self.XML_KEY) if ilvl_elem is not None else None
|
ilvl = ilvl_elem.get(self.XML_KEY) if ilvl_elem is not None else None
|
||||||
|
|
||||||
return self.str_to_int(numId, default=None), self.str_to_int(
|
return self.str_to_int(numId, None), self.str_to_int(ilvl, None)
|
||||||
ilvl, default=None
|
|
||||||
)
|
|
||||||
|
|
||||||
return None, None # If the paragraph is not part of a list
|
return None, None # If the paragraph is not part of a list
|
||||||
|
|
||||||
def get_label_and_level(self, paragraph):
|
def get_label_and_level(self, paragraph: Paragraph) -> tuple[str, Optional[int]]:
|
||||||
if paragraph.style is None:
|
if paragraph.style is None:
|
||||||
return "Normal", None
|
return "Normal", None
|
||||||
label = paragraph.style.style_id
|
label = paragraph.style.style_id
|
||||||
@ -210,20 +248,25 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
|
|
||||||
if "Heading" in label and len(parts) == 2:
|
if "Heading" in label and len(parts) == 2:
|
||||||
parts.sort()
|
parts.sort()
|
||||||
label_str = ""
|
label_str: str = ""
|
||||||
label_level = 0
|
label_level: Optional[int] = 0
|
||||||
if parts[0] == "Heading":
|
if parts[0] == "Heading":
|
||||||
label_str = parts[0]
|
label_str = parts[0]
|
||||||
label_level = self.str_to_int(parts[1], default=None)
|
label_level = self.str_to_int(parts[1], None)
|
||||||
if parts[1] == "Heading":
|
if parts[1] == "Heading":
|
||||||
label_str = parts[1]
|
label_str = parts[1]
|
||||||
label_level = self.str_to_int(parts[0], default=None)
|
label_level = self.str_to_int(parts[0], None)
|
||||||
return label_str, label_level
|
return label_str, label_level
|
||||||
else:
|
else:
|
||||||
return label, None
|
return label, None
|
||||||
|
|
||||||
def handle_text_elements(self, element, docx_obj, doc):
|
def handle_text_elements(
|
||||||
paragraph = docx.text.paragraph.Paragraph(element, docx_obj)
|
self,
|
||||||
|
element: BaseOxmlElement,
|
||||||
|
docx_obj: DocxDocument,
|
||||||
|
doc: DoclingDocument,
|
||||||
|
) -> None:
|
||||||
|
paragraph = Paragraph(element, docx_obj)
|
||||||
|
|
||||||
if paragraph.text is None:
|
if paragraph.text is None:
|
||||||
return
|
return
|
||||||
@ -241,13 +284,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
numid = None
|
numid = None
|
||||||
|
|
||||||
# Handle lists
|
# Handle lists
|
||||||
if numid is not None and ilevel is not None:
|
if (
|
||||||
|
numid is not None
|
||||||
|
and ilevel is not None
|
||||||
|
and p_style_id not in ["Title", "Heading"]
|
||||||
|
):
|
||||||
self.add_listitem(
|
self.add_listitem(
|
||||||
element,
|
|
||||||
docx_obj,
|
|
||||||
doc,
|
doc,
|
||||||
p_style_id,
|
|
||||||
p_level,
|
|
||||||
numid,
|
numid,
|
||||||
ilevel,
|
ilevel,
|
||||||
text,
|
text,
|
||||||
@ -255,20 +298,30 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
)
|
)
|
||||||
self.update_history(p_style_id, p_level, numid, ilevel)
|
self.update_history(p_style_id, p_level, numid, ilevel)
|
||||||
return
|
return
|
||||||
elif numid is None and self.prev_numid() is not None: # Close list
|
elif (
|
||||||
for key, val in self.parents.items():
|
numid is None
|
||||||
if key >= self.level_at_new_list:
|
and self.prev_numid() is not None
|
||||||
|
and p_style_id not in ["Title", "Heading"]
|
||||||
|
): # Close list
|
||||||
|
if self.level_at_new_list:
|
||||||
|
for key in range(len(self.parents)):
|
||||||
|
if key >= self.level_at_new_list:
|
||||||
|
self.parents[key] = None
|
||||||
|
self.level = self.level_at_new_list - 1
|
||||||
|
self.level_at_new_list = None
|
||||||
|
else:
|
||||||
|
for key in range(len(self.parents)):
|
||||||
self.parents[key] = None
|
self.parents[key] = None
|
||||||
self.level = self.level_at_new_list - 1
|
self.level = 0
|
||||||
self.level_at_new_list = None
|
|
||||||
if p_style_id in ["Title"]:
|
if p_style_id in ["Title"]:
|
||||||
for key, val in self.parents.items():
|
for key in range(len(self.parents)):
|
||||||
self.parents[key] = None
|
self.parents[key] = None
|
||||||
self.parents[0] = doc.add_text(
|
self.parents[0] = doc.add_text(
|
||||||
parent=None, label=DocItemLabel.TITLE, text=text
|
parent=None, label=DocItemLabel.TITLE, text=text
|
||||||
)
|
)
|
||||||
elif "Heading" in p_style_id:
|
elif "Heading" in p_style_id:
|
||||||
self.add_header(element, docx_obj, doc, p_style_id, p_level, text)
|
self.add_header(doc, p_level, text)
|
||||||
|
|
||||||
elif p_style_id in [
|
elif p_style_id in [
|
||||||
"Paragraph",
|
"Paragraph",
|
||||||
@ -296,7 +349,9 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
self.update_history(p_style_id, p_level, numid, ilevel)
|
self.update_history(p_style_id, p_level, numid, ilevel)
|
||||||
return
|
return
|
||||||
|
|
||||||
def add_header(self, element, docx_obj, doc, curr_name, curr_level, text: str):
|
def add_header(
|
||||||
|
self, doc: DoclingDocument, curr_level: Optional[int], text: str
|
||||||
|
) -> None:
|
||||||
level = self.get_level()
|
level = self.get_level()
|
||||||
if isinstance(curr_level, int):
|
if isinstance(curr_level, int):
|
||||||
if curr_level > level:
|
if curr_level > level:
|
||||||
@ -309,7 +364,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
)
|
)
|
||||||
elif curr_level < level:
|
elif curr_level < level:
|
||||||
# remove the tail
|
# remove the tail
|
||||||
for key, val in self.parents.items():
|
for key in range(len(self.parents)):
|
||||||
if key >= curr_level:
|
if key >= curr_level:
|
||||||
self.parents[key] = None
|
self.parents[key] = None
|
||||||
|
|
||||||
@ -328,22 +383,18 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
|
|
||||||
def add_listitem(
|
def add_listitem(
|
||||||
self,
|
self,
|
||||||
element,
|
doc: DoclingDocument,
|
||||||
docx_obj,
|
numid: int,
|
||||||
doc,
|
ilevel: int,
|
||||||
p_style_id,
|
|
||||||
p_level,
|
|
||||||
numid,
|
|
||||||
ilevel,
|
|
||||||
text: str,
|
text: str,
|
||||||
is_numbered=False,
|
is_numbered: bool = False,
|
||||||
):
|
) -> None:
|
||||||
# is_numbered = is_numbered
|
|
||||||
enum_marker = ""
|
enum_marker = ""
|
||||||
|
|
||||||
level = self.get_level()
|
level = self.get_level()
|
||||||
|
prev_indent = self.prev_indent()
|
||||||
if self.prev_numid() is None: # Open new list
|
if self.prev_numid() is None: # Open new list
|
||||||
self.level_at_new_list = level # type: ignore
|
self.level_at_new_list = level
|
||||||
|
|
||||||
self.parents[level] = doc.add_group(
|
self.parents[level] = doc.add_group(
|
||||||
label=GroupLabel.LIST, name="list", parent=self.parents[level - 1]
|
label=GroupLabel.LIST, name="list", parent=self.parents[level - 1]
|
||||||
@ -362,10 +413,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
)
|
)
|
||||||
|
|
||||||
elif (
|
elif (
|
||||||
self.prev_numid() == numid and self.prev_indent() < ilevel
|
self.prev_numid() == numid
|
||||||
|
and self.level_at_new_list is not None
|
||||||
|
and prev_indent is not None
|
||||||
|
and prev_indent < ilevel
|
||||||
): # Open indented list
|
): # Open indented list
|
||||||
for i in range(
|
for i in range(
|
||||||
self.level_at_new_list + self.prev_indent() + 1,
|
self.level_at_new_list + prev_indent + 1,
|
||||||
self.level_at_new_list + ilevel + 1,
|
self.level_at_new_list + ilevel + 1,
|
||||||
):
|
):
|
||||||
# Determine if this is an unordered list or an ordered list.
|
# Determine if this is an unordered list or an ordered list.
|
||||||
@ -394,7 +448,12 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
text=text,
|
text=text,
|
||||||
)
|
)
|
||||||
|
|
||||||
elif self.prev_numid() == numid and ilevel < self.prev_indent(): # Close list
|
elif (
|
||||||
|
self.prev_numid() == numid
|
||||||
|
and self.level_at_new_list is not None
|
||||||
|
and prev_indent is not None
|
||||||
|
and ilevel < prev_indent
|
||||||
|
): # Close list
|
||||||
for k, v in self.parents.items():
|
for k, v in self.parents.items():
|
||||||
if k > self.level_at_new_list + ilevel:
|
if k > self.level_at_new_list + ilevel:
|
||||||
self.parents[k] = None
|
self.parents[k] = None
|
||||||
@ -412,7 +471,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
|
|||||||
)
|
)
|
||||||
self.listIter = 0
|
self.listIter = 0
|
||||||
|
|
||||||
elif self.prev_numid() == numid or self.prev_indent() == ilevel:
|
elif self.prev_numid() == numid or prev_indent == ilevel:
|
||||||
# TODO: Set marker and enumerated arguments if this is an enumeration element.
|
# TODO: Set marker and enumerated arguments if this is an enumeration element.
|
||||||
self.listIter += 1
|
self.listIter += 1
|
||||||
if is_numbered:
|
if is_numbered:
|
||||||
@@ -426,31 +485,16 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
            )
        return

-    def handle_tables(self, element, docx_obj, doc):
+    def handle_tables(
+        self,
+        element: BaseOxmlElement,
+        docx_obj: DocxDocument,
+        doc: DoclingDocument,
+    ) -> None:
-        # Function to check if a cell has a colspan (gridSpan)
-        def get_colspan(cell):
-            grid_span = cell._element.xpath("@w:gridSpan")
-            if grid_span:
-                return int(grid_span[0])  # Return the number of columns spanned
-            return 1  # Default is 1 (no colspan)
-
-        # Function to check if a cell has a rowspan (vMerge)
-        def get_rowspan(cell):
-            v_merge = cell._element.xpath("@w:vMerge")
-            if v_merge:
-                return v_merge[
-                    0
-                ]  # 'restart' indicates the beginning of a rowspan, others are continuation
-            return 1
-
-        table = docx.table.Table(element, docx_obj)
+        table: Table = Table(element, docx_obj)

        num_rows = len(table.rows)
-        num_cols = 0
-        for row in table.rows:
-            # Calculate the max number of columns
-            num_cols = max(num_cols, sum(get_colspan(cell) for cell in row.cells))
+        num_cols = len(table.columns)
+        _log.debug(f"Table grid with {num_rows} rows and {num_cols} columns")

        if num_rows == 1 and num_cols == 1:
            cell_element = table.rows[0].cells[0]
@@ -459,59 +503,56 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
            self.walk_linear(cell_element._element, docx_obj, doc)
            return

-        # Initialize the table grid
-        table_grid = [[None for _ in range(num_cols)] for _ in range(num_rows)]
-
-        data = TableData(num_rows=num_rows, num_cols=num_cols, table_cells=[])
+        data = TableData(num_rows=num_rows, num_cols=num_cols)
+        cell_set: set[CT_Tc] = set()

        for row_idx, row in enumerate(table.rows):
+            _log.debug(f"Row index {row_idx} with {len(row.cells)} populated cells")
            col_idx = 0
-            for c, cell in enumerate(row.cells):
-                row_span = get_rowspan(cell)
-                col_span = get_colspan(cell)
+            while col_idx < num_cols:
+                cell: _Cell = row.cells[col_idx]
+                _log.debug(
+                    f" col {col_idx} grid_span {cell.grid_span} grid_cols_before {row.grid_cols_before}"
+                )
+                if cell is None or cell._tc in cell_set:
+                    _log.debug(f" skipped since repeated content")
+                    col_idx += cell.grid_span
+                    continue
+                else:
+                    cell_set.add(cell._tc)

-                cell_text = cell.text
-                # In case cell doesn't return text via docx library:
-                if len(cell_text) == 0:
-                    cell_xml = cell._element
-
-                    texts = [""]
-                    for elem in cell_xml.iter():
-                        if elem.tag.endswith("t"):  # <w:t> tags that contain text
-                            if elem.text:
-                                texts.append(elem.text)
-                    # Join the collected text
-                    cell_text = " ".join(texts).strip()
-
-                # Find the next available column in the grid
-                while table_grid[row_idx][col_idx] is not None:
-                    col_idx += 1
-
-                # Fill the grid with the cell value, considering rowspan and colspan
-                for i in range(row_span if row_span == "restart" else 1):
-                    for j in range(col_span):
-                        table_grid[row_idx + i][col_idx + j] = ""
-
-                cell = TableCell(
-                    text=cell_text,
-                    row_span=row_span,
-                    col_span=col_span,
-                    start_row_offset_idx=row_idx,
-                    end_row_offset_idx=row_idx + row_span,
+                spanned_idx = row_idx
+                spanned_tc: Optional[CT_Tc] = cell._tc
+                while spanned_tc == cell._tc:
+                    spanned_idx += 1
+                    spanned_tc = (
+                        table.rows[spanned_idx].cells[col_idx]._tc
+                        if spanned_idx < num_rows
+                        else None
+                    )
+                _log.debug(f" spanned before row {spanned_idx}")
+
+                table_cell = TableCell(
+                    text=cell.text,
+                    row_span=spanned_idx - row_idx,
+                    col_span=cell.grid_span,
+                    start_row_offset_idx=row.grid_cols_before + row_idx,
+                    end_row_offset_idx=row.grid_cols_before + spanned_idx,
                    start_col_offset_idx=col_idx,
-                    end_col_offset_idx=col_idx + col_span,
+                    end_col_offset_idx=col_idx + cell.grid_span,
                    col_header=False,
                    row_header=False,
                )
-                data.table_cells.append(cell)
+                data.table_cells.append(table_cell)
+                col_idx += cell.grid_span

        level = self.get_level()
        doc.add_table(data=data, parent=self.parents[level - 1])
        return
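The rewritten cell loop relies on python-docx repeating the same underlying `<w:tc>` element for merged cells: a vertical span is measured by walking down the rows until the `_tc` changes, and a horizontal span comes from `grid_span`. A standalone sketch of that idea, assuming a python-docx version that exposes `grid_span` on `_Cell`; the file name and printout are illustrative only.

```python
from docx import Document

doc = Document("tables.docx")  # hypothetical input file
table = doc.tables[0]
rows = list(table.rows)

seen = set()
for row_idx, row in enumerate(rows):
    for col_idx, cell in enumerate(row.cells):
        if cell._tc in seen:
            continue  # same underlying <w:tc> element -> continuation of a span
        seen.add(cell._tc)

        # A vertical merge repeats the same <w:tc>; count how far down it reaches.
        row_span = 1
        for next_row in rows[row_idx + 1 :]:
            if col_idx < len(next_row.cells) and next_row.cells[col_idx]._tc is cell._tc:
                row_span += 1
            else:
                break

        col_span = cell.grid_span  # grid columns covered horizontally
        print(row_idx, col_idx, repr(cell.text), row_span, col_span)
```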
-    def handle_pictures(self, element, docx_obj, drawing_blip, doc):
-        def get_docx_image(element, drawing_blip):
+    def handle_pictures(
+        self, docx_obj: DocxDocument, drawing_blip: Any, doc: DoclingDocument
+    ) -> None:
+        def get_docx_image(drawing_blip):
            rId = drawing_blip[0].get(
                "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"
            )
@@ -521,11 +562,11 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
            image_data = image_part.blob  # Get the binary image data
            return image_data

-        image_data = get_docx_image(element, drawing_blip)
-        image_bytes = BytesIO(image_data)
        level = self.get_level()
        # Open the BytesIO object with PIL to create an Image
        try:
+            image_data = get_docx_image(drawing_blip)
+            image_bytes = BytesIO(image_data)
            pil_image = Image.open(image_bytes)
            doc.add_picture(
                parent=self.parents[level - 1],
@@ -12,7 +12,6 @@ from docling.datamodel.document import InputDocument


class PdfPageBackend(ABC):
-
    @abstractmethod
    def get_text_in_rect(self, bbox: BoundingBox) -> str:
        pass
@@ -45,7 +44,6 @@ class PdfPageBackend(ABC):


class PdfDocumentBackend(PaginatedDocumentBackend):
-
    def __init__(self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]):
        super().__init__(in_doc, path_or_stream)
@@ -210,7 +210,7 @@ class PyPdfiumPageBackend(PdfPageBackend):
                l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
            )
        else:
-            padbox = cropbox.to_bottom_left_origin(page_size.height)
+            padbox = cropbox.to_bottom_left_origin(page_size.height).model_copy()
            padbox.r = page_size.width - padbox.r
            padbox.t = page_size.height - padbox.t
|
@ -389,7 +389,7 @@ class PatentUsptoIce(PatentUspto):
|
|||||||
if name == self.Element.TITLE.value:
|
if name == self.Element.TITLE.value:
|
||||||
if text:
|
if text:
|
||||||
self.parents[self.level + 1] = self.doc.add_title(
|
self.parents[self.level + 1] = self.doc.add_title(
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
text=text,
|
text=text,
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
@ -406,7 +406,7 @@ class PatentUsptoIce(PatentUspto):
|
|||||||
abstract_item = self.doc.add_heading(
|
abstract_item = self.doc.add_heading(
|
||||||
heading_text,
|
heading_text,
|
||||||
level=heading_level,
|
level=heading_level,
|
||||||
parent=self.parents[heading_level], # type: ignore[arg-type]
|
parent=self.parents[heading_level],
|
||||||
)
|
)
|
||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
@ -434,7 +434,7 @@ class PatentUsptoIce(PatentUspto):
|
|||||||
claims_item = self.doc.add_heading(
|
claims_item = self.doc.add_heading(
|
||||||
heading_text,
|
heading_text,
|
||||||
level=heading_level,
|
level=heading_level,
|
||||||
parent=self.parents[heading_level], # type: ignore[arg-type]
|
parent=self.parents[heading_level],
|
||||||
)
|
)
|
||||||
for text in self.claims:
|
for text in self.claims:
|
||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
@ -452,7 +452,7 @@ class PatentUsptoIce(PatentUspto):
|
|||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
text=text,
|
text=text,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
self.text = ""
|
self.text = ""
|
||||||
|
|
||||||
@ -460,7 +460,7 @@ class PatentUsptoIce(PatentUspto):
|
|||||||
self.parents[self.level + 1] = self.doc.add_heading(
|
self.parents[self.level + 1] = self.doc.add_heading(
|
||||||
text=text,
|
text=text,
|
||||||
level=self.level,
|
level=self.level,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
self.text = ""
|
self.text = ""
|
||||||
@ -470,7 +470,7 @@ class PatentUsptoIce(PatentUspto):
|
|||||||
empty_table = TableData(num_rows=0, num_cols=0, table_cells=[])
|
empty_table = TableData(num_rows=0, num_cols=0, table_cells=[])
|
||||||
self.doc.add_table(
|
self.doc.add_table(
|
||||||
data=empty_table,
|
data=empty_table,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
|
|
||||||
def _apply_style(self, text: str, style_tag: str) -> str:
|
def _apply_style(self, text: str, style_tag: str) -> str:
|
||||||
@ -721,7 +721,7 @@ class PatentUsptoGrantV2(PatentUspto):
|
|||||||
if self.Element.TITLE.value in self.property and text.strip():
|
if self.Element.TITLE.value in self.property and text.strip():
|
||||||
title = text.strip()
|
title = text.strip()
|
||||||
self.parents[self.level + 1] = self.doc.add_title(
|
self.parents[self.level + 1] = self.doc.add_title(
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
text=title,
|
text=title,
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
@ -749,7 +749,7 @@ class PatentUsptoGrantV2(PatentUspto):
|
|||||||
self.parents[self.level + 1] = self.doc.add_heading(
|
self.parents[self.level + 1] = self.doc.add_heading(
|
||||||
text=text.strip(),
|
text=text.strip(),
|
||||||
level=self.level,
|
level=self.level,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
|
|
||||||
@ -769,7 +769,7 @@ class PatentUsptoGrantV2(PatentUspto):
|
|||||||
claims_item = self.doc.add_heading(
|
claims_item = self.doc.add_heading(
|
||||||
heading_text,
|
heading_text,
|
||||||
level=heading_level,
|
level=heading_level,
|
||||||
parent=self.parents[heading_level], # type: ignore[arg-type]
|
parent=self.parents[heading_level],
|
||||||
)
|
)
|
||||||
for text in self.claims:
|
for text in self.claims:
|
||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
@ -787,7 +787,7 @@ class PatentUsptoGrantV2(PatentUspto):
|
|||||||
abstract_item = self.doc.add_heading(
|
abstract_item = self.doc.add_heading(
|
||||||
heading_text,
|
heading_text,
|
||||||
level=heading_level,
|
level=heading_level,
|
||||||
parent=self.parents[heading_level], # type: ignore[arg-type]
|
parent=self.parents[heading_level],
|
||||||
)
|
)
|
||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH, text=abstract, parent=abstract_item
|
label=DocItemLabel.PARAGRAPH, text=abstract, parent=abstract_item
|
||||||
@ -799,7 +799,7 @@ class PatentUsptoGrantV2(PatentUspto):
|
|||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
text=paragraph,
|
text=paragraph,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
elif self.Element.CLAIM.value in self.property:
|
elif self.Element.CLAIM.value in self.property:
|
||||||
# we may need a space after a paragraph in claim text
|
# we may need a space after a paragraph in claim text
|
||||||
@ -811,7 +811,7 @@ class PatentUsptoGrantV2(PatentUspto):
|
|||||||
empty_table = TableData(num_rows=0, num_cols=0, table_cells=[])
|
empty_table = TableData(num_rows=0, num_cols=0, table_cells=[])
|
||||||
self.doc.add_table(
|
self.doc.add_table(
|
||||||
data=empty_table,
|
data=empty_table,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
|
|
||||||
def _apply_style(self, text: str, style_tag: str) -> str:
|
def _apply_style(self, text: str, style_tag: str) -> str:
|
||||||
@ -938,7 +938,7 @@ class PatentUsptoGrantAps(PatentUspto):
|
|||||||
self.parents[self.level + 1] = self.doc.add_heading(
|
self.parents[self.level + 1] = self.doc.add_heading(
|
||||||
heading.value,
|
heading.value,
|
||||||
level=self.level,
|
level=self.level,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
|
|
||||||
@ -959,7 +959,7 @@ class PatentUsptoGrantAps(PatentUspto):
|
|||||||
|
|
||||||
if field == self.Field.TITLE.value:
|
if field == self.Field.TITLE.value:
|
||||||
self.parents[self.level + 1] = self.doc.add_title(
|
self.parents[self.level + 1] = self.doc.add_title(
|
||||||
parent=self.parents[self.level], text=value # type: ignore[arg-type]
|
parent=self.parents[self.level], text=value
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
|
|
||||||
@ -971,14 +971,14 @@ class PatentUsptoGrantAps(PatentUspto):
|
|||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
text=value,
|
text=value,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
|
|
||||||
elif field == self.Field.NUMBER.value and section == self.Section.CLAIMS.value:
|
elif field == self.Field.NUMBER.value and section == self.Section.CLAIMS.value:
|
||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
text="",
|
text="",
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
|
|
||||||
elif (
|
elif (
|
||||||
@ -996,7 +996,7 @@ class PatentUsptoGrantAps(PatentUspto):
|
|||||||
last_claim = self.doc.add_text(
|
last_claim = self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
text="",
|
text="",
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
|
|
||||||
last_claim.text += f" {value}" if last_claim.text else value
|
last_claim.text += f" {value}" if last_claim.text else value
|
||||||
@ -1012,7 +1012,7 @@ class PatentUsptoGrantAps(PatentUspto):
|
|||||||
self.parents[self.level + 1] = self.doc.add_heading(
|
self.parents[self.level + 1] = self.doc.add_heading(
|
||||||
value,
|
value,
|
||||||
level=self.level,
|
level=self.level,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
|
|
||||||
@ -1029,7 +1029,7 @@ class PatentUsptoGrantAps(PatentUspto):
|
|||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
text=value,
|
text=value,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
|
|
||||||
def parse(self, patent_content: str) -> Optional[DoclingDocument]:
|
def parse(self, patent_content: str) -> Optional[DoclingDocument]:
|
||||||
@ -1283,7 +1283,7 @@ class PatentUsptoAppV1(PatentUspto):
|
|||||||
title = text.strip()
|
title = text.strip()
|
||||||
if title:
|
if title:
|
||||||
self.parents[self.level + 1] = self.doc.add_text(
|
self.parents[self.level + 1] = self.doc.add_text(
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
label=DocItemLabel.TITLE,
|
label=DocItemLabel.TITLE,
|
||||||
text=title,
|
text=title,
|
||||||
)
|
)
|
||||||
@ -1301,7 +1301,7 @@ class PatentUsptoAppV1(PatentUspto):
|
|||||||
abstract_item = self.doc.add_heading(
|
abstract_item = self.doc.add_heading(
|
||||||
heading_text,
|
heading_text,
|
||||||
level=heading_level,
|
level=heading_level,
|
||||||
parent=self.parents[heading_level], # type: ignore[arg-type]
|
parent=self.parents[heading_level],
|
||||||
)
|
)
|
||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
@ -1331,7 +1331,7 @@ class PatentUsptoAppV1(PatentUspto):
|
|||||||
claims_item = self.doc.add_heading(
|
claims_item = self.doc.add_heading(
|
||||||
heading_text,
|
heading_text,
|
||||||
level=heading_level,
|
level=heading_level,
|
||||||
parent=self.parents[heading_level], # type: ignore[arg-type]
|
parent=self.parents[heading_level],
|
||||||
)
|
)
|
||||||
for text in self.claims:
|
for text in self.claims:
|
||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
@ -1350,14 +1350,14 @@ class PatentUsptoAppV1(PatentUspto):
|
|||||||
self.parents[self.level + 1] = self.doc.add_heading(
|
self.parents[self.level + 1] = self.doc.add_heading(
|
||||||
text=text,
|
text=text,
|
||||||
level=self.level,
|
level=self.level,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
self.level += 1
|
self.level += 1
|
||||||
else:
|
else:
|
||||||
self.doc.add_text(
|
self.doc.add_text(
|
||||||
label=DocItemLabel.PARAGRAPH,
|
label=DocItemLabel.PARAGRAPH,
|
||||||
text=text,
|
text=text,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
self.text = ""
|
self.text = ""
|
||||||
|
|
||||||
@ -1366,7 +1366,7 @@ class PatentUsptoAppV1(PatentUspto):
|
|||||||
empty_table = TableData(num_rows=0, num_cols=0, table_cells=[])
|
empty_table = TableData(num_rows=0, num_cols=0, table_cells=[])
|
||||||
self.doc.add_table(
|
self.doc.add_table(
|
||||||
data=empty_table,
|
data=empty_table,
|
||||||
parent=self.parents[self.level], # type: ignore[arg-type]
|
parent=self.parents[self.level],
|
||||||
)
|
)
|
||||||
|
|
||||||
def _apply_style(self, text: str, style_tag: str) -> str:
|
def _apply_style(self, text: str, style_tag: str) -> str:
|
||||||
|
@@ -1,18 +1,18 @@
import importlib
- import json
import logging
+ import platform
import re
+ import sys
import tempfile
import time
import warnings
- from enum import Enum
from pathlib import Path
from typing import Annotated, Dict, Iterable, List, Optional, Type

import typer
from docling_core.types.doc import ImageRefMode
from docling_core.utils.file import resolve_source_to_path
- from pydantic import TypeAdapter, ValidationError
+ from pydantic import TypeAdapter

from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
@@ -65,10 +65,15 @@ def version_callback(value: bool):
        docling_core_version = importlib.metadata.version("docling-core")
        docling_ibm_models_version = importlib.metadata.version("docling-ibm-models")
        docling_parse_version = importlib.metadata.version("docling-parse")
+        platform_str = platform.platform()
+        py_impl_version = sys.implementation.cache_tag
+        py_lang_version = platform.python_version()
        print(f"Docling version: {docling_version}")
        print(f"Docling Core version: {docling_core_version}")
        print(f"Docling IBM Models version: {docling_ibm_models_version}")
        print(f"Docling Parse version: {docling_parse_version}")
+        print(f"Python: {py_impl_version} ({py_lang_version})")
+        print(f"Platform: {platform_str}")
        raise typer.Exit()
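The new version printout pulls the platform string and Python implementation details from the standard library. A minimal standalone sketch of the same calls:

```python
import platform
import sys

# Same standard-library calls used by version_callback above.
print(f"Python: {sys.implementation.cache_tag} ({platform.python_version()})")
print(f"Platform: {platform.platform()}")
```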
@@ -206,6 +211,14 @@ def convert(
            TableFormerMode,
            typer.Option(..., help="The mode to use in the table structure model."),
        ] = TableFormerMode.FAST,
+        enrich_code: Annotated[
+            bool,
+            typer.Option(..., help="Enable the code enrichment model in the pipeline."),
+        ] = False,
+        enrich_formula: Annotated[
+            bool,
+            typer.Option(..., help="Enable the formula enrichment model in the pipeline."),
+        ] = False,
        artifacts_path: Annotated[
            Optional[Path],
            typer.Option(..., help="If provided, the location of the model artifacts."),
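The two new flags feed the `do_code_enrichment` and `do_formula_enrichment` pipeline options wired up further down in this commit. A hedged sketch of the equivalent Python API usage (paths and option wiring are illustrative, not taken from the diff):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_code_enrichment=True,     # run the code enrichment model
    do_formula_enrichment=True,  # run the formula enrichment model
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("paper.pdf")  # hypothetical input
print(result.document.export_to_markdown())
```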
@@ -360,6 +373,8 @@ def convert(
            do_ocr=ocr,
            ocr_options=ocr_options,
            do_table_structure=True,
+            do_code_enrichment=enrich_code,
+            do_formula_enrichment=enrich_formula,
            document_timeout=document_timeout,
        )
        pipeline_options.table_structure_options.do_cell_matching = (
@@ -4,6 +4,7 @@ from typing import TYPE_CHECKING, Dict, List, Optional, Union
from docling_core.types.doc import (
    BoundingBox,
    DocItemLabel,
+     NodeItem,
    PictureDataType,
    Size,
    TableCell,
@@ -40,6 +41,7 @@ class InputFormat(str, Enum):
    MD = "md"
    XLSX = "xlsx"
    XML_USPTO = "xml_uspto"
+     JSON_DOCLING = "json_docling"


class OutputFormat(str, Enum):
@@ -61,6 +63,7 @@ FormatToExtensions: Dict[InputFormat, List[str]] = {
    InputFormat.ASCIIDOC: ["adoc", "asciidoc", "asc"],
    InputFormat.XLSX: ["xlsx"],
    InputFormat.XML_USPTO: ["xml", "txt"],
+     InputFormat.JSON_DOCLING: ["json"],
}

FormatToMimeType: Dict[InputFormat, List[str]] = {
@@ -89,6 +92,7 @@ FormatToMimeType: Dict[InputFormat, List[str]] = {
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    ],
    InputFormat.XML_USPTO: ["application/xml", "text/plain"],
+     InputFormat.JSON_DOCLING: ["application/json"],
}

MimeTypeToFormat: dict[str, list[InputFormat]] = {
@@ -201,6 +205,13 @@ class AssembledUnit(BaseModel):
    headers: List[PageElement] = []


+ class ItemAndImageEnrichmentElement(BaseModel):
+     model_config = ConfigDict(arbitrary_types_allowed=True)
+
+     item: NodeItem
+     image: Image
+
+
class Page(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

@@ -219,12 +230,28 @@ class Page(BaseModel):
        {}
    )  # Cache of images in different scales. By default it is cleared during assembling.

-    def get_image(self, scale: float = 1.0) -> Optional[Image]:
+    def get_image(
+        self, scale: float = 1.0, cropbox: Optional[BoundingBox] = None
+    ) -> Optional[Image]:
        if self._backend is None:
            return self._image_cache.get(scale, None)

        if not scale in self._image_cache:
-            self._image_cache[scale] = self._backend.get_page_image(scale=scale)
-        return self._image_cache[scale]
+            if cropbox is None:
+                self._image_cache[scale] = self._backend.get_page_image(scale=scale)
+            else:
+                return self._backend.get_page_image(scale=scale, cropbox=cropbox)
+
+        if cropbox is None:
+            return self._image_cache[scale]
+        else:
+            page_im = self._image_cache[scale]
+            assert self.size is not None
+            return page_im.crop(
+                cropbox.to_top_left_origin(page_height=self.size.height)
+                .scaled(scale=scale)
+                .as_tuple()
+            )

    @property
    def image(self) -> Optional[Image]:
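The new cropbox branch converts the bounding box to top-left image coordinates, scales it, and hands the resulting tuple to PIL. A standalone sketch of that arithmetic; the page image and box values below are made up for illustration.

```python
from docling_core.types.doc import BoundingBox, CoordOrigin
from PIL import Image

scale = 2.0
page_height = 842.0  # illustrative page height in PDF points
page_im = Image.new("RGB", (int(595 * scale), int(page_height * scale)), "white")

bbox = BoundingBox(l=72, t=700, r=300, b=650, coord_origin=CoordOrigin.BOTTOMLEFT)

# Same chain as Page.get_image: flip to top-left origin, scale, crop.
crop = page_im.crop(
    bbox.to_top_left_origin(page_height=page_height).scaled(scale=scale).as_tuple()
)
print(crop.size)
```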
|
@ -157,6 +157,8 @@ class InputDocument(BaseModel):
|
|||||||
self.page_count = self._backend.page_count()
|
self.page_count = self._backend.page_count()
|
||||||
if not self.page_count <= self.limits.max_num_pages:
|
if not self.page_count <= self.limits.max_num_pages:
|
||||||
self.valid = False
|
self.valid = False
|
||||||
|
elif self.page_count < self.limits.page_range[0]:
|
||||||
|
self.valid = False
|
||||||
|
|
||||||
except (FileNotFoundError, OSError) as e:
|
except (FileNotFoundError, OSError) as e:
|
||||||
self.valid = False
|
self.valid = False
|
||||||
@ -350,6 +352,10 @@ class _DocumentConversionInput(BaseModel):
|
|||||||
mime = FormatToMimeType[InputFormat.HTML][0]
|
mime = FormatToMimeType[InputFormat.HTML][0]
|
||||||
elif ext in FormatToExtensions[InputFormat.MD]:
|
elif ext in FormatToExtensions[InputFormat.MD]:
|
||||||
mime = FormatToMimeType[InputFormat.MD][0]
|
mime = FormatToMimeType[InputFormat.MD][0]
|
||||||
|
elif ext in FormatToExtensions[InputFormat.JSON_DOCLING]:
|
||||||
|
mime = FormatToMimeType[InputFormat.JSON_DOCLING][0]
|
||||||
|
elif ext in FormatToExtensions[InputFormat.PDF]:
|
||||||
|
mime = FormatToMimeType[InputFormat.PDF][0]
|
||||||
return mime
|
return mime
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
|
@@ -1,17 +1,11 @@
import logging
import os
- import warnings
from enum import Enum
from pathlib import Path
- from typing import Annotated, Any, Dict, List, Literal, Optional, Tuple, Type, Union
+ from typing import Any, List, Literal, Optional, Union

- from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator
+ from pydantic import BaseModel, ConfigDict, Field, model_validator
- from pydantic_settings import (
-     BaseSettings,
-     PydanticBaseSettingsSource,
-     SettingsConfigDict,
- )
- from typing_extensions import deprecated
+ from pydantic_settings import BaseSettings, SettingsConfigDict

_log = logging.getLogger(__name__)
@@ -125,6 +119,7 @@ class RapidOcrOptions(OcrOptions):
    det_model_path: Optional[str] = None  # same default as rapidocr
    cls_model_path: Optional[str] = None  # same default as rapidocr
    rec_model_path: Optional[str] = None  # same default as rapidocr
+     rec_keys_path: Optional[str] = None  # same default as rapidocr

    model_config = ConfigDict(
        extra="forbid",
@@ -225,6 +220,9 @@ class PdfPipelineOptions(PipelineOptions):
    artifacts_path: Optional[Union[Path, str]] = None
    do_table_structure: bool = True  # True: perform table structure extraction
    do_ocr: bool = True  # True: perform OCR, replace programmatic PDF text
+     do_code_enrichment: bool = False  # True: perform code OCR
+     do_formula_enrichment: bool = False  # True: perform formula OCR, return Latex code
+     do_picture_classification: bool = False  # True: classify pictures in documents

    table_structure_options: TableStructureOptions = TableStructureOptions()
    ocr_options: Union[
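With the new `rec_keys_path` field, a custom recognition dictionary can be passed straight into the OCR options. A hedged sketch — the file paths are placeholders, and the exact RapidOCR model files are an assumption:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

ocr_options = RapidOcrOptions(
    rec_model_path="/models/rec_model.onnx",  # placeholder recognition model
    rec_keys_path="/models/custom_keys.txt",  # placeholder custom dictionary
)
pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("scanned.pdf")  # hypothetical input
```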
@@ -1,13 +1,28 @@
import sys
from pathlib import Path
+ from typing import Annotated, Tuple

- from pydantic import BaseModel
+ from pydantic import BaseModel, PlainValidator
from pydantic_settings import BaseSettings, SettingsConfigDict


+ def _validate_page_range(v: Tuple[int, int]) -> Tuple[int, int]:
+     if v[0] < 1 or v[1] < v[0]:
+         raise ValueError(
+             "Invalid page range: start must be ≥ 1 and end must be ≥ start."
+         )
+     return v
+
+
+ PageRange = Annotated[Tuple[int, int], PlainValidator(_validate_page_range)]
+
+ DEFAULT_PAGE_RANGE: PageRange = (1, sys.maxsize)
+
+
class DocumentLimits(BaseModel):
    max_num_pages: int = sys.maxsize
    max_file_size: int = sys.maxsize
+     page_range: PageRange = DEFAULT_PAGE_RANGE


class BatchConcurrencySettings(BaseModel):
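The `PlainValidator` wired into `PageRange` makes any pydantic model (or `TypeAdapter`) reject inverted or zero-based ranges. A small sketch of the behaviour, assuming the definitions above:

```python
from pydantic import TypeAdapter

from docling.datamodel.settings import DocumentLimits, PageRange

adapter = TypeAdapter(PageRange)
print(adapter.validate_python((2, 10)))  # accepted: (2, 10)

try:
    DocumentLimits(page_range=(5, 2))  # end before start -> validation error
except Exception as err:
    print(err)
```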
@@ -1,9 +1,10 @@
import logging
+ import math
import sys
import time
from functools import partial
from pathlib import Path
- from typing import Dict, Iterable, Iterator, List, Optional, Type, Union
+ from typing import Dict, Iterable, Iterator, List, Optional, Tuple, Type, Union

from pydantic import BaseModel, ConfigDict, model_validator, validate_call

@@ -11,6 +12,7 @@ from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.backend.asciidoc_backend import AsciiDocBackend
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.backend.html_backend import HTMLDocumentBackend
+ from docling.backend.json.docling_json_backend import DoclingJSONBackend
from docling.backend.md_backend import MarkdownDocumentBackend
from docling.backend.msexcel_backend import MsExcelDocumentBackend
from docling.backend.mspowerpoint_backend import MsPowerpointDocumentBackend
@@ -30,7 +32,12 @@ from docling.datamodel.document import (
    _DocumentConversionInput,
)
from docling.datamodel.pipeline_options import PipelineOptions
- from docling.datamodel.settings import DocumentLimits, settings
+ from docling.datamodel.settings import (
+     DEFAULT_PAGE_RANGE,
+     DocumentLimits,
+     PageRange,
+     settings,
+ )
from docling.exceptions import ConversionError
from docling.pipeline.base_pipeline import BasePipeline
from docling.pipeline.simple_pipeline import SimplePipeline
@@ -136,6 +143,9 @@ def _get_default_option(format: InputFormat) -> FormatOption:
        InputFormat.PDF: FormatOption(
            pipeline_cls=StandardPdfPipeline, backend=DoclingParseV2DocumentBackend
        ),
+         InputFormat.JSON_DOCLING: FormatOption(
+             pipeline_cls=SimplePipeline, backend=DoclingJSONBackend
+         ),
    }
    if (options := format_to_default_options.get(format)) is not None:
        return options
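With `JSON_DOCLING` registered as a default format option, a previously exported Docling JSON file can be loaded back through the regular converter. A minimal sketch; the file name is illustrative:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.docling.json")  # hypothetical Docling JSON export
print(result.document.name)
```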
@ -180,6 +190,7 @@ class DocumentConverter:
|
|||||||
raises_on_error: bool = True,
|
raises_on_error: bool = True,
|
||||||
max_num_pages: int = sys.maxsize,
|
max_num_pages: int = sys.maxsize,
|
||||||
max_file_size: int = sys.maxsize,
|
max_file_size: int = sys.maxsize,
|
||||||
|
page_range: PageRange = DEFAULT_PAGE_RANGE,
|
||||||
) -> ConversionResult:
|
) -> ConversionResult:
|
||||||
all_res = self.convert_all(
|
all_res = self.convert_all(
|
||||||
source=[source],
|
source=[source],
|
||||||
@ -187,6 +198,7 @@ class DocumentConverter:
|
|||||||
max_num_pages=max_num_pages,
|
max_num_pages=max_num_pages,
|
||||||
max_file_size=max_file_size,
|
max_file_size=max_file_size,
|
||||||
headers=headers,
|
headers=headers,
|
||||||
|
page_range=page_range,
|
||||||
)
|
)
|
||||||
return next(all_res)
|
return next(all_res)
|
||||||
|
|
||||||
@ -198,10 +210,12 @@ class DocumentConverter:
|
|||||||
raises_on_error: bool = True, # True: raises on first conversion error; False: does not raise on conv error
|
raises_on_error: bool = True, # True: raises on first conversion error; False: does not raise on conv error
|
||||||
max_num_pages: int = sys.maxsize,
|
max_num_pages: int = sys.maxsize,
|
||||||
max_file_size: int = sys.maxsize,
|
max_file_size: int = sys.maxsize,
|
||||||
|
page_range: PageRange = DEFAULT_PAGE_RANGE,
|
||||||
) -> Iterator[ConversionResult]:
|
) -> Iterator[ConversionResult]:
|
||||||
limits = DocumentLimits(
|
limits = DocumentLimits(
|
||||||
max_num_pages=max_num_pages,
|
max_num_pages=max_num_pages,
|
||||||
max_file_size=max_file_size,
|
max_file_size=max_file_size,
|
||||||
|
page_range=page_range,
|
||||||
)
|
)
|
||||||
conv_input = _DocumentConversionInput(
|
conv_input = _DocumentConversionInput(
|
||||||
path_or_stream_iterator=source, limits=limits, headers=headers
|
path_or_stream_iterator=source, limits=limits, headers=headers
|
||||||
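For orientation, a minimal sketch of how the new `page_range` argument could be used from the public API. The argument and its default come from the diff above; the file name and the chosen range are only illustrative.

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert only pages 2-5 of a (hypothetical) input file; omitting page_range
# keeps the previous behaviour of converting every page (DEFAULT_PAGE_RANGE).
result = converter.convert("report.pdf", page_range=(2, 5))
print(result.document.num_pages())
```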
@ -1,9 +1,10 @@
 from abc import ABC, abstractmethod
-from typing import Any, Iterable
+from typing import Any, Generic, Iterable, Optional
 
-from docling_core.types.doc import DoclingDocument, NodeItem
+from docling_core.types.doc import BoundingBox, DoclingDocument, NodeItem, TextItem
+from typing_extensions import TypeVar
 
-from docling.datamodel.base_models import Page
+from docling.datamodel.base_models import ItemAndImageEnrichmentElement, Page
 from docling.datamodel.document import ConversionResult
 
 
@ -15,14 +16,69 @@ class BasePageModel(ABC):
         pass
 
 
-class BaseEnrichmentModel(ABC):
+EnrichElementT = TypeVar("EnrichElementT", default=NodeItem)
+
+
+class GenericEnrichmentModel(ABC, Generic[EnrichElementT]):
 
     @abstractmethod
     def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:
         pass
 
     @abstractmethod
-    def __call__(
-        self, doc: DoclingDocument, element_batch: Iterable[NodeItem]
-    ) -> Iterable[Any]:
+    def prepare_element(
+        self, conv_res: ConversionResult, element: NodeItem
+    ) -> Optional[EnrichElementT]:
         pass
 
+    @abstractmethod
+    def __call__(
+        self, doc: DoclingDocument, element_batch: Iterable[EnrichElementT]
+    ) -> Iterable[NodeItem]:
+        pass
+
+
+class BaseEnrichmentModel(GenericEnrichmentModel[NodeItem]):
+
+    def prepare_element(
+        self, conv_res: ConversionResult, element: NodeItem
+    ) -> Optional[NodeItem]:
+        if self.is_processable(doc=conv_res.document, element=element):
+            return element
+        return None
+
+
+class BaseItemAndImageEnrichmentModel(
+    GenericEnrichmentModel[ItemAndImageEnrichmentElement]
+):
+
+    images_scale: float
+    expansion_factor: float = 0.0
+
+    def prepare_element(
+        self, conv_res: ConversionResult, element: NodeItem
+    ) -> Optional[ItemAndImageEnrichmentElement]:
+        if not self.is_processable(doc=conv_res.document, element=element):
+            return None
+
+        assert isinstance(element, TextItem)
+        element_prov = element.prov[0]
+
+        bbox = element_prov.bbox
+        width = bbox.r - bbox.l
+        height = bbox.t - bbox.b
+
+        # TODO: move to a utility in the BoundingBox class
+        expanded_bbox = BoundingBox(
+            l=bbox.l - width * self.expansion_factor,
+            t=bbox.t + height * self.expansion_factor,
+            r=bbox.r + width * self.expansion_factor,
+            b=bbox.b - height * self.expansion_factor,
+            coord_origin=bbox.coord_origin,
+        )
+
+        page_ix = element_prov.page_no - 1
+        cropped_image = conv_res.pages[page_ix].get_image(
+            scale=self.images_scale, cropbox=expanded_bbox
+        )
+        return ItemAndImageEnrichmentElement(item=element, image=cropped_image)
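As a rough illustration of the split between `prepare_element` and `__call__` introduced here, a custom enrichment model could be sketched as below. The class name and its behaviour are invented for the example; only the base-class contract is taken from the code above.

```python
from typing import Iterable

from docling_core.types.doc import DoclingDocument, NodeItem, TextItem

from docling.datamodel.base_models import ItemAndImageEnrichmentElement
from docling.models.base_model import BaseItemAndImageEnrichmentModel


class ExampleTextCropModel(BaseItemAndImageEnrichmentModel):
    # the inherited prepare_element() crops each element's bounding box at this scale
    images_scale = 2.0
    expansion_factor = 0.05

    def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:
        return isinstance(element, TextItem)

    def __call__(
        self,
        doc: DoclingDocument,
        element_batch: Iterable[ItemAndImageEnrichmentElement],
    ) -> Iterable[NodeItem]:
        for el in element_batch:
            # el.item is the original TextItem, el.image the expanded crop
            el.item.text = el.item.text.strip()
            yield el.item
```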
245 docling/models/code_formula_model.py Normal file
@ -0,0 +1,245 @@
+import re
+from pathlib import Path
+from typing import Iterable, List, Literal, Optional, Tuple, Union
+
+from docling_core.types.doc import (
+    CodeItem,
+    DocItemLabel,
+    DoclingDocument,
+    NodeItem,
+    TextItem,
+)
+from docling_core.types.doc.labels import CodeLanguageLabel
+from PIL import Image
+from pydantic import BaseModel
+
+from docling.datamodel.base_models import ItemAndImageEnrichmentElement
+from docling.datamodel.pipeline_options import AcceleratorOptions
+from docling.models.base_model import BaseItemAndImageEnrichmentModel
+from docling.utils.accelerator_utils import decide_device
+
+
+class CodeFormulaModelOptions(BaseModel):
+    """
+    Configuration options for the CodeFormulaModel.
+
+    Attributes
+    ----------
+    kind : str
+        Type of the model. Fixed value "code_formula".
+    do_code_enrichment : bool
+        True if code enrichment is enabled, False otherwise.
+    do_formula_enrichment : bool
+        True if formula enrichment is enabled, False otherwise.
+    """
+
+    kind: Literal["code_formula"] = "code_formula"
+    do_code_enrichment: bool = True
+    do_formula_enrichment: bool = True
+
+
+class CodeFormulaModel(BaseItemAndImageEnrichmentModel):
+    """
+    Model for processing and enriching documents with code and formula predictions.
+
+    Attributes
+    ----------
+    enabled : bool
+        True if the model is enabled, False otherwise.
+    options : CodeFormulaModelOptions
+        Configuration options for the CodeFormulaModel.
+    code_formula_model : CodeFormulaPredictor
+        The predictor model for code and formula processing.
+
+    Methods
+    -------
+    __init__(self, enabled, artifacts_path, accelerator_options, code_formula_options)
+        Initializes the CodeFormulaModel with the given configuration options.
+    is_processable(self, doc, element)
+        Determines if a given element in a document can be processed by the model.
+    __call__(self, doc, element_batch)
+        Processes the given batch of elements and enriches them with predictions.
+    """
+
+    images_scale = 1.66  # = 120 dpi, aligned with training data resolution
+    expansion_factor = 0.03
+
+    def __init__(
+        self,
+        enabled: bool,
+        artifacts_path: Optional[Union[Path, str]],
+        options: CodeFormulaModelOptions,
+        accelerator_options: AcceleratorOptions,
+    ):
+        """
+        Initializes the CodeFormulaModel with the given configuration.
+
+        Parameters
+        ----------
+        enabled : bool
+            True if the model is enabled, False otherwise.
+        artifacts_path : Path
+            Path to the directory containing the model artifacts.
+        options : CodeFormulaModelOptions
+            Configuration options for the model.
+        accelerator_options : AcceleratorOptions
+            Options specifying the device and number of threads for acceleration.
+        """
+        self.enabled = enabled
+        self.options = options
+
+        if self.enabled:
+            device = decide_device(accelerator_options.device)
+
+            from docling_ibm_models.code_formula_model.code_formula_predictor import (
+                CodeFormulaPredictor,
+            )
+
+            if artifacts_path is None:
+                artifacts_path = self.download_models_hf()
+            else:
+                artifacts_path = Path(artifacts_path)
+
+            self.code_formula_model = CodeFormulaPredictor(
+                artifacts_path=artifacts_path,
+                device=device,
+                num_threads=accelerator_options.num_threads,
+            )
+
+    @staticmethod
+    def download_models_hf(
+        local_dir: Optional[Path] = None, force: bool = False
+    ) -> Path:
+        from huggingface_hub import snapshot_download
+        from huggingface_hub.utils import disable_progress_bars
+
+        disable_progress_bars()
+        download_path = snapshot_download(
+            repo_id="ds4sd/CodeFormula",
+            force_download=force,
+            local_dir=local_dir,
+            revision="v1.0.0",
+        )
+
+        return Path(download_path)
+
+    def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:
+        """
+        Determines if a given element in a document can be processed by the model.
+
+        Parameters
+        ----------
+        doc : DoclingDocument
+            The document being processed.
+        element : NodeItem
+            The element within the document to check.
+
+        Returns
+        -------
+        bool
+            True if the element can be processed, False otherwise.
+        """
+        return self.enabled and (
+            (isinstance(element, CodeItem) and self.options.do_code_enrichment)
+            or (
+                isinstance(element, TextItem)
+                and element.label == DocItemLabel.FORMULA
+                and self.options.do_formula_enrichment
+            )
+        )
+
+    def _extract_code_language(self, input_string: str) -> Tuple[str, Optional[str]]:
+        """Extracts a programming language from the beginning of a string.
+
+        This function checks if the input string starts with a pattern of the form
+        ``<_some_language_>``. If it does, it extracts the language string and returns
+        a tuple of (remainder, language). Otherwise, it returns the original string
+        and `None`.
+
+        Args:
+            input_string (str): The input string, which may start with ``<_language_>``.
+
+        Returns:
+            Tuple[str, Optional[str]]:
+                A tuple where:
+                - The first element is either:
+                    - The remainder of the string (everything after ``<_language_>``),
+                      if a match is found; or
+                    - The original string, if no match is found.
+                - The second element is the extracted language if a match is found;
+                  otherwise, `None`.
+        """
+        pattern = r"^<_([^>]+)_>\s*(.*)"
+        match = re.match(pattern, input_string, flags=re.DOTALL)
+        if match:
+            language = str(match.group(1))  # the captured programming language
+            remainder = str(match.group(2))  # everything after the <_language_>
+            return remainder, language
+        else:
+            return input_string, None
+
+    def _get_code_language_enum(self, value: Optional[str]) -> CodeLanguageLabel:
+        """
+        Converts a string to a corresponding `CodeLanguageLabel` enum member.
+
+        If the provided string does not match any value in `CodeLanguageLabel`,
+        it defaults to `CodeLanguageLabel.UNKNOWN`.
+
+        Args:
+            value (Optional[str]): The string representation of the code language or None.
+
+        Returns:
+            CodeLanguageLabel: The corresponding enum member if the value is valid,
+            otherwise `CodeLanguageLabel.UNKNOWN`.
+        """
+        if not isinstance(value, str):
+            return CodeLanguageLabel.UNKNOWN
+
+        try:
+            return CodeLanguageLabel(value)
+        except ValueError:
+            return CodeLanguageLabel.UNKNOWN
+
+    def __call__(
+        self,
+        doc: DoclingDocument,
+        element_batch: Iterable[ItemAndImageEnrichmentElement],
+    ) -> Iterable[NodeItem]:
+        """
+        Processes the given batch of elements and enriches them with predictions.
+
+        Parameters
+        ----------
+        doc : DoclingDocument
+            The document being processed.
+        element_batch : Iterable[ItemAndImageEnrichmentElement]
+            A batch of elements to be processed.
+
+        Returns
+        -------
+        Iterable[Any]
+            An iterable of enriched elements.
+        """
+        if not self.enabled:
+            for element in element_batch:
+                yield element.item
+            return
+
+        labels: List[str] = []
+        images: List[Image.Image] = []
+        elements: List[TextItem] = []
+        for el in element_batch:
+            assert isinstance(el.item, TextItem)
+            elements.append(el.item)
+            labels.append(el.item.label)
+            images.append(el.image)
+
+        outputs = self.code_formula_model.predict(images, labels)
+
+        for item, output in zip(elements, outputs):
+            if isinstance(item, CodeItem):
+                output, code_language = self._extract_code_language(output)
+                item.code_language = self._get_code_language_enum(code_language)
+            item.text = output
+
+            yield item
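A hedged usage sketch for switching the new enrichment on from user code, assuming the `do_code_enrichment` and `do_formula_enrichment` flags that the pipeline wiring later in this diff reads from the PDF pipeline options; the input file name is illustrative.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_code_enrichment = True
pipeline_options.do_formula_enrichment = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("paper_with_code.pdf")
# CodeItem elements now carry item.code_language alongside the recognized item.text
```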
187 docling/models/document_picture_classifier.py Normal file
@ -0,0 +1,187 @@
+from pathlib import Path
+from typing import Iterable, List, Literal, Optional, Tuple, Union
+
+from docling_core.types.doc import (
+    DoclingDocument,
+    NodeItem,
+    PictureClassificationClass,
+    PictureClassificationData,
+    PictureItem,
+)
+from PIL import Image
+from pydantic import BaseModel
+
+from docling.datamodel.pipeline_options import AcceleratorOptions
+from docling.models.base_model import BaseEnrichmentModel
+from docling.utils.accelerator_utils import decide_device
+
+
+class DocumentPictureClassifierOptions(BaseModel):
+    """
+    Options for configuring the DocumentPictureClassifier.
+
+    Attributes
+    ----------
+    kind : Literal["document_picture_classifier"]
+        Identifier for the type of classifier.
+    """
+
+    kind: Literal["document_picture_classifier"] = "document_picture_classifier"
+
+
+class DocumentPictureClassifier(BaseEnrichmentModel):
+    """
+    A model for classifying pictures in documents.
+
+    This class enriches document pictures with predicted classifications
+    based on a predefined set of classes.
+
+    Attributes
+    ----------
+    enabled : bool
+        Whether the classifier is enabled for use.
+    options : DocumentPictureClassifierOptions
+        Configuration options for the classifier.
+    document_picture_classifier : DocumentPictureClassifierPredictor
+        The underlying prediction model, loaded if the classifier is enabled.
+
+    Methods
+    -------
+    __init__(enabled, artifacts_path, options, accelerator_options)
+        Initializes the classifier with specified configurations.
+    is_processable(doc, element)
+        Checks if the given element can be processed by the classifier.
+    __call__(doc, element_batch)
+        Processes a batch of elements and adds classification annotations.
+    """
+
+    images_scale = 2
+
+    def __init__(
+        self,
+        enabled: bool,
+        artifacts_path: Optional[Union[Path, str]],
+        options: DocumentPictureClassifierOptions,
+        accelerator_options: AcceleratorOptions,
+    ):
+        """
+        Initializes the DocumentPictureClassifier.
+
+        Parameters
+        ----------
+        enabled : bool
+            Indicates whether the classifier is enabled.
+        artifacts_path : Optional[Union[Path, str]],
+            Path to the directory containing model artifacts.
+        options : DocumentPictureClassifierOptions
+            Configuration options for the classifier.
+        accelerator_options : AcceleratorOptions
+            Options for configuring the device and parallelism.
+        """
+        self.enabled = enabled
+        self.options = options
+
+        if self.enabled:
+            device = decide_device(accelerator_options.device)
+            from docling_ibm_models.document_figure_classifier_model.document_figure_classifier_predictor import (
+                DocumentFigureClassifierPredictor,
+            )
+
+            if artifacts_path is None:
+                artifacts_path = self.download_models_hf()
+            else:
+                artifacts_path = Path(artifacts_path)
+
+            self.document_picture_classifier = DocumentFigureClassifierPredictor(
+                artifacts_path=artifacts_path,
+                device=device,
+                num_threads=accelerator_options.num_threads,
+            )
+
+    @staticmethod
+    def download_models_hf(
+        local_dir: Optional[Path] = None, force: bool = False
+    ) -> Path:
+        from huggingface_hub import snapshot_download
+        from huggingface_hub.utils import disable_progress_bars
+
+        disable_progress_bars()
+        download_path = snapshot_download(
+            repo_id="ds4sd/DocumentFigureClassifier",
+            force_download=force,
+            local_dir=local_dir,
+            revision="v1.0.0",
+        )
+
+        return Path(download_path)
+
+    def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:
+        """
+        Determines if the given element can be processed by the classifier.
+
+        Parameters
+        ----------
+        doc : DoclingDocument
+            The document containing the element.
+        element : NodeItem
+            The element to be checked.
+
+        Returns
+        -------
+        bool
+            True if the element is a PictureItem and processing is enabled; False otherwise.
+        """
+        return self.enabled and isinstance(element, PictureItem)
+
+    def __call__(
+        self,
+        doc: DoclingDocument,
+        element_batch: Iterable[NodeItem],
+    ) -> Iterable[NodeItem]:
+        """
+        Processes a batch of elements and enriches them with classification predictions.
+
+        Parameters
+        ----------
+        doc : DoclingDocument
+            The document containing the elements to be processed.
+        element_batch : Iterable[NodeItem]
+            A batch of pictures to classify.
+
+        Returns
+        -------
+        Iterable[NodeItem]
+            An iterable of NodeItem objects after processing. The field
+            'data.classification' is added containing the classification for each picture.
+        """
+        if not self.enabled:
+            for element in element_batch:
+                yield element
+            return
+
+        images: List[Image.Image] = []
+        elements: List[PictureItem] = []
+        for el in element_batch:
+            assert isinstance(el, PictureItem)
+            elements.append(el)
+            img = el.get_image(doc)
+            assert img is not None
+            images.append(img)
+
+        outputs = self.document_picture_classifier.predict(images)
+
+        for element, output in zip(elements, outputs):
+            element.annotations.append(
+                PictureClassificationData(
+                    provenance="DocumentPictureClassifier",
+                    predicted_classes=[
+                        PictureClassificationClass(
+                            class_name=pred[0],
+                            confidence=pred[1],
+                        )
+                        for pred in output
+                    ],
+                )
+            )
+
+            yield element
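Similarly, a minimal sketch of enabling the classifier and reading its annotations back; `do_picture_classification` is the flag consumed by the pipeline wiring later in this diff, and the annotation structure follows `PictureClassificationData` above. File name and constructor arguments are illustrative.

```python
from docling_core.types.doc import PictureClassificationData, PictureItem

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

opts = PdfPipelineOptions(do_picture_classification=True, generate_picture_images=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
doc = converter.convert("slides.pdf").document

for item, _level in doc.iterate_items():
    if isinstance(item, PictureItem):
        for annotation in item.annotations:
            if isinstance(annotation, PictureClassificationData):
                # ranked list of predicted classes for this picture
                top = annotation.predicted_classes[0]
                print(item.self_ref, top.class_name, top.confidence)
```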
@ -1,28 +1,21 @@
 import copy
 import logging
-import random
-import time
 from pathlib import Path
-from typing import Iterable, List
+from typing import Iterable
 
-from docling_core.types.doc import CoordOrigin, DocItemLabel
+from docling_core.types.doc import DocItemLabel
 from docling_ibm_models.layoutmodel.layout_predictor import LayoutPredictor
-from PIL import Image, ImageDraw, ImageFont
+from PIL import Image
 
-from docling.datamodel.base_models import (
-    BoundingBox,
-    Cell,
-    Cluster,
-    LayoutPrediction,
-    Page,
-)
+from docling.datamodel.base_models import BoundingBox, Cluster, LayoutPrediction, Page
 from docling.datamodel.document import ConversionResult
-from docling.datamodel.pipeline_options import AcceleratorDevice, AcceleratorOptions
+from docling.datamodel.pipeline_options import AcceleratorOptions
 from docling.datamodel.settings import settings
 from docling.models.base_model import BasePageModel
 from docling.utils.accelerator_utils import decide_device
 from docling.utils.layout_postprocessor import LayoutPostprocessor
 from docling.utils.profiling import TimeRecorder
+from docling.utils.visualization import draw_clusters
 
 _log = logging.getLogger(__name__)
 
@ -40,7 +33,7 @@ class LayoutModel(BasePageModel):
         DocItemLabel.PAGE_FOOTER,
         DocItemLabel.CODE,
         DocItemLabel.LIST_ITEM,
-        # "Formula",
+        DocItemLabel.FORMULA,
     ]
     PAGE_HEADER_LABELS = [DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER]
 
@ -82,78 +75,9 @@ class LayoutModel(BasePageModel):
         left_image = copy.deepcopy(page.image)
         right_image = copy.deepcopy(page.image)
 
-        # Function to draw clusters on an image
-        def draw_clusters(image, clusters):
-            draw = ImageDraw.Draw(image, "RGBA")
-            # Create a smaller font for the labels
-            try:
-                font = ImageFont.truetype("arial.ttf", 12)
-            except OSError:
-                # Fallback to default font if arial is not available
-                font = ImageFont.load_default()
-            for c_tl in clusters:
-                all_clusters = [c_tl, *c_tl.children]
-                for c in all_clusters:
-                    # Draw cells first (underneath)
-                    cell_color = (0, 0, 0, 40)  # Transparent black for cells
-                    for tc in c.cells:
-                        cx0, cy0, cx1, cy1 = tc.bbox.as_tuple()
-                        cx0 *= scale_x
-                        cx1 *= scale_x
-                        cy0 *= scale_x
-                        cy1 *= scale_y
-
-                        draw.rectangle(
-                            [(cx0, cy0), (cx1, cy1)],
-                            outline=None,
-                            fill=cell_color,
-                        )
-                    # Draw cluster rectangle
-                    x0, y0, x1, y1 = c.bbox.as_tuple()
-                    x0 *= scale_x
-                    x1 *= scale_x
-                    y0 *= scale_x
-                    y1 *= scale_y
-
-                    cluster_fill_color = (*list(DocItemLabel.get_color(c.label)), 70)
-                    cluster_outline_color = (
-                        *list(DocItemLabel.get_color(c.label)),
-                        255,
-                    )
-                    draw.rectangle(
-                        [(x0, y0), (x1, y1)],
-                        outline=cluster_outline_color,
-                        fill=cluster_fill_color,
-                    )
-                    # Add label name and confidence
-                    label_text = f"{c.label.name} ({c.confidence:.2f})"
-                    # Create semi-transparent background for text
-                    text_bbox = draw.textbbox((x0, y0), label_text, font=font)
-                    text_bg_padding = 2
-                    draw.rectangle(
-                        [
-                            (
-                                text_bbox[0] - text_bg_padding,
-                                text_bbox[1] - text_bg_padding,
-                            ),
-                            (
-                                text_bbox[2] + text_bg_padding,
-                                text_bbox[3] + text_bg_padding,
-                            ),
-                        ],
-                        fill=(255, 255, 255, 180),  # Semi-transparent white
-                    )
-                    # Draw text
-                    draw.text(
-                        (x0, y0),
-                        label_text,
-                        fill=(0, 0, 0, 255),  # Solid black
-                        font=font,
-                    )
-
         # Draw clusters on both images
-        draw_clusters(left_image, left_clusters)
-        draw_clusters(right_image, right_clusters)
+        draw_clusters(left_image, left_clusters, scale_x, scale_y)
+        draw_clusters(right_image, right_clusters, scale_x, scale_y)
         # Combine the images side by side
         combined_width = left_image.width * 2
         combined_height = left_image.height
@ -22,7 +22,7 @@ _log = logging.getLogger(__name__)
 
 
 class PageAssembleOptions(BaseModel):
-    keep_images: bool = False
+    pass
 
 
 class PageAssembleModel(BasePageModel):
@ -135,31 +135,6 @@ class PageAssembleModel(BasePageModel):
                                 )
                             elements.append(fig)
                             body.append(fig)
-                        elif cluster.label == LayoutModel.FORMULA_LABEL:
-                            equation = None
-                            if page.predictions.equations_prediction:
-                                equation = page.predictions.equations_prediction.equation_map.get(
-                                    cluster.id, None
-                                )
-                            if (
-                                not equation
-                            ):  # fallback: add empty formula, if it isn't present
-                                text = self.sanitize_text(
-                                    [
-                                        cell.text.replace("\x02", "-").strip()
-                                        for cell in cluster.cells
-                                        if len(cell.text.strip()) > 0
-                                    ]
-                                )
-                                equation = TextElement(
-                                    label=cluster.label,
-                                    id=cluster.id,
-                                    cluster=cluster,
-                                    page_no=page.page_no,
-                                    text=text,
-                                )
-                            elements.append(equation)
-                            body.append(equation)
                         elif cluster.label in LayoutModel.CONTAINER_LABELS:
                             container_el = ContainerElement(
                                 label=cluster.label,
@ -174,11 +149,4 @@ class PageAssembleModel(BasePageModel):
                         elements=elements, headers=headers, body=body
                     )
 
-                    # Remove page images (can be disabled)
-                    if not self.options.keep_images:
-                        page._image_cache = {}
-
-                    # Unload backend
-                    page._backend.unload()
-
                     yield page
@ -59,6 +59,7 @@ class RapidOcrModel(BaseOcrModel):
                 det_model_path=self.options.det_model_path,
                 cls_model_path=self.options.cls_model_path,
                 rec_model_path=self.options.rec_model_path,
+                rec_keys_path=self.options.rec_keys_path,
             )
 
     def __call__(
@ -209,12 +209,16 @@ class TableStructureModel(BasePageModel):
                             tc.bbox = tc.bbox.scaled(1 / self.scale)
                         table_cells.append(tc)
 
+                    assert "predict_details" in table_out
+
                     # Retrieving cols/rows, after post processing:
-                    num_rows = table_out["predict_details"]["num_rows"]
-                    num_cols = table_out["predict_details"]["num_cols"]
-                    otsl_seq = table_out["predict_details"]["prediction"][
-                        "rs_seq"
-                    ]
+                    num_rows = table_out["predict_details"].get("num_rows", 0)
+                    num_cols = table_out["predict_details"].get("num_cols", 0)
+                    otsl_seq = (
+                        table_out["predict_details"]
+                        .get("prediction", {})
+                        .get("rs_seq", [])
+                    )
 
                     tbl = Table(
                         otsl_seq=otsl_seq,
@ -4,7 +4,7 @@ import logging
 import os
 import tempfile
 from subprocess import DEVNULL, PIPE, Popen
-from typing import Iterable, Optional, Tuple
+from typing import Iterable, List, Optional, Tuple
 
 import pandas as pd
 from docling_core.types.doc import BoundingBox, CoordOrigin
@ -14,13 +14,13 @@ from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import TesseractCliOcrOptions
 from docling.datamodel.settings import settings
 from docling.models.base_ocr_model import BaseOcrModel
+from docling.utils.ocr_utils import map_tesseract_script
 from docling.utils.profiling import TimeRecorder
 
 _log = logging.getLogger(__name__)
 
 
 class TesseractOcrCliModel(BaseOcrModel):
 
     def __init__(self, enabled: bool, options: TesseractCliOcrOptions):
         super().__init__(enabled=enabled, options=options)
         self.options: TesseractCliOcrOptions
@ -29,10 +29,13 @@ class TesseractOcrCliModel(BaseOcrModel):
 
         self._name: Optional[str] = None
         self._version: Optional[str] = None
+        self._tesseract_languages: Optional[List[str]] = None
+        self._script_prefix: Optional[str] = None
 
         if self.enabled:
             try:
                 self._get_name_and_version()
+                self._set_languages_and_prefix()
 
             except Exception as exc:
                 raise RuntimeError(
@ -74,12 +77,20 @@ class TesseractOcrCliModel(BaseOcrModel):
         return name, version
 
     def _run_tesseract(self, ifilename: str):
+        r"""
+        Run tesseract CLI
+        """
         cmd = [self.options.tesseract_cmd]
 
-        if self.options.lang is not None and len(self.options.lang) > 0:
+        if "auto" in self.options.lang:
+            lang = self._detect_language(ifilename)
+            if lang is not None:
+                cmd.append("-l")
+                cmd.append(lang)
+        elif self.options.lang is not None and len(self.options.lang) > 0:
             cmd.append("-l")
             cmd.append("+".join(self.options.lang))
 
         if self.options.path is not None:
             cmd.append("--tessdata-dir")
             cmd.append(self.options.path)
@ -107,6 +118,63 @@ class TesseractOcrCliModel(BaseOcrModel):
 
         return df_filtered
 
+    def _detect_language(self, ifilename: str):
+        r"""
+        Run tesseract in PSM 0 mode to detect the language
+        """
+        assert self._tesseract_languages is not None
+
+        cmd = [self.options.tesseract_cmd]
+        cmd.extend(["--psm", "0", "-l", "osd", ifilename, "stdout"])
+        _log.info("command: {}".format(" ".join(cmd)))
+        proc = Popen(cmd, stdout=PIPE, stderr=DEVNULL)
+        output, _ = proc.communicate()
+        decoded_data = output.decode("utf-8")
+        df = pd.read_csv(
+            io.StringIO(decoded_data), sep=":", header=None, names=["key", "value"]
+        )
+        scripts = df.loc[df["key"] == "Script"].value.tolist()
+        if len(scripts) == 0:
+            _log.warning("Tesseract cannot detect the script of the page")
+            return None
+
+        script = map_tesseract_script(scripts[0].strip())
+        lang = f"{self._script_prefix}{script}"
+
+        # Check if the detected language has been installed
+        if lang not in self._tesseract_languages:
+            msg = f"Tesseract detected the script '{script}' and language '{lang}'."
+            msg += " However this language is not installed in your system and will be ignored."
+            _log.warning(msg)
+            return None
+
+        _log.debug(
+            f"Using tesseract model for the detected script '{script}' and language '{lang}'"
+        )
+        return lang
+
+    def _set_languages_and_prefix(self):
+        r"""
+        Read and set the languages installed in tesseract and decide the script prefix
+        """
+        # Get all languages
+        cmd = [self.options.tesseract_cmd]
+        cmd.append("--list-langs")
+        _log.info("command: {}".format(" ".join(cmd)))
+        proc = Popen(cmd, stdout=PIPE, stderr=DEVNULL)
+        output, _ = proc.communicate()
+        decoded_data = output.decode("utf-8")
+        df = pd.read_csv(io.StringIO(decoded_data), header=None)
+        self._tesseract_languages = df[0].tolist()[1:]
+
+        # Decide the script prefix
+        if any([l.startswith("script/") for l in self._tesseract_languages]):
+            script_prefix = "script/"
+        else:
+            script_prefix = ""
+
+        self._script_prefix = script_prefix
+
     def __call__(
         self, conv_res: ConversionResult, page_batch: Iterable[Page]
     ) -> Iterable[Page]:
@ -121,7 +189,6 @@ class TesseractOcrCliModel(BaseOcrModel):
                 yield page
             else:
                 with TimeRecorder(conv_res, "ocr"):
-
                     ocr_rects = self.get_ocr_rects(page)
 
                     all_ocr_cells = []
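A small usage sketch of the automatic language detection added here: passing `lang=["auto"]` makes the CLI model run the osd pass first and pick an installed script model. The option fields match this diff; the file name is illustrative.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

ocr_options = TesseractCliOcrOptions(lang=["auto"])
pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("scanned_multilingual.pdf")
```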
@ -8,6 +8,7 @@ from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import TesseractOcrOptions
 from docling.datamodel.settings import settings
 from docling.models.base_ocr_model import BaseOcrModel
+from docling.utils.ocr_utils import map_tesseract_script
 from docling.utils.profiling import TimeRecorder
 
 _log = logging.getLogger(__name__)
@ -20,6 +21,7 @@ class TesseractOcrModel(BaseOcrModel):
 
         self.scale = 3  # multiplier for 72 dpi == 216 dpi.
         self.reader = None
+        self.osd_reader = None
 
         if self.enabled:
             install_errmsg = (
@ -47,27 +49,38 @@ class TesseractOcrModel(BaseOcrModel):
             except:
                 raise ImportError(install_errmsg)
 
-            _, tesserocr_languages = tesserocr.get_languages()
-            if not tesserocr_languages:
+            _, self._tesserocr_languages = tesserocr.get_languages()
+            if not self._tesserocr_languages:
                 raise ImportError(missing_langs_errmsg)
 
             # Initialize the tesseractAPI
             _log.debug("Initializing TesserOCR: %s", tesseract_version)
             lang = "+".join(self.options.lang)
 
+            self.script_readers: dict[str, tesserocr.PyTessBaseAPI] = {}
+
+            if any([l.startswith("script/") for l in self._tesserocr_languages]):
+                self.script_prefix = "script/"
+            else:
+                self.script_prefix = ""
+
+            tesserocr_kwargs = {
+                "psm": tesserocr.PSM.AUTO,
+                "init": True,
+                "oem": tesserocr.OEM.DEFAULT,
+            }
+
             if self.options.path is not None:
-                self.reader = tesserocr.PyTessBaseAPI(
-                    path=self.options.path,
-                    lang=lang,
-                    psm=tesserocr.PSM.AUTO,
-                    init=True,
-                    oem=tesserocr.OEM.DEFAULT,
+                tesserocr_kwargs["path"] = self.options.path
+
+            if lang == "auto":
+                self.reader = tesserocr.PyTessBaseAPI(**tesserocr_kwargs)
+                self.osd_reader = tesserocr.PyTessBaseAPI(
+                    **{"lang": "osd", "psm": tesserocr.PSM.OSD_ONLY} | tesserocr_kwargs
                 )
             else:
                 self.reader = tesserocr.PyTessBaseAPI(
-                    lang=lang,
-                    psm=tesserocr.PSM.AUTO,
-                    init=True,
-                    oem=tesserocr.OEM.DEFAULT,
+                    **{"lang": lang} | tesserocr_kwargs,
                 )
             self.reader_RIL = tesserocr.RIL
 
@ -75,11 +88,12 @@ class TesseractOcrModel(BaseOcrModel):
         if self.reader is not None:
             # Finalize the tesseractAPI
             self.reader.End()
+        for script in self.script_readers:
+            self.script_readers[script].End()
 
     def __call__(
         self, conv_res: ConversionResult, page_batch: Iterable[Page]
     ) -> Iterable[Page]:
 
         if not self.enabled:
             yield from page_batch
             return
@ -90,8 +104,8 @@ class TesseractOcrModel(BaseOcrModel):
                 yield page
             else:
                 with TimeRecorder(conv_res, "ocr"):
-
                     assert self.reader is not None
+                    assert self._tesserocr_languages is not None
 
                     ocr_rects = self.get_ocr_rects(page)
 
@ -104,22 +118,56 @@ class TesseractOcrModel(BaseOcrModel):
                             scale=self.scale, cropbox=ocr_rect
                         )
 
-                        # Retrieve text snippets with their bounding boxes
-                        self.reader.SetImage(high_res_image)
-                        boxes = self.reader.GetComponentImages(
+                        local_reader = self.reader
+                        if "auto" in self.options.lang:
+                            assert self.osd_reader is not None
+
+                            self.osd_reader.SetImage(high_res_image)
+                            osd = self.osd_reader.DetectOrientationScript()
+
+                            # No text, probably
+                            if osd is None:
+                                continue
+
+                            script = osd["script_name"]
+                            script = map_tesseract_script(script)
+                            lang = f"{self.script_prefix}{script}"
+
+                            # Check if the detected languge is present in the system
+                            if lang not in self._tesserocr_languages:
+                                msg = f"Tesseract detected the script '{script}' and language '{lang}'."
+                                msg += " However this language is not installed in your system and will be ignored."
+                                _log.warning(msg)
+                            else:
+                                if script not in self.script_readers:
+                                    import tesserocr
+
+                                    self.script_readers[script] = (
+                                        tesserocr.PyTessBaseAPI(
+                                            path=self.reader.GetDatapath(),
+                                            lang=lang,
+                                            psm=tesserocr.PSM.AUTO,
+                                            init=True,
+                                            oem=tesserocr.OEM.DEFAULT,
+                                        )
+                                    )
+                                local_reader = self.script_readers[script]
+
+                        local_reader.SetImage(high_res_image)
+                        boxes = local_reader.GetComponentImages(
                             self.reader_RIL.TEXTLINE, True
                         )
 
                         cells = []
                         for ix, (im, box, _, _) in enumerate(boxes):
                             # Set the area of interest. Tesseract uses Bottom-Left for the origin
-                            self.reader.SetRectangle(
+                            local_reader.SetRectangle(
                                 box["x"], box["y"], box["w"], box["h"]
                             )
 
                             # Extract text within the bounding box
-                            text = self.reader.GetUTF8Text().strip()
-                            confidence = self.reader.MeanTextConf()
+                            text = local_reader.GetUTF8Text().strip()
+                            confidence = local_reader.MeanTextConf()
                             left = box["x"] / self.scale
                             bottom = box["y"] / self.scale
                             right = (box["x"] + box["w"]) / self.scale
@ -3,7 +3,7 @@ import logging
 import time
 import traceback
 from abc import ABC, abstractmethod
-from typing import Callable, Iterable, List
+from typing import Any, Callable, Iterable, List
 
 from docling_core.types.doc import DoclingDocument, NodeItem
 
@ -18,7 +18,7 @@ from docling.datamodel.base_models import (
 from docling.datamodel.document import ConversionResult, InputDocument
 from docling.datamodel.pipeline_options import PipelineOptions
 from docling.datamodel.settings import settings
-from docling.models.base_model import BaseEnrichmentModel
+from docling.models.base_model import GenericEnrichmentModel
 from docling.utils.profiling import ProfilingScope, TimeRecorder
 from docling.utils.utils import chunkify
 
@ -28,8 +28,9 @@ _log = logging.getLogger(__name__)
 class BasePipeline(ABC):
     def __init__(self, pipeline_options: PipelineOptions):
         self.pipeline_options = pipeline_options
+        self.keep_images = False
         self.build_pipe: List[Callable] = []
-        self.enrichment_pipe: List[BaseEnrichmentModel] = []
+        self.enrichment_pipe: List[GenericEnrichmentModel[Any]] = []
 
     def execute(self, in_doc: InputDocument, raises_on_error: bool) -> ConversionResult:
         conv_res = ConversionResult(input=in_doc)
@ -40,7 +41,7 @@ class BasePipeline(ABC):
                 conv_res, "pipeline_total", scope=ProfilingScope.DOCUMENT
             ):
                 # These steps are building and assembling the structure of the
-                # output DoclingDocument
+                # output DoclingDocument.
                 conv_res = self._build_document(conv_res)
                 conv_res = self._assemble_document(conv_res)
                 # From this stage, all operations should rely only on conv_res.output
@ -50,6 +51,8 @@ class BasePipeline(ABC):
             conv_res.status = ConversionStatus.FAILURE
             if raises_on_error:
                 raise e
+        finally:
+            self._unload(conv_res)
 
         return conv_res
 
@ -62,21 +65,22 @@ class BasePipeline(ABC):
 
     def _enrich_document(self, conv_res: ConversionResult) -> ConversionResult:
 
-        def _filter_elements(
-            doc: DoclingDocument, model: BaseEnrichmentModel
+        def _prepare_elements(
+            conv_res: ConversionResult, model: GenericEnrichmentModel[Any]
         ) -> Iterable[NodeItem]:
-            for element, _level in doc.iterate_items():
-                if model.is_processable(doc=doc, element=element):
-                    yield element
+            for doc_element, _level in conv_res.document.iterate_items():
+                prepared_element = model.prepare_element(
+                    conv_res=conv_res, element=doc_element
+                )
+                if prepared_element is not None:
+                    yield prepared_element
 
         with TimeRecorder(conv_res, "doc_enrich", scope=ProfilingScope.DOCUMENT):
             for model in self.enrichment_pipe:
                 for element_batch in chunkify(
-                    _filter_elements(conv_res.document, model),
+                    _prepare_elements(conv_res, model),
                     settings.perf.elements_batch_size,
                 ):
-                    # TODO: currently we assume the element itself is modified, because
-                    # we don't have an interface to save the element back to the document
                     for element in model(
                         doc=conv_res.document, element_batch=element_batch
                     ):  # Must exhaust!
@ -88,6 +92,9 @@ class BasePipeline(ABC):
     def _determine_status(self, conv_res: ConversionResult) -> ConversionStatus:
         pass
 
+    def _unload(self, conv_res: ConversionResult):
+        pass
+
     @classmethod
     @abstractmethod
     def get_default_options(cls) -> PipelineOptions:
@ -107,6 +114,10 @@ class BasePipeline(ABC):
 
 class PaginatedPipeline(BasePipeline):  # TODO this is a bad name.
 
+    def __init__(self, pipeline_options: PipelineOptions):
+        super().__init__(pipeline_options)
+        self.keep_backend = False
+
     def _apply_on_pages(
         self, conv_res: ConversionResult, page_batch: Iterable[Page]
     ) -> Iterable[Page]:
@ -130,7 +141,9 @@ class PaginatedPipeline(BasePipeline):  # TODO this is a bad name.
         with TimeRecorder(conv_res, "doc_build", scope=ProfilingScope.DOCUMENT):
 
             for i in range(0, conv_res.input.page_count):
-                conv_res.pages.append(Page(page_no=i))
+                start_page, end_page = conv_res.input.limits.page_range
+                if (start_page - 1) <= i <= (end_page - 1):
+                    conv_res.pages.append(Page(page_no=i))
 
             try:
                 # Iterate batches of pages (page_batch_size) in the doc
@ -148,7 +161,14 @@ class PaginatedPipeline(BasePipeline):  # TODO this is a bad name.
                     pipeline_pages = self._apply_on_pages(conv_res, init_pages)
 
                     for p in pipeline_pages:  # Must exhaust!
-                        pass
+                        # Cleanup cached images
+                        if not self.keep_images:
+                            p._image_cache = {}
+
+                        # Cleanup page backends
+                        if not self.keep_backend and p._backend is not None:
+                            p._backend.unload()
 
                     end_batch_time = time.monotonic()
                     total_elapsed_time += end_batch_time - start_batch_time
@ -177,10 +197,15 @@ class PaginatedPipeline(BasePipeline):  # TODO this is a bad name.
                 )
                 raise e
 
-        finally:
-            # Always unload the PDF backend, even in case of failure
-            if conv_res.input._backend:
-                conv_res.input._backend.unload()
+        return conv_res
+
+    def _unload(self, conv_res: ConversionResult) -> ConversionResult:
+        for page in conv_res.pages:
+            if page._backend is not None:
+                page._backend.unload()
+
+        if conv_res.input._backend:
+            conv_res.input._backend.unload()
 
         return conv_res
@ -18,6 +18,11 @@ from docling.datamodel.pipeline_options import (
     TesseractOcrOptions,
 )
 from docling.models.base_ocr_model import BaseOcrModel
+from docling.models.code_formula_model import CodeFormulaModel, CodeFormulaModelOptions
+from docling.models.document_picture_classifier import (
+    DocumentPictureClassifier,
+    DocumentPictureClassifierOptions,
+)
 from docling.models.ds_glm_model import GlmModel, GlmOptions
 from docling.models.easyocr_model import EasyOcrModel
 from docling.models.layout_model import LayoutModel
@ -50,7 +55,7 @@ class StandardPdfPipeline(PaginatedPipeline):
         else:
             self.artifacts_path = Path(pipeline_options.artifacts_path)
 
-        keep_images = (
+        self.keep_images = (
             self.pipeline_options.generate_page_images
             or self.pipeline_options.generate_picture_images
             or self.pipeline_options.generate_table_images
@ -87,13 +92,37 @@ class StandardPdfPipeline(PaginatedPipeline):
                 accelerator_options=pipeline_options.accelerator_options,
             ),
             # Page assemble
-            PageAssembleModel(options=PageAssembleOptions(keep_images=keep_images)),
+            PageAssembleModel(options=PageAssembleOptions()),
         ]
 
         self.enrichment_pipe = [
             # Other models working on `NodeItem` elements in the DoclingDocument
+            # Code Formula Enrichment Model
+            CodeFormulaModel(
+                enabled=pipeline_options.do_code_enrichment
+                or pipeline_options.do_formula_enrichment,
+                artifacts_path=pipeline_options.artifacts_path,
+                options=CodeFormulaModelOptions(
+                    do_code_enrichment=pipeline_options.do_code_enrichment,
+                    do_formula_enrichment=pipeline_options.do_formula_enrichment,
+                ),
+                accelerator_options=pipeline_options.accelerator_options,
+            ),
+            # Document Picture Classifier
+            DocumentPictureClassifier(
+                enabled=pipeline_options.do_picture_classification,
+                artifacts_path=pipeline_options.artifacts_path,
+                options=DocumentPictureClassifierOptions(),
+                accelerator_options=pipeline_options.accelerator_options,
+            ),
         ]
 
+        if (
+            self.pipeline_options.do_formula_enrichment
+            or self.pipeline_options.do_code_enrichment
+        ):
+            self.keep_backend = True
+
     @staticmethod
     def download_models_hf(
         local_dir: Optional[Path] = None, force: bool = False
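Since both new enrichment models expose a `download_models_hf` helper alongside the existing pipeline one, the weights can be prefetched, e.g. for offline environments. A hedged sketch; the target directories are arbitrary and how they are later wired into `artifacts_path` is not claimed here.

```python
from pathlib import Path

from docling.models.code_formula_model import CodeFormulaModel
from docling.models.document_picture_classifier import DocumentPictureClassifier
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

artifacts = Path("./docling-artifacts")
StandardPdfPipeline.download_models_hf(local_dir=artifacts / "pdf")
CodeFormulaModel.download_models_hf(local_dir=artifacts / "code_formula")
DocumentPictureClassifier.download_models_hf(local_dir=artifacts / "figure_classifier")
```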
@ -15,6 +15,7 @@ from docling_core.types.doc import (
     TableCell,
     TableData,
 )
+from docling_core.types.doc.document import ContentLayer
 
 
 def resolve_item(paths, obj):
@ -270,7 +271,6 @@ def to_docling_document(doc_glm, update_name_label=False) -> DoclingDocument:
         container_el = doc.add_group(label=group_label)
 
         _add_child_elements(container_el, doc, obj, pelem)
-
     elif "text" in obj:
         text = obj["text"][span_i:span_j]
 
@ -304,6 +304,14 @@ def to_docling_document(doc_glm, update_name_label=False) -> DoclingDocument:
             current_list = None
 
             doc.add_heading(text=text, prov=prov)
+        elif label == DocItemLabel.CODE:
+            current_list = None
+
+            doc.add_code(text=text, prov=prov)
+        elif label == DocItemLabel.FORMULA:
+            current_list = None
+
+            doc.add_text(label=DocItemLabel.FORMULA, text="", orig=text, prov=prov)
         elif label in [DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER]:
             current_list = None
 
@ -311,7 +319,7 @@ def to_docling_document(doc_glm, update_name_label=False) -> DoclingDocument:
                 label=DocItemLabel(name_label),
                 text=text,
                 prov=prov,
-                parent=doc.furniture,
+                content_layer=ContentLayer.FURNITURE,
             )
         else:
             current_list = None
9
docling/utils/ocr_utils.py
Normal file
9
docling/utils/ocr_utils.py
Normal file
@ -0,0 +1,9 @@
|
|||||||
|
def map_tesseract_script(script: str) -> str:
|
||||||
|
r""" """
|
||||||
|
if script == "Katakana" or script == "Hiragana":
|
||||||
|
script = "Japanese"
|
||||||
|
elif script == "Han":
|
||||||
|
script = "HanS"
|
||||||
|
elif script == "Korean":
|
||||||
|
script = "Hangul"
|
||||||
|
return script
|
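A quick illustration of the new helper; the `script/` prefix in the comment is only an assumption about how callers compose the Tesseract language string:

from docling.utils.ocr_utils import map_tesseract_script

assert map_tesseract_script("Hiragana") == "Japanese"
assert map_tesseract_script("Han") == "HanS"
lang = f"script/{map_tesseract_script('Korean')}"  # assumed usage: -> "script/Hangul"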
80  docling/utils/visualization.py  Normal file
@@ -0,0 +1,80 @@
from docling_core.types.doc import DocItemLabel
from PIL import Image, ImageDraw, ImageFont
from PIL.ImageFont import FreeTypeFont

from docling.datamodel.base_models import Cluster


def draw_clusters(
    image: Image.Image, clusters: list[Cluster], scale_x: float, scale_y: float
) -> None:
    """
    Draw clusters on an image
    """
    draw = ImageDraw.Draw(image, "RGBA")
    # Create a smaller font for the labels
    font: ImageFont.ImageFont | FreeTypeFont
    try:
        font = ImageFont.truetype("arial.ttf", 12)
    except OSError:
        # Fallback to default font if arial is not available
        font = ImageFont.load_default()
    for c_tl in clusters:
        all_clusters = [c_tl, *c_tl.children]
        for c in all_clusters:
            # Draw cells first (underneath)
            cell_color = (0, 0, 0, 40)  # Transparent black for cells
            for tc in c.cells:
                cx0, cy0, cx1, cy1 = tc.bbox.as_tuple()
                cx0 *= scale_x
                cx1 *= scale_x
                cy0 *= scale_x
                cy1 *= scale_y

                draw.rectangle(
                    [(cx0, cy0), (cx1, cy1)],
                    outline=None,
                    fill=cell_color,
                )
            # Draw cluster rectangle
            x0, y0, x1, y1 = c.bbox.as_tuple()
            x0 *= scale_x
            x1 *= scale_x
            y0 *= scale_x
            y1 *= scale_y

            cluster_fill_color = (*list(DocItemLabel.get_color(c.label)), 70)
            cluster_outline_color = (
                *list(DocItemLabel.get_color(c.label)),
                255,
            )
            draw.rectangle(
                [(x0, y0), (x1, y1)],
                outline=cluster_outline_color,
                fill=cluster_fill_color,
            )
            # Add label name and confidence
            label_text = f"{c.label.name} ({c.confidence:.2f})"
            # Create semi-transparent background for text
            text_bbox = draw.textbbox((x0, y0), label_text, font=font)
            text_bg_padding = 2
            draw.rectangle(
                [
                    (
                        text_bbox[0] - text_bg_padding,
                        text_bbox[1] - text_bg_padding,
                    ),
                    (
                        text_bbox[2] + text_bg_padding,
                        text_bbox[3] + text_bg_padding,
                    ),
                ],
                fill=(255, 255, 255, 180),  # Semi-transparent white
            )
            # Draw text
            draw.text(
                (x0, y0),
                label_text,
                fill=(0, 0, 0, 255),  # Solid black
                font=font,
            )
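A minimal sketch of calling the new helper on a page image; the cluster below is built by hand with made-up coordinates (in the pipeline, clusters come from the layout predictions), and the Cluster and BoundingBox field names are assumed from how `draw_clusters` uses them:

from docling_core.types.doc import BoundingBox, DocItemLabel
from PIL import Image

from docling.datamodel.base_models import Cluster
from docling.utils.visualization import draw_clusters

page_img = Image.new("RGB", (800, 1000), "white")  # stand-in for a rendered page image
cluster = Cluster(
    id=0,
    label=DocItemLabel.TEXT,
    bbox=BoundingBox(l=50, t=80, r=750, b=140),
    confidence=0.92,
    cells=[],
)
# scale_x / scale_y translate cluster coordinates into image pixels (1.0 = same space)
draw_clusters(page_img, [cluster], scale_x=1.0, scale_y=1.0)
page_img.save("page_with_clusters.png")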
@@ -54,12 +54,12 @@ tokens), &
 chunks with same headings & captions) — users can opt out of this step via param
 `merge_peers` (by default `True`)

-👉 Example: see [here](../../examples/hybrid_chunking).
+👉 Example: see [here](../examples/hybrid_chunking.ipynb).

 ## Hierarchical Chunker

 The `HierarchicalChunker` implementation uses the document structure information from
-the [`DoclingDocument`](../docling_document) to create one chunk for each individual
+the [`DoclingDocument`](./docling_document.md) to create one chunk for each individual
 detected document element, by default only merging together list items (can be opted out
 via param `merge_list_items`). It also takes care of attaching all relevant document
 metadata, including headers and captions.
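For reference, a minimal sketch of the `HierarchicalChunker` behaviour described in the doc page above (the input path is a placeholder):

from docling.chunking import HierarchicalChunker
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("report.pdf").document  # placeholder source document
chunker = HierarchicalChunker()  # merge_list_items=True by default
for idx, chunk in enumerate(chunker.chunk(doc)):
    # one chunk per detected document element, with headings/captions attached as metadata
    print(f"chunk_{idx}: {chunk.text[:80]}")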
1057  docs/examples/backend_xml_rag.ipynb  Normal file
File diff suppressed because it is too large.
@@ -5,7 +5,11 @@ from pathlib import Path

 from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
 from docling.datamodel.base_models import InputFormat
-from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.datamodel.pipeline_options import (
+    AcceleratorDevice,
+    AcceleratorOptions,
+    PdfPipelineOptions,
+)
 from docling.document_converter import DocumentConverter, PdfFormatOption
 from docling.models.ocr_mac_model import OcrMacOptions
 from docling.models.tesseract_ocr_cli_model import TesseractCliOcrOptions

@@ -76,7 +80,7 @@ def main():
     pipeline_options.table_structure_options.do_cell_matching = True
     pipeline_options.ocr_options.lang = ["es"]
     pipeline_options.accelerator_options = AcceleratorOptions(
-        num_threads=4, device=Device.AUTO
+        num_threads=4, device=AcceleratorDevice.AUTO
     )

     doc_converter = DocumentConverter(
88  docs/examples/develop_formula_understanding.py  Normal file
@@ -0,0 +1,88 @@
import logging
from pathlib import Path
from typing import Iterable

from docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem

from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.models.base_model import BaseItemAndImageEnrichmentModel
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline


class ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions):
    do_formula_understanding: bool = True


# A new enrichment model using both the document element and its image as input
class ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel):
    images_scale = 2.6

    def __init__(self, enabled: bool):
        self.enabled = enabled

    def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:
        return (
            self.enabled
            and isinstance(element, TextItem)
            and element.label == DocItemLabel.FORMULA
        )

    def __call__(
        self,
        doc: DoclingDocument,
        element_batch: Iterable[ItemAndImageEnrichmentElement],
    ) -> Iterable[NodeItem]:
        if not self.enabled:
            return

        for enrich_element in element_batch:
            enrich_element.image.show()

            yield enrich_element.item


# How the pipeline can be extended.
class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):

    def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions):
        super().__init__(pipeline_options)
        self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions

        self.enrichment_pipe = [
            ExampleFormulaUnderstandingEnrichmentModel(
                enabled=self.pipeline_options.do_formula_understanding
            )
        ]

        if self.pipeline_options.do_formula_understanding:
            self.keep_backend = True

    @classmethod
    def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions:
        return ExampleFormulaUnderstandingPipelineOptions()


# Example main. In the final version, we simply have to set do_formula_understanding to true.
def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path("./tests/data/2203.01017v2.pdf")

    pipeline_options = ExampleFormulaUnderstandingPipelineOptions()
    pipeline_options.do_formula_understanding = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ExampleFormulaUnderstandingPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    result = doc_converter.convert(input_doc_path)


if __name__ == "__main__":
    main()

@@ -22,7 +22,6 @@ class ExamplePictureClassifierPipelineOptions(PdfPipelineOptions):


 class ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel):

     def __init__(self, enabled: bool):
         self.enabled = enabled

@@ -54,7 +53,6 @@ class ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel):


 class ExamplePictureClassifierPipeline(StandardPdfPipeline):

     def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions):
         super().__init__(pipeline_options)
         self.pipeline_options: ExamplePictureClassifierPipeline

29  docs/examples/inspect_picture_content.py  Normal file
@@ -0,0 +1,29 @@
from docling_core.types.doc import TextItem

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

source = "tests/data/amt_handbook_sample.pdf"

pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = 2
pipeline_options.generate_page_images = True

doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = doc_converter.convert(source)

doc = result.document

for picture in doc.pictures:
    # picture.get_image(doc).show()  # display the picture
    print(picture.caption_text(doc), " contains these elements:")

    for item, level in doc.iterate_items(root=picture, traverse_pictures=True):
        if isinstance(item, TextItem):
            print(item.text)

    print("\n")

894  docs/examples/rag_azuresearch.ipynb  Normal file
@@ -0,0 +1,894 @@
New notebook; cell contents below, with rendered outputs abridged.

[markdown] Colab badge linking to https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/rag_azuresearch.ipynb

[markdown]
# RAG with Azure AI Search

| Step | Tech | Execution |
| ------------ | --------------- | --------- |
| Embedding | Azure OpenAI | 🌐 Remote |
| Vector Store | Azure AI Search | 🌐 Remote |
| Gen AI | Azure OpenAI | 🌐 Remote |

## A recipe 🧑‍🍳 🐥 💚

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:
- [Docling](https://ds4sd.github.io/docling/) for document parsing and chunking
- [Azure AI Search](https://azure.microsoft.com/products/ai-services/ai-search/?msockid=0109678bea39665431e37323ebff6723) for vector indexing and retrieval
- [Azure OpenAI](https://azure.microsoft.com/products/ai-services/openai-service?msockid=0109678bea39665431e37323ebff6723) for embeddings and chat completion

This sample demonstrates how to:
1. Parse a PDF with Docling.
2. Chunk the parsed text.
3. Use Azure OpenAI for embeddings.
4. Index and search in Azure AI Search.
5. Run a retrieval-augmented generation (RAG) query with Azure OpenAI GPT-4o.

[code]
# If running in a fresh environment (like Google Colab), uncomment and run this single command:
%pip install "docling~=2.12" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv

[markdown]
### Part 0: Prerequisites
- **Azure AI Search** resource
- **Azure OpenAI** resource with a deployed embedding and chat completion model (e.g. `text-embedding-3-small` and `gpt-4o`)
- **Docling 2.12+** installed (Python 3.8+ environment; installs `docling_core` automatically)
- A **GPU-enabled environment** is preferred for faster parsing. Docling 2.12 automatically detects GPU if present. If you only have CPU, parsing large PDFs can be slower.

[code]
import os

from dotenv import load_dotenv

load_dotenv()


def _get_env(key, default=None):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key, default)


AZURE_SEARCH_ENDPOINT = _get_env("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_KEY = _get_env("AZURE_SEARCH_KEY")  # Ensure this is your Admin Key
AZURE_SEARCH_INDEX_NAME = _get_env("AZURE_SEARCH_INDEX_NAME", "docling-rag-sample")
AZURE_OPENAI_ENDPOINT = _get_env("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = _get_env("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_API_VERSION = _get_env("AZURE_OPENAI_API_VERSION", "2024-10-21")
AZURE_OPENAI_CHAT_MODEL = _get_env(
    "AZURE_OPENAI_CHAT_MODEL"
)  # Using a deployed model named "gpt-4o"
AZURE_OPENAI_EMBEDDINGS = _get_env(
    "AZURE_OPENAI_EMBEDDINGS", "text-embedding-3-small"
)  # Using a deployed model named "text-embedding-3-small"

[markdown]
### Part 1: Parse the PDF with Docling

We'll parse the **Microsoft GraphRAG Research Paper** (~15 pages). Parsing should be relatively quick, even on CPU, but it will be faster on a GPU or MPS device if available.

*(If you prefer a different document, simply provide a different URL or local file path.)*

[code]
from rich.console import Console
from rich.panel import Panel

from docling.document_converter import DocumentConverter

console = Console()

# This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages
source_url = "https://arxiv.org/pdf/2404.16130"

console.print(
    "[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]"
)
converter = DocumentConverter()
result = converter.convert(source_url)

# Optional: preview the parsed Markdown
md_preview = result.document.export_to_markdown()
console.print(Panel(md_preview[:500] + "...", title="Docling Markdown Preview"))

[output] A "Docling Markdown Preview" panel showing the start of the parsed paper: the title "From Local to Global: A Graph RAG Approach to Query-Focused Summarization", the author list and Microsoft affiliations, and the opening of the Abstract.

[markdown]
### Part 2: Hierarchical Chunking
We convert the `Document` into smaller chunks for embedding and indexing. The built-in `HierarchicalChunker` preserves structure.

[code]
from docling.chunking import HierarchicalChunker

chunker = HierarchicalChunker()
doc_chunks = list(chunker.chunk(result.document))

all_chunks = []
for idx, c in enumerate(doc_chunks):
    chunk_text = c.text
    all_chunks.append((f"chunk_{idx}", chunk_text))

console.print(f"Total chunks from PDF: {len(all_chunks)}")

[output] Total chunks from PDF: 106

[markdown]
### Part 3: Create Azure AI Search Index and Push Chunk Embeddings
We'll define a vector index in Azure AI Search, then embed each chunk using Azure OpenAI and upload in batches.

[code]
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    HnswAlgorithmConfiguration,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)
from rich.console import Console

console = Console()

VECTOR_DIM = 1536  # Adjust based on your chosen embeddings model

index_client = SearchIndexClient(
    AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY)
)


def create_search_index(index_name: str):
    # Define fields
    fields = [
        SimpleField(name="chunk_id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            filterable=False,
            sortable=False,
            facetable=False,
            vector_search_dimensions=VECTOR_DIM,
            vector_search_profile_name="default",
        ),
    ]
    # Vector search config with an AzureOpenAIVectorizer
    vector_search = VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="default")],
        profiles=[
            VectorSearchProfile(
                name="default",
                algorithm_configuration_name="default",
                vectorizer_name="default",
            )
        ],
        vectorizers=[
            AzureOpenAIVectorizer(
                vectorizer_name="default",
                parameters=AzureOpenAIVectorizerParameters(
                    resource_url=AZURE_OPENAI_ENDPOINT,
                    deployment_name=AZURE_OPENAI_EMBEDDINGS,
                    model_name="text-embedding-3-small",
                    api_key=AZURE_OPENAI_API_KEY,
                ),
            )
        ],
    )

    # Create or update the index
    new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)
    try:
        index_client.delete_index(index_name)
    except:
        pass

    index_client.create_or_update_index(new_index)
    console.print(f"Index '{index_name}' created.")


create_search_index(AZURE_SEARCH_INDEX_NAME)

[output] Index 'docling-rag-sample-2' created.

[markdown]
#### Generate Embeddings and Upload to Azure AI Search

[code]
from azure.search.documents import SearchClient
from openai import AzureOpenAI

search_client = SearchClient(
    AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY)
)
openai_client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
)


def embed_text(text: str):
    """
    Helper to generate embeddings with Azure OpenAI.
    """
    response = openai_client.embeddings.create(
        input=text, model=AZURE_OPENAI_EMBEDDINGS
    )
    return response.data[0].embedding


upload_docs = []
for chunk_id, chunk_text in all_chunks:
    embedding_vector = embed_text(chunk_text)
    upload_docs.append(
        {
            "chunk_id": chunk_id,
            "content": chunk_text,
            "content_vector": embedding_vector,
        }
    )


BATCH_SIZE = 50
for i in range(0, len(upload_docs), BATCH_SIZE):
    subset = upload_docs[i : i + BATCH_SIZE]
    resp = search_client.upload_documents(documents=subset)

    all_succeeded = all(r.succeeded for r in resp)
    console.print(
        f"Uploaded batch {i} -> {i+len(subset)}; all_succeeded: {all_succeeded}, "
        f"first_doc_status_code: {resp[0].status_code}"
    )

console.print("All chunks uploaded to Azure Search.")

[output]
Uploaded batch 0 -> 50; all_succeeded: True, first_doc_status_code: 201
Uploaded batch 50 -> 100; all_succeeded: True, first_doc_status_code: 201
Uploaded batch 100 -> 106; all_succeeded: True, first_doc_status_code: 201
All chunks uploaded to Azure Search.

[markdown]
### Part 4: Perform RAG over PDF
Combine retrieval from Azure AI Search with Azure OpenAI Chat Completions (aka grounding your LLM)

[output] A "RAG Prompt" panel showing the grounding prompt sent to the chat model: instructions to answer only from the supplied text, the retrieved context chunks from the GraphRAG paper (community summaries vs. source texts, token-cost trade-offs of the graph index, related-work discussion, the abstract, the evaluation setup, and the Figure 1 pipeline description), and the question "What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?"
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mquestion answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m(72% win rate) and diversity (62% win rate) over na¨ıve RAG.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mWe have presented a global approach to Graph RAG, combining knowledge graph generation, retrieval-augmented \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mgeneration (RAG), and query-focused summarization (QFS) to support human sensemaking over entire text corpora. \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mInitial evaluations show substantial improvements over a na¨ıve RAG baseline for both the comprehensiveness and\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mdiversity of answers, as well as favorable comparisons to a global but graph-free approach using map-reduce \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31msource text summarization. For situations requiring many global queries over the same dataset, summaries of \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mroot-level communities in the entity-based graph index provide a data index that is both superior to na¨ıve RAG\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mand achieves competitive performance to other global methods at a fraction of the token cost.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mTrade-offs of building a graph index . We consistently observed Graph RAG achieve the best headto-head results \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31magainst other methods, but in many cases the graph-free approach to global summarization of source texts \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mperformed competitively. The real-world decision about whether to invest in building a graph index depends on \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mmultiple factors, including the compute budget, expected number of lifetime queries per dataset, and value \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mobtained from other aspects of the graph index (including the generic community summaries and the use of other \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mgraph-related RAG approaches).\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mFuture work . The graph index, rich text annotations, and hierarchical community structure supporting the \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mcurrent Graph RAG approach offer many possibilities for refinement and adaptation. This includes RAG approaches\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mthat operate in a more local manner, via embedding-based matching of user queries and graph annotations, as \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mwell as the possibility of hybrid RAG schemes that combine embedding-based matching against community reports \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mbefore employing our map-reduce summarization mechanisms. This 'roll-up' operation could also be extended \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31macross more levels of the community hierarchy, as well as implemented as a more exploratory 'drill down' \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mmechanism that follows the information scent contained in higher-level community summaries.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mAdvanced RAG systems include pre-retrieval, retrieval, post-retrieval strategies designed to overcome the \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mdrawbacks of Na¨ıve RAG, while Modular RAG systems include patterns for iterative and dynamic cycles of \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31minterleaved retrieval and generation (Gao et al., 2023). Our implementation of Graph RAG incorporates multiple \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mconcepts related to other systems. For example, our community summaries are a kind of self-memory (Selfmem, \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mCheng et al., 2024) for generation-augmented retrieval (GAR, Mao et al., 2020) that facilitates future \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mgeneration cycles, while our parallel generation of community answers from these summaries is a kind of \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31miterative (Iter-RetGen, Shao et al., 2023) or federated (FeB4RAG, Wang et al., 2024) retrieval-generation \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mstrategy. Other systems have also combined these concepts for multi-document summarization (CAiRE-COVID, Su et \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mal., 2020) and multi-hop question answering (ITRG, Feng et al., 2023; IR-CoT, Trivedi et al., 2022; DSP, \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mKhattab et al., 2022). Our use of a hierarchical index and summarization also bears resemblance to further \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mapproaches, such as generating a hierarchical index of text chunks by clustering the vectors of text embeddings\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m(RAPTOR, Sarthi et al., 2024) or generating a 'tree of clarifications' to answer multiple interpretations of \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mambiguous questions (Kim et al., 2023). However, none of these iterative or hierarchical approaches use the \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mkind of self-generated graph index that enables Graph RAG.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mThe use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31msource enables large language models (LLMs) to answer questions over private and/or previously unseen document \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mcollections. However, RAG fails on global questions directed at an entire text corpus, such as 'What are the \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mmain themes in the dataset?', since this is inherently a queryfocused summarization (QFS) task, rather than an \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mexplicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mtypical RAGsystems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mquestion answering over private text corpora that scales with both the generality of user questions and the \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mquantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mstages: first to derive an entity knowledge graph from the source documents, then to pregenerate community \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31msummaries for all groups of closely-related entities. Given a question, each community summary is used to \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mgenerate a partial response, before all partial responses are again summarized in a final response to the user.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mFor a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mleads to substantial improvements over a na¨ıve RAG baseline for both the comprehensiveness and diversity of \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mgenerated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mforthcoming at https://aka . ms/graphrag .\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mGiven the multi-stage nature of our Graph RAG mechanism, the multiple conditions we wanted to compare, and the \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mlack of gold standard answers to our activity-based sensemaking questions, we decided to adopt a head-to-head \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mcomparison approach using an LLM evaluator. We selected three target metrics capturing qualities that are \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mdesirable for sensemaking activities, as well as a control metric (directness) used as a indicator of validity.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mSince directness is effectively in opposition to comprehensiveness and diversity, we would not expect any \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mmethod to win across all four metrics.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mFigure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This index spans nodes \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m(e.g., entities), edges (e.g., relationships), and covariates (e.g., claims) that have been detected, \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mextracted, and summarized by LLM prompts tailored to the domain of the dataset. Community detection (e.g., \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mLeiden, Traag et al., 2019) is used to partition the graph index into groups of elements (nodes, edges, \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mcovariates) that the LLM can summarize in parallel at both indexing time and query time. The 'global answer' to\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31ma given query is produced using a final round of query-focused summarization over all community summaries \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mreporting relevance to that query.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mRetrieval-augmented generation (RAG, Lewis et al., 2020) is an established approach to answering user questions\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mover entire datasets, but it is designed for situations where these answers are contained locally within \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mregions of text whose retrieval provides sufficient grounding for the generation task. Instead, a more \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mappropriate task framing is query-focused summarization (QFS, Dang, 2006), and in particular, query-focused \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mabstractive summarization that generates natural language summaries and not just concatenated excerpts (Baumel \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31met al., 2018; Laskar et al., 2020; Yao et al., 2017) . In recent years, however, such distinctions between \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31msummarization tasks that are abstractive versus extractive, generic versus query-focused, and single-document \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mversus multi-document, have become less relevant. While early applications of the transformer architecture \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mshowed substantial improvements on the state-of-the-art for all such summarization tasks (Goodwin et al., 2020;\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mLaskar et al., 2022; Liu and Lapata, 2019), these tasks are now trivialized by modern LLMs, including the GPT \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m(Achiam et al., 2023; Brown et al., 2020), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) series,\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mall of which can use in-context learning to summarize any content provided in their context window.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m---\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mcommunity descriptions provide complete coverage of the underlying graph index and the input documents it \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mrepresents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mfirst using each community summary to answer the query independently and in parallel, then summarizing all \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mrelevant partial answers into a final global answer.\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mQuestion: What are the main advantages of using the Graph RAG approach for query-focused summarization compared\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mto traditional RAG methods?\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31mAnswer:\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m│\u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m \u001b[0m\u001b[1;31m│\u001b[0m\n",
|
||||||
|
"\u001b[1;31m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/html": [
|
||||||
|
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">╭─────────────────────────────────────────────────</span> RAG Response <span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">──────────────────────────────────────────────────╮</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ The main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ methods include: │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ 1. **Improved Comprehensiveness and Diversity**: Graph RAG shows substantial improvements over a naïve RAG │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ baseline in terms of the comprehensiveness and diversity of answers. This is particularly beneficial for global │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ sensemaking questions over large datasets. │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ 2. **Scalability**: Graph RAG provides scalability advantages, achieving efficient summarization with │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ significantly fewer context tokens required. For instance, it requires 26-33% fewer tokens for low-level │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ community summaries and over 97% fewer tokens for root-level summaries compared to source text summarization. │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ 3. **Efficiency in Iterative Question Answering**: Root-level Graph RAG offers a highly efficient method for │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ iterative question answering, which is crucial for sensemaking activities, with only a modest drop in │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ performance compared to other global methods. │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ 4. **Global Query Handling**: It supports handling global queries effectively, as it combines knowledge graph │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ generation, retrieval-augmented generation, and query-focused summarization, making it suitable for sensemaking │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ over entire text corpora. │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ 5. **Hierarchical Indexing and Summarization**: The use of a hierarchical index and summarization allows for │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ efficient processing and summarizing of community summaries into a final global answer, facilitating a │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ comprehensive coverage of the underlying graph index and input documents. │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ 6. **Reduced Token Cost**: For situations requiring many global queries over the same dataset, Graph RAG │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">│ achieves competitive performance to other global methods at a fraction of the token cost. │</span>\n",
|
||||||
|
"<span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯</span>\n",
|
||||||
|
"</pre>\n"
|
||||||
|
],
|
||||||
|
"text/plain": [
|
||||||
|
"\u001b[1;32m╭─\u001b[0m\u001b[1;32m────────────────────────────────────────────────\u001b[0m RAG Response \u001b[1;32m─────────────────────────────────────────────────\u001b[0m\u001b[1;32m─╮\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mThe main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mmethods include:\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m1. **Improved Comprehensiveness and Diversity**: Graph RAG shows substantial improvements over a naïve RAG \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mbaseline in terms of the comprehensiveness and diversity of answers. This is particularly beneficial for global\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32msensemaking questions over large datasets.\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m2. **Scalability**: Graph RAG provides scalability advantages, achieving efficient summarization with \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32msignificantly fewer context tokens required. For instance, it requires 26-33% fewer tokens for low-level \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mcommunity summaries and over 97% fewer tokens for root-level summaries compared to source text summarization.\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m3. **Efficiency in Iterative Question Answering**: Root-level Graph RAG offers a highly efficient method for \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32miterative question answering, which is crucial for sensemaking activities, with only a modest drop in \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mperformance compared to other global methods.\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m4. **Global Query Handling**: It supports handling global queries effectively, as it combines knowledge graph \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mgeneration, retrieval-augmented generation, and query-focused summarization, making it suitable for sensemaking\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mover entire text corpora.\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m5. **Hierarchical Indexing and Summarization**: The use of a hierarchical index and summarization allows for \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mefficient processing and summarizing of community summaries into a final global answer, facilitating a \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32mcomprehensive coverage of the underlying graph index and input documents.\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m6. **Reduced Token Cost**: For situations requiring many global queries over the same dataset, Graph RAG \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m│\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32machieves competitive performance to other global methods at a fraction of the token cost.\u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m \u001b[0m\u001b[1;32m│\u001b[0m\n",
|
||||||
|
"\u001b[1;32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "display_data"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"from azure.search.documents.models import VectorizableTextQuery\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"def generate_chat_response(prompt: str, system_message: str = None):\n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" Generates a single-turn chat response using Azure OpenAI Chat.\n",
|
||||||
|
" If you need multi-turn conversation or follow-up queries, you'll have to\n",
|
||||||
|
" maintain the messages list externally.\n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" messages = []\n",
|
||||||
|
" if system_message:\n",
|
||||||
|
" messages.append({\"role\": \"system\", \"content\": system_message})\n",
|
||||||
|
" messages.append({\"role\": \"user\", \"content\": prompt})\n",
|
||||||
|
"\n",
|
||||||
|
" completion = openai_client.chat.completions.create(\n",
|
||||||
|
" model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7\n",
|
||||||
|
" )\n",
|
||||||
|
" return completion.choices[0].message.content\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"user_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\"\n",
|
||||||
|
"user_embed = embed_text(user_query)\n",
|
||||||
|
"\n",
|
||||||
|
"vector_query = VectorizableTextQuery(\n",
|
||||||
|
" text=user_query, # passing in text for a hybrid search\n",
|
||||||
|
" k_nearest_neighbors=5,\n",
|
||||||
|
" fields=\"content_vector\",\n",
|
||||||
|
")\n",
|
||||||
|
"\n",
|
||||||
|
"search_results = search_client.search(\n",
|
||||||
|
" search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10\n",
|
||||||
|
")\n",
|
||||||
|
"\n",
|
||||||
|
"retrieved_chunks = []\n",
|
||||||
|
"for result in search_results:\n",
|
||||||
|
" snippet = result[\"content\"]\n",
|
||||||
|
" retrieved_chunks.append(snippet)\n",
|
||||||
|
"\n",
|
||||||
|
"context_str = \"\\n---\\n\".join(retrieved_chunks)\n",
|
||||||
|
"rag_prompt = f\"\"\"\n",
|
||||||
|
"You are an AI assistant helping answering questions about Microsoft GraphRAG.\n",
|
||||||
|
"Use ONLY the text below to answer the user's question.\n",
|
||||||
|
"If the answer isn't in the text, say you don't know.\n",
|
||||||
|
"\n",
|
||||||
|
"Context:\n",
|
||||||
|
"{context_str}\n",
|
||||||
|
"\n",
|
||||||
|
"Question: {user_query}\n",
|
||||||
|
"Answer:\n",
|
||||||
|
"\"\"\"\n",
|
||||||
|
"\n",
|
||||||
|
"final_answer = generate_chat_response(rag_prompt)\n",
|
||||||
|
"\n",
|
||||||
|
"console.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\"))\n",
|
||||||
|
"console.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\"))"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"accelerator": "GPU",
|
||||||
|
"colab": {
|
||||||
|
"gpuType": "T4",
|
||||||
|
"provenance": []
|
||||||
|
},
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": ".venv",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.12.8"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 0
|
||||||
|
}
|
37
docs/examples/tesseract_lang_detection.py
Normal file
37
docs/examples/tesseract_lang_detection.py
Normal file
@ -0,0 +1,37 @@
|
|||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from docling.datamodel.base_models import InputFormat
|
||||||
|
from docling.datamodel.pipeline_options import (
|
||||||
|
PdfPipelineOptions,
|
||||||
|
TesseractCliOcrOptions,
|
||||||
|
TesseractOcrOptions,
|
||||||
|
)
|
||||||
|
from docling.document_converter import DocumentConverter, PdfFormatOption
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
input_doc = Path("./tests/data/2206.01062.pdf")
|
||||||
|
|
||||||
|
# Set lang=["auto"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions
|
||||||
|
# ocr_options = TesseractOcrOptions(lang=["auto"])
|
||||||
|
ocr_options = TesseractCliOcrOptions(lang=["auto"])
|
||||||
|
|
||||||
|
pipeline_options = PdfPipelineOptions(
|
||||||
|
do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options
|
||||||
|
)
|
||||||
|
|
||||||
|
converter = DocumentConverter(
|
||||||
|
format_options={
|
||||||
|
InputFormat.PDF: PdfFormatOption(
|
||||||
|
pipeline_options=pipeline_options,
|
||||||
|
)
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
doc = converter.convert(input_doc).document
|
||||||
|
md = doc.export_to_markdown()
|
||||||
|
print(md)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
75
docs/examples/translate.py
Normal file
75
docs/examples/translate.py
Normal file
@ -0,0 +1,75 @@
|
|||||||
|
import logging
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem, TextItem
|
||||||
|
|
||||||
|
from docling.datamodel.base_models import FigureElement, InputFormat, Table
|
||||||
|
from docling.datamodel.pipeline_options import PdfPipelineOptions
|
||||||
|
from docling.document_converter import DocumentConverter, PdfFormatOption
|
||||||
|
|
||||||
|
_log = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
IMAGE_RESOLUTION_SCALE = 2.0
|
||||||
|
|
||||||
|
|
||||||
|
# FIXME: put in your favorite translation code ....
|
||||||
|
def translate(text: str, src: str = "en", dest: str = "de"):
|
||||||
|
|
||||||
|
_log.warning("!!! IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!")
|
||||||
|
# from googletrans import Translator
|
||||||
|
|
||||||
|
# Initialize the translator
|
||||||
|
# translator = Translator()
|
||||||
|
|
||||||
|
# Translate text from English to German
|
||||||
|
# text = "Hello, how are you?"
|
||||||
|
# translated = translator.translate(text, src="en", dest="de")
|
||||||
|
|
||||||
|
return text
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
logging.basicConfig(level=logging.INFO)
|
||||||
|
|
||||||
|
input_doc_path = Path("./tests/data/2206.01062.pdf")
|
||||||
|
output_dir = Path("scratch")
|
||||||
|
|
||||||
|
# Important: For operating with page images, we must keep them, otherwise the DocumentConverter
|
||||||
|
# will discard them to free up memory.
|
||||||
|
# This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.
|
||||||
|
# scale=1 corresponds to a standard 72 DPI image
|
||||||
|
# The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched
|
||||||
|
# with the image field
|
||||||
|
pipeline_options = PdfPipelineOptions()
|
||||||
|
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
|
||||||
|
pipeline_options.generate_page_images = True
|
||||||
|
pipeline_options.generate_picture_images = True
|
||||||
|
|
||||||
|
doc_converter = DocumentConverter(
|
||||||
|
format_options={
|
||||||
|
InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
start_time = time.time()
|
||||||
|
|
||||||
|
conv_res = doc_converter.convert(input_doc_path)
|
||||||
|
conv_doc = conv_res.document
doc_filename = input_doc_path.stem  # stem used for the output file names
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
# Save markdown with embedded pictures in original text
|
||||||
|
md_filename = output_dir / f"{doc_filename}-with-images-orig.md"
|
||||||
|
conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)
|
||||||
|
|
||||||
|
for element, _level in conv_res.document.iterate_items():
|
||||||
|
if isinstance(element, TextItem):
|
||||||
|
element.orig = element.text
|
||||||
|
element.text = translate(text=element.text)
|
||||||
|
|
||||||
|
elif isinstance(element, TableItem):
|
||||||
|
for cell in element.data.table_cells:
|
||||||
|
cell.text = translate(text=cell.text)
|
||||||
|
|
||||||
|
# Save markdown with embedded pictures in translated text
|
||||||
|
md_filename = output_dir / f"{doc_filename}-with-images-translated.md"
|
||||||
|
conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)
|
37
docs/faq.md
37
docs/faq.md
@ -7,28 +7,7 @@ This is a collection of FAQ collected from the user questions on <https://github
|
|||||||
|
|
||||||
### Is Python 3.13 supported?
|
### Is Python 3.13 supported?
|
||||||
|
|
||||||
Full support for Python 3.13 is currently waiting for [pytorch](https://github.com/pytorch/pytorch).
|
Python 3.13 is supported from Docling 2.18.0.
|
||||||
|
|
||||||
At the moment, no release has full support, but nightly builds are available. Docling was tested on Python 3.13 with the following steps:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# Create a python 3.13 virtualenv
|
|
||||||
python3.13 -m venv venv
|
|
||||||
source ./venv/bin/activate
|
|
||||||
|
|
||||||
# Install torch nightly builds, see https://pytorch.org/
|
|
||||||
pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
|
|
||||||
|
|
||||||
# Install docling
|
|
||||||
pip3 install docling
|
|
||||||
|
|
||||||
# Run docling
|
|
||||||
docling --no-ocr https://arxiv.org/pdf/2408.09869
|
|
||||||
```
|
|
||||||
|
|
||||||
_Note: we are disabling OCR since easyocr and the nightly torch builds have some conflicts._
|
|
||||||
|
|
||||||
Source: Issue [#136](https://github.com/DS4SD/docling/issues/136)
|
|
||||||
|
|
||||||
|
|
||||||
??? question "Install conflicts with numpy (python 3.13)"
|
??? question "Install conflicts with numpy (python 3.13)"
|
||||||
@ -123,6 +102,12 @@ This is a collection of FAQ collected from the user questions on <https://github
|
|||||||
|
|
||||||
- Update to the latest version of [certifi](https://pypi.org/project/certifi/), i.e. `pip install --upgrade certifi`
|
- Update to the latest version of [certifi](https://pypi.org/project/certifi/), i.e. `pip install --upgrade certifi`
|
||||||
- Use [pip-system-certs](https://pypi.org/project/pip-system-certs/) to use the latest trusted certificates on your system.
|
- Use [pip-system-certs](https://pypi.org/project/pip-system-certs/) to use the latest trusted certificates on your system.
|
||||||
|
- Set environment variables `SSL_CERT_FILE` and `REQUESTS_CA_BUNDLE` to the value of `python -m certifi`:
|
||||||
|
```
|
||||||
|
CERT_PATH=$(python -m certifi)
|
||||||
|
export SSL_CERT_FILE=${CERT_PATH}
|
||||||
|
export REQUESTS_CA_BUNDLE=${CERT_PATH}
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
??? question "Which OCR languages are supported?"
|
??? question "Which OCR languages are supported?"
|
||||||
@ -145,3 +130,11 @@ This is a collection of FAQ collected from the user questions on <https://github
|
|||||||
pipeline_options = PdfPipelineOptions()
|
pipeline_options = PdfPipelineOptions()
|
||||||
pipeline_options.ocr_options.lang = ["fr", "de", "es", "en"] # example of languages for EasyOCR
|
pipeline_options.ocr_options.lang = ["fr", "de", "es", "en"] # example of languages for EasyOCR
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
??? question "Some images are missing from MS Word and Powerpoint"
|
||||||
|
|
||||||
|
### Some images are missing from MS Word and Powerpoint
|
||||||
|
|
||||||
|
The image processing library used by Docling is able to handle embedded WMF images only on the Windows platform.
|
||||||
|
If you are on other operating systems, these images will be ignored.
|
||||||
|
@ -14,21 +14,25 @@
|
|||||||
[](https://opensource.org/licenses/MIT)
|
[](https://opensource.org/licenses/MIT)
|
||||||
[](https://pepy.tech/projects/docling)
|
[](https://pepy.tech/projects/docling)
|
||||||
|
|
||||||
Docling parses documents and exports them to the desired format with ease and speed.
|
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
|
||||||
|
|
||||||
## Features
|
## Features
|
||||||
|
|
||||||
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
|
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
|
||||||
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
|
* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
|
||||||
* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
|
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
|
||||||
* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
|
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
|
||||||
* 🔍 OCR support for scanned PDFs
|
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
|
||||||
|
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
|
||||||
|
* 🔍 Extensive OCR support for scanned PDFs and images
|
||||||
* 💻 Simple and convenient CLI
|
* 💻 Simple and convenient CLI
|
||||||
|
|
||||||
### Coming soon
|
### Coming soon
|
||||||
|
|
||||||
* ♾️ Equation & code extraction
|
|
||||||
* 📝 Metadata extraction, including title, authors, references & language
|
* 📝 Metadata extraction, including title, authors, references & language
|
||||||
|
* 📝 Inclusion of Visual Language Models ([SmolDocling](https://huggingface.co/blog/smolervlm#smoldocling))
|
||||||
|
* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc)
|
||||||
|
* 📝 Complex chemistry understanding (Molecular structures)
|
||||||
|
|
||||||
## Get started
|
## Get started
|
||||||
|
|
||||||
@ -42,3 +46,7 @@ Docling parses documents and exports them to the desired format with ease and sp
|
|||||||
## IBM ❤️ Open Source AI
|
## IBM ❤️ Open Source AI
|
||||||
|
|
||||||
Docling has been brought to you by IBM.
|
Docling has been brought to you by IBM.
|
||||||
|
|
||||||
|
[supported_formats]: ./supported_formats.md
|
||||||
|
[docling_document]: ./concepts/docling_document.md
|
||||||
|
[integrations]: ./integrations/index.md
|
||||||
|
34
docs/supported_formats.md
Normal file
34
docs/supported_formats.md
Normal file
@ -0,0 +1,34 @@
|
|||||||
|
Docling can parse various document formats into a unified representation (Docling
|
||||||
|
Document), which it can export to different formats too — check out
|
||||||
|
[Architecture](./concepts/architecture.md) for more details.
|
||||||
|
|
||||||
|
Below you can find a listing of all supported input and output formats.
|
||||||
|
|
||||||
|
## Supported input formats
|
||||||
|
|
||||||
|
| Format | Description |
|
||||||
|
|--------|-------------|
|
||||||
|
| PDF | |
|
||||||
|
| DOCX, XLSX, PPTX | Default formats in MS Office 2007+, based on Office Open XML |
|
||||||
|
| Markdown | |
|
||||||
|
| AsciiDoc | |
|
||||||
|
| HTML, XHTML | |
|
||||||
|
| PNG, JPEG, TIFF, BMP | Image formats |
|
||||||
|
|
||||||
|
Schema-specific support:
|
||||||
|
|
||||||
|
| Format | Description |
|
||||||
|
|--------|-------------|
|
||||||
|
| USPTO XML | XML format used by [USPTO](https://www.uspto.gov/patents) patents |
|
||||||
|
| PMC XML | XML format used by [PubMed Central®](https://pmc.ncbi.nlm.nih.gov/) articles |
|
||||||
|
| Docling JSON | JSON-serialized [Docling Document](./concepts/docling_document.md) |
|
||||||
|
|
||||||
|
## Supported output formats
|
||||||
|
|
||||||
|
| Format | Description |
|
||||||
|
|--------|-------------|
|
||||||
|
| HTML | Both image embedding and referencing are supported |
|
||||||
|
| Markdown | |
|
||||||
|
| JSON | Lossless serialization of Docling Document |
|
||||||
|
| Text | Plain text, i.e. without Markdown markers |
|
||||||
|
| Doctags | |
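
As a minimal sketch of how these formats come together in practice (the input URL is just an example; Markdown and lossless JSON are produced with the standard `DoclingDocument` export helpers):

```python
import json

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # example input; any supported format works
result = DocumentConverter().convert(source)

# Markdown export
print(result.document.export_to_markdown())

# Lossless JSON export of the Docling Document
with open("output.json", "w", encoding="utf-8") as fp:
    json.dump(result.document.export_to_dict(), fp)
```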
|
@ -126,6 +126,39 @@ result = converter.convert(source)
|
|||||||
You can limit the number of CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, 4 CPU threads are used.
|
You can limit the number of CPU threads used by Docling by setting the environment variable `OMP_NUM_THREADS` accordingly. By default, 4 CPU threads are used.
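
For example, a minimal sketch of doing this from Python (the thread count `8` is just an illustrative value; exporting the variable in your shell before launching works just as well):

```python
import os

# Note: OMP_NUM_THREADS is read when the underlying compute libraries initialize,
# so set it before importing and using Docling.
os.environ["OMP_NUM_THREADS"] = "8"  # illustrative value

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
```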
|
||||||
|
|
||||||
|
|
||||||
|
#### Use specific backend converters
|
||||||
|
|
||||||
|
!!! note
|
||||||
|
|
||||||
|
This section discusses directly invoking a [backend](./concepts/architecture.md),
|
||||||
|
i.e. using a low-level API. This should only be done when necessary. For most cases,
|
||||||
|
using a `DocumentConverter` (high-level API) as discussed in the sections above
|
||||||
|
should suffice — and is the recommended way.
|
||||||
|
|
||||||
|
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of [supported formats](./supported_formats.md)).
|
||||||
|
You can restrict the `DocumentConverter` to a set of allowed document formats, as shown in the [Multi-format conversion](./examples/run_with_formats.py) example.
|
||||||
|
Alternatively, you can also use the specific backend that matches your document content. For instance, you can use `HTMLDocumentBackend` for HTML pages:
|
||||||
|
|
||||||
|
```python
|
||||||
|
import urllib.request
|
||||||
|
from io import BytesIO
|
||||||
|
from docling.backend.html_backend import HTMLDocumentBackend
|
||||||
|
from docling.datamodel.base_models import InputFormat
|
||||||
|
from docling.datamodel.document import InputDocument
|
||||||
|
|
||||||
|
url = "https://en.wikipedia.org/wiki/Duck"
|
||||||
|
text = urllib.request.urlopen(url).read()
|
||||||
|
in_doc = InputDocument(
|
||||||
|
path_or_stream=BytesIO(text),
|
||||||
|
format=InputFormat.HTML,
|
||||||
|
backend=HTMLDocumentBackend,
|
||||||
|
filename="duck.html",
|
||||||
|
)
|
||||||
|
backend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))
|
||||||
|
dl_doc = backend.convert()
|
||||||
|
print(dl_doc.export_to_markdown())
|
||||||
|
```
|
||||||
|
|
||||||
## Chunking
|
## Chunking
|
||||||
|
|
||||||
You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a
|
You can chunk a Docling document using a [chunker](concepts/chunking.md), such as a
|
||||||
|
@ -95,8 +95,8 @@ doc_converter = (
|
|||||||
|
|
||||||
More options are shown in the following example units:
|
More options are shown in the following example units:
|
||||||
|
|
||||||
- [run_with_formats.py](../examples/run_with_formats/)
|
- [run_with_formats.py](examples/run_with_formats.py)
|
||||||
- [custom_convert.py](../examples/custom_convert/)
|
- [custom_convert.py](examples/custom_convert.py)
|
||||||
|
|
||||||
### Converting documents
|
### Converting documents
|
||||||
|
|
||||||
@ -226,4 +226,4 @@ leverages the new `DoclingDocument` and provides a new, richer chunk output form
|
|||||||
- any applicable headings for context
|
- any applicable headings for context
|
||||||
- any applicable captions for context
|
- any applicable captions for context
|
||||||
|
|
||||||
For an example, check out [Chunking usage](../usage/#chunking).
|
For an example, check out [Chunking usage](usage.md#chunking).
|
||||||
12 mkdocs.yml

@@ -56,6 +56,7 @@ nav:
 - "Docling": index.md
 - Installation: installation.md
 - Usage: usage.md
+- Supported formats: supported_formats.md
 - FAQ: faq.md
 - Docling v2: v2.md
 - Concepts:
@@ -75,15 +76,20 @@ nav:
 - "Table export": examples/export_tables.py
 - "Multimodal export": examples/export_multimodal.py
 - "Force full page OCR": examples/full_page_ocr.py
+- "Automatic OCR language detection with tesseract": examples/tesseract_lang_detection.py
 - "Accelerator options": examples/run_with_accelerator.py
+- "Simple translation": examples/translate.py
+- examples/backend_xml_rag.ipynb
 - ✂️ Chunking:
-- "Hybrid chunking": examples/hybrid_chunking.ipynb
+- examples/hybrid_chunking.ipynb
-- 💬 RAG / QA:
+- 🤖 RAG with AI dev frameworks:
 - examples/rag_haystack.ipynb
-- examples/rag_llamaindex.ipynb
 - examples/rag_langchain.ipynb
+- examples/rag_llamaindex.ipynb
+- 🗂️ More examples:
 - examples/rag_weaviate.ipynb
 - RAG with Granite [↗]: https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Docling_RAG.ipynb
+- examples/rag_azuresearch.ipynb
 - examples/retrieval_qdrant.ipynb
 - Integrations:
 - Integrations: integrations/index.md
1844 poetry.lock (generated): file diff suppressed because it is too large.

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "docling"
-version = "2.15.1" # DO NOT EDIT, updated automatically
+version = "2.17.0" # DO NOT EDIT, updated automatically
 description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
 authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Panos Vagenas <pva@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
 license = "MIT"
@@ -25,11 +25,11 @@ packages = [{include = "docling"}]
 # actual dependencies:
 ######################
 python = "^3.9"
-docling-core = { version = "^2.13.1", extras = ["chunking"] }
 pydantic = "^2.0.0"
-docling-ibm-models = "^3.1.0"
+docling-core = {git = "ssh://git@github.com/DS4SD/docling-core.git", rev = "cau/add-content-layer"}
+docling-ibm-models = "^3.3.0"
 deepsearch-glm = "^1.0.0"
-docling-parse = "^3.0.0"
+docling-parse = "^3.1.0"
 filetype = "^1.2.0"
 pypdfium2 = "^4.30.0"
 pydantic-settings = "^2.3.0"
@@ -39,7 +39,10 @@ easyocr = "^1.7"
 tesserocr = { version = "^2.7.1", optional = true }
 certifi = ">=2024.7.4"
 rtree = "^1.3.0"
-scipy = "^1.6.0"
+scipy = [
+    { version = "^1.6.0", markers = "python_version >= '3.10'" },
+    { version = ">=1.6.0,<1.14.0", markers = "python_version < '3.10'" }
+]
 typer = "^0.12.5"
 python-docx = "^1.1.2"
 python-pptx = "^1.0.2"
@@ -56,6 +59,7 @@ onnxruntime = [
     { version = ">=1.7.0,<1.20.0", optional = true, markers = "python_version < '3.10'" },
     { version = "^1.7.0", optional = true, markers = "python_version >= '3.10'" }
 ]
+pillow = "^10.0.0"

 [tool.poetry.group.dev.dependencies]
 black = {extras = ["jupyter"], version = "^24.4.2"}
BIN tests/data/amt_handbook_sample.pdf (Normal file): binary file not shown.
BIN tests/data/code_and_formula.pdf (Normal file): binary file not shown.
BIN tests/data/docx/unit_test_headers_numbered.docx (Normal file): binary file not shown.
BIN tests/data/docx/word_tables.docx (Normal file): binary file not shown.

File diffs suppressed because one or more lines are too long (eight files).
@@ -0,0 +1,25 @@
<document>
<paragraph><location><page_1><loc_12><loc_88><loc_53><loc_94></location>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</paragraph>
<paragraph><location><page_1><loc_12><loc_77><loc_53><loc_86></location>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.</paragraph>
<subtitle-level-1><location><page_1><loc_12><loc_73><loc_28><loc_75></location>Boots Self-Locking Nut</subtitle-level-1>
<paragraph><location><page_1><loc_12><loc_64><loc_54><loc_73></location>The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</paragraph>
<paragraph><location><page_1><loc_12><loc_52><loc_53><loc_62></location>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</paragraph>
<paragraph><location><page_1><loc_12><loc_38><loc_54><loc_50></location>The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</paragraph>
<paragraph><location><page_1><loc_12><loc_33><loc_53><loc_36></location>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</paragraph>
<caption><location><page_1><loc_12><loc_8><loc_31><loc_9></location>Figure 7-26. Self-locking nuts.</caption>
<figure>
<location><page_1><loc_12><loc_10><loc_52><loc_31></location>
<caption>Figure 7-26. Self-locking nuts.</caption>
</figure>
<paragraph><location><page_1><loc_54><loc_85><loc_95><loc_94></location>the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</paragraph>
<paragraph><location><page_1><loc_54><loc_83><loc_55><loc_85></location>.</paragraph>
<subtitle-level-1><location><page_1><loc_54><loc_82><loc_76><loc_83></location>Stainless Steel Self-Locking Nut</subtitle-level-1>
<paragraph><location><page_1><loc_54><loc_54><loc_96><loc_81></location>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</paragraph>
<subtitle-level-1><location><page_1><loc_54><loc_51><loc_65><loc_52></location>Elastic Stop Nut</subtitle-level-1>
<paragraph><location><page_1><loc_54><loc_47><loc_93><loc_50></location>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</paragraph>
<caption><location><page_1><loc_54><loc_8><loc_81><loc_10></location>Figure 7-27. Stainless steel self-locking nut.</caption>
<figure>
<location><page_1><loc_54><loc_11><loc_94><loc_46></location>
<caption>Figure 7-27. Stainless steel self-locking nut.</caption>
</figure>
</document>
File diff suppressed because one or more lines are too long.

31 tests/data/groundtruth/docling_v1/amt_handbook_sample.md (Normal file)

@@ -0,0 +1,31 @@
pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.
The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.
## Boots Self-Locking Nut
The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.
The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.
The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.
Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is
Figure 7-26. Self-locking nuts.
<!-- image -->
the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.
.
## Stainless Steel Self-Locking Nut
The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.
## Elastic Stop Nut
The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This
Figure 7-27. Stainless steel self-locking nut.
<!-- image -->
File diff suppressed because one or more lines are too long.

@@ -0,0 +1,13 @@
<document>
<subtitle-level-1><location><page_1><loc_22><loc_83><loc_45><loc_84></location>Java Code Example</subtitle-level-1>
<paragraph><location><page_1><loc_22><loc_63><loc_78><loc_81></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</paragraph>
<paragraph><location><page_1><loc_39><loc_61><loc_61><loc_62></location>Listing 1: Simple Java Program</paragraph>
<paragraph><location><page_1><loc_22><loc_56><loc_55><loc_60></location>public static void print() { System.out.println( "Java Code" ); }</paragraph>
<paragraph><location><page_1><loc_22><loc_37><loc_78><loc_55></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</paragraph>
<subtitle-level-1><location><page_2><loc_22><loc_84><loc_32><loc_85></location>Formula</subtitle-level-1>
<paragraph><location><page_2><loc_22><loc_65><loc_80><loc_82></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</paragraph>
<paragraph><location><page_2><loc_22><loc_58><loc_80><loc_65></location>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt.</paragraph>
<paragraph><location><page_2><loc_22><loc_38><loc_80><loc_55></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</paragraph>
<paragraph><location><page_2><loc_22><loc_29><loc_80><loc_38></location>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.</paragraph>
<paragraph><location><page_2><loc_22><loc_21><loc_80><loc_29></location>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.</paragraph>
</document>
1 tests/data/groundtruth/docling_v1/code_and_formula.json (Normal file): file diff suppressed because one or more lines are too long.

19 tests/data/groundtruth/docling_v1/code_and_formula.md (Normal file)

@@ -0,0 +1,19 @@
## Java Code Example
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
Listing 1: Simple Java Program
public static void print() { System.out.println( "Java Code" ); }
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
## Formula
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
File diff suppressed because one or more lines are too long.

@@ -0,0 +1,17 @@
<document>
<subtitle-level-1><location><page_1><loc_22><loc_83><loc_41><loc_84></location>Figures Example</subtitle-level-1>
<paragraph><location><page_1><loc_22><loc_63><loc_78><loc_81></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</paragraph>
<caption><location><page_1><loc_37><loc_32><loc_63><loc_33></location>Figure 1: This is an example image.</caption>
<figure>
<location><page_1><loc_22><loc_36><loc_78><loc_62></location>
<caption>Figure 1: This is an example image.</caption>
</figure>
<paragraph><location><page_1><loc_22><loc_15><loc_78><loc_30></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.</paragraph>
<paragraph><location><page_2><loc_22><loc_66><loc_78><loc_84></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</paragraph>
<caption><location><page_2><loc_37><loc_33><loc_63><loc_34></location>Figure 2: This is an example image.</caption>
<figure>
<location><page_2><loc_36><loc_36><loc_64><loc_65></location>
<caption>Figure 2: This is an example image.</caption>
</figure>
<paragraph><location><page_2><loc_22><loc_15><loc_78><loc_31></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.</paragraph>
</document>
File diff suppressed because one or more lines are too long.

15 tests/data/groundtruth/docling_v1/picture_classification.md (Normal file)

@@ -0,0 +1,15 @@
## Figures Example
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
Figure 1: This is an example image.
<!-- image -->
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
Figure 2: This is an example image.
<!-- image -->
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
File diffs suppressed because one or more lines are too long (three files).

@@ -106,12 +106,12 @@
<text><location><page_6><loc_8><loc_70><loc_47><loc_80></location>The output features for each table cell are then fed into the feed-forward network (FFN). The FFN consists of a Multi-Layer Perceptron (3 layers with ReLU activation function) that predicts the normalized coordinates for the bounding box of each table cell. Finally, the predicted bounding boxes are classified based on whether they are empty or not using a linear layer.</text>
<text><location><page_6><loc_8><loc_44><loc_47><loc_69></location>Loss Functions. We formulate a multi-task loss Eq. 2 to train our network. The Cross-Entropy loss (denoted as l$_{s}$ ) is used to train the Structure Decoder which predicts the structure tokens. As for the Cell BBox Decoder it is trained with a combination of losses denoted as l$_{box}$ . l$_{box}$ consists of the generally used l$_{1}$ loss for object detection and the IoU loss ( l$_{iou}$ ) to be scale invariant as explained in [25]. In comparison to DETR, we do not use the Hungarian algorithm [15] to match the predicted bounding boxes with the ground-truth boxes, as we have already achieved a one-toone match through two steps: 1) Our token input sequence is naturally ordered, therefore the hidden states of the table data cells are also in order when they are provided as input to the Cell BBox Decoder , and 2) Our bounding boxes generation mechanism (see Sec. 3) ensures a one-to-one mapping between the cell content and its bounding box for all post-processed datasets.</text>
<text><location><page_6><loc_8><loc_41><loc_47><loc_43></location>The loss used to train the TableFormer can be defined as following:</text>
<formula><location><page_6><loc_20><loc_35><loc_47><loc_38></location>l$_{box}$ = λ$_{iou}$l$_{iou}$ + λ$_{l}$$_{1}$ l = λl$_{s}$ + (1 - λ ) l$_{box}$ (1)</formula>
<formula><location><page_6><loc_20><loc_35><loc_47><loc_38></location></formula>
<text><location><page_6><loc_8><loc_32><loc_46><loc_33></location>where λ ∈ [0, 1], and λ$_{iou}$, λ$_{l}$$_{1}$ ∈$_{R}$ are hyper-parameters.</text>
<section_header_level_1><location><page_6><loc_8><loc_28><loc_28><loc_30></location>5. Experimental Results</section_header_level_1>
<section_header_level_1><location><page_6><loc_8><loc_26><loc_29><loc_27></location>5.1. Implementation Details</section_header_level_1>
<text><location><page_6><loc_8><loc_19><loc_47><loc_25></location>TableFormer uses ResNet-18 as the CNN Backbone Network . The input images are resized to 448*448 pixels and the feature map has a dimension of 28*28. Additionally, we enforce the following input constraints:</text>
<formula><location><page_6><loc_15><loc_14><loc_47><loc_17></location>Image width and height ≤ 1024 pixels Structural tags length ≤ 512 tokens. (2)</formula>
<formula><location><page_6><loc_15><loc_14><loc_47><loc_17></location></formula>
<text><location><page_6><loc_8><loc_10><loc_47><loc_13></location>Although input constraints are used also by other methods, such as EDD, ours are less restrictive due to the improved</text>
<text><location><page_6><loc_50><loc_86><loc_89><loc_91></location>runtime performance and lower memory footprint of TableFormer. This allows to utilize input samples with longer sequences and images with larger dimensions.</text>
<text><location><page_6><loc_50><loc_59><loc_89><loc_85></location>The Transformer Encoder consists of two "Transformer Encoder Layers", with an input feature size of 512, feed forward network of 1024, and 4 attention heads. As for the Transformer Decoder it is composed of four "Transformer Decoder Layers" with similar input and output dimensions as the "Transformer Encoder Layers". Even though our model uses fewer layers and heads than the default implementation parameters, our extensive experimentation has proved this setup to be more suitable for table images. We attribute this finding to the inherent design of table images, which contain mostly lines and text, unlike the more elaborate content present in other scopes (e.g. the COCO dataset). Moreover, we have added ResNet blocks to the inputs of the Structure Decoder and Cell BBox Decoder. This prevents a decoder having a stronger influence over the learned weights which would damage the other prediction task (structure vs bounding boxes), but learn task specific weights instead. Lastly our dropout layers are set to 0.5.</text>
@@ -122,7 +122,7 @@
<text><location><page_6><loc_50><loc_10><loc_89><loc_14></location>We also share our baseline results on the challenging SynthTabNet dataset. Throughout our experiments, the same parameters stated in Sec. 5.1 are utilized.</text>
<section_header_level_1><location><page_7><loc_8><loc_89><loc_27><loc_91></location>5.3. Datasets and Metrics</section_header_level_1>
<text><location><page_7><loc_8><loc_83><loc_47><loc_88></location>The Tree-Edit-Distance-Based Similarity (TEDS) metric was introduced in [37]. It represents the prediction, and ground-truth as a tree structure of HTML tags. This similarity is calculated as:</text>
<formula><location><page_7><loc_14><loc_78><loc_47><loc_81></location>TEDS ( T$_{a}$, T$_{b}$ ) = 1 - EditDist ( T$_{a}$, T$_{b}$ ) max ( | T$_{a}$ | , | T$_{b}$ | ) (3)</formula>
<formula><location><page_7><loc_14><loc_78><loc_47><loc_81></location></formula>
<text><location><page_7><loc_8><loc_73><loc_47><loc_77></location>where T$_{a}$ and T$_{b}$ represent tables in tree structure HTML format. EditDist denotes the tree-edit distance, and | T | represents the number of nodes in T .</text>
<section_header_level_1><location><page_7><loc_8><loc_70><loc_28><loc_72></location>5.4. Quantitative Analysis</section_header_level_1>
<text><location><page_7><loc_8><loc_50><loc_47><loc_69></location>Structure. As shown in Tab. 2, TableFormer outperforms all SOTA methods across different datasets by a large margin for predicting the table structure from an image. All the more, our model outperforms pre-trained methods. During the evaluation we do not apply any table filtering. We also provide our baseline results on the SynthTabNet dataset. It has been observed that large tables (e.g. tables that occupy half of the page or more) yield poor predictions. We attribute this issue to the image resizing during the preprocessing step, that produces downsampled images with indistinguishable features. This problem can be addressed by treating such big tables with a separate model which accepts a large input image size.</text>
@@ -304,7 +304,7 @@
<list_item><location><page_12><loc_8><loc_29><loc_47><loc_33></location>3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.</list_item>
<list_item><location><page_12><loc_8><loc_24><loc_47><loc_28></location>4. Find the best-fitting content alignment for the predicted cells with good IOU per each column. The alignment of the column can be identified by the following formula:</list_item>
</unordered_list>
<formula><location><page_12><loc_18><loc_17><loc_47><loc_21></location>alignment = arg min c { D$_{c}$ } D$_{c}$ = max { x$_{c}$ } - min { x$_{c}$ } (4)</formula>
<formula><location><page_12><loc_18><loc_17><loc_47><loc_21></location></formula>
<text><location><page_12><loc_8><loc_13><loc_47><loc_16></location>where c is one of { left, centroid, right } and x$_{c}$ is the xcoordinate for the corresponding point.</text>
<unordered_list>
<list_item><location><page_12><loc_8><loc_10><loc_47><loc_13></location>5. Use the alignment computed in step 4, to compute the median x -coordinate for all table columns and the me-</list_item>
File diff suppressed because one or more lines are too long.

@@ -52,11 +52,11 @@ To meet the design criteria listed above, we developed a new model called TableF
The paper is structured as follows. In Sec. 2, we give a brief overview of the current state-of-the-art. In Sec. 3, we describe the datasets on which we train. In Sec. 4, we introduce the TableFormer model-architecture and describe
its results & performance in Sec. 5. As a conclusion, we describe how this new model-architecture can be re-purposed for other tasks in the computer-vision community.
## 2. Previous work and State of the Art
Identifying the structure of a table has been an outstanding problem in the document-parsing community, that motivates many organised public challenges [6, 4, 14]. The difficulty of the problem can be attributed to a number of factors. First, there is a large variety in the shapes and sizes of tables. Such large variety requires a flexible method. This is especially true for complex column- and row headers, which can be extremely intricate and demanding. A second factor of complexity is the lack of data with regard to table-structure. Until the publication of PubTabNet [37], there were no large datasets (i.e. > 100 K tables) that provided structure information. This happens primarily due to the fact that tables are notoriously time-consuming to annotate by hand. However, this has definitely changed in recent years with the deliverance of PubTabNet [37], FinTabNet [36], TableBank [17] etc.
Before the rising popularity of deep neural networks, the community relied heavily on heuristic and/or statistical methods to do table structure identification [3, 7, 11, 5, 13, 28]. Although such methods work well on constrained tables [12], a more data-driven approach can be applied due to the advent of convolutional neural networks (CNNs) and the availability of large datasets. To the best-of-our knowledge, there are currently two different types of network architecture that are being pursued for state-of-the-art tablestructure identification.
@@ -115,7 +115,7 @@ Given the image of a table, TableFormer is able to predict: 1) a sequence of tok
## 4.1. Model architecture.
We now describe in detail the proposed method, which is composed of three main components, see Fig. 4. Our CNN Backbone Network encodes the input as a feature vector of predefined length. The input feature vector of the encoded image is passed to the Structure Decoder to produce a sequence of HTML tags that represent the structure of the table. With each prediction of an HTML standard data cell (' < td > ') the hidden state of that cell is passed to the Cell BBox Decoder. As for spanning cells, such as row or column span, the tag is broken down to ' < ', 'rowspan=' or 'colspan=', with the number of spanning cells (attribute), and ' > '. The hidden state attached to ' < ' is passed to the Cell BBox Decoder. A shared feed forward network (FFN) receives the hidden states from the Structure Decoder, to provide the final detection predictions of the bounding box coordinates and their classification.
CNN Backbone Network. A ResNet-18 CNN is the backbone that receives the table image and encodes it as a vector of predefined length. The network has been modified by removing the linear and pooling layer, as we are not per-
@@ -123,7 +123,7 @@ Figure 3: TableFormer takes in an image of the PDF and creates bounding box and
<!-- image -->
Figure 4: Given an input image of a table, the Encoder produces fixed-length features that represent the input image. The features are then passed to both the Structure Decoder and Cell BBox Decoder . During training, the Structure Decoder receives 'tokenized tags' of the HTML code that represent the table structure. Afterwards, a transformer encoder and decoder architecture is employed to produce features that are received by a linear layer, and the Cell BBox Decoder. The linear layer is applied to the features to predict the tags. Simultaneously, the Cell BBox Decoder selects features referring to the data cells (' < td > ', ' < ') and passes them through an attention network, an MLP, and a linear layer to predict the bounding boxes.
<!-- image -->
|
<!-- image -->
|
||||||
|
|
||||||
@ -133,7 +133,7 @@ Structure Decoder. The transformer architecture of this component is based on th

The transformer encoder receives an encoded image from the CNN Backbone Network and refines it through a multi-head dot-product attention layer, followed by a Feed Forward Network. During training, the transformer decoder receives as input the output feature produced by the transformer encoder, and the tokenized input of the HTML ground-truth tags. Using a stack of multi-head attention layers, different aspects of the tag sequence can be inferred. This is achieved by each attention head on a layer operating in a different subspace, and then combining their attention scores.

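To make this encoder-decoder arrangement concrete, here is a minimal PyTorch sketch of the described setup; the layer counts, hidden size, vocabulary size and class name are assumptions for illustration, not the authors' released implementation (causal masking and positional encodings are omitted for brevity).

```python
import torch.nn as nn

class StructureDecoderSketch(nn.Module):
    """Illustrative transformer encoder/decoder over image features and tag tokens."""

    def __init__(self, vocab_size: int = 32, d_model: int = 512):
        super().__init__()
        self.tag_embedding = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.tag_head = nn.Linear(d_model, vocab_size)  # linear layer that predicts the tags

    def forward(self, image_features, tag_tokens):
        # image_features: (batch, 28*28, d_model) flattened CNN feature map.
        memory = self.encoder(image_features)
        # tag_tokens: (batch, seq_len) tokenized HTML ground-truth tags (teacher forcing).
        hidden = self.decoder(self.tag_embedding(tag_tokens), memory)
        # Return tag logits plus the per-token hidden states consumed by the Cell BBox Decoder.
        return self.tag_head(hidden), hidden
```
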
Cell BBox Decoder. Our architecture allows us to simultaneously predict HTML tags and bounding boxes for each table cell, without the need for a separate object detector, end to end. This approach is inspired by DETR [1], which employs a Transformer Encoder and Decoder that looks for a specific number of object queries (potential object detections). As our model utilizes a transformer architecture, the hidden states of the '<td>' and '<' HTML structure tags become the object queries.

The encoding generated by the CNN Backbone Network along with the features acquired for every data cell from the Transformer Decoder are then passed to the attention network. The attention network takes both inputs and learns to provide an attention weighted encoding. This weighted at-

@ -141,13 +141,13 @@ tention encoding is then multiplied to the encoded image to produce a feature fo

The output features for each table cell are then fed into the feed-forward network (FFN). The FFN consists of a Multi-Layer Perceptron (3 layers with ReLU activation function) that predicts the normalized coordinates for the bounding box of each table cell. Finally, the predicted bounding boxes are classified based on whether they are empty or not using a linear layer.

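The following is a minimal sketch of such a prediction head in PyTorch, based on the description above (a 3-layer MLP with ReLU regressing normalized box coordinates, plus a linear empty/non-empty classifier); the hidden size and names are assumptions, not the published TableFormer code.

```python
import torch
import torch.nn as nn

class CellBBoxHead(nn.Module):
    """Sketch of the per-cell prediction head described above (assumed sizes)."""

    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        # 3-layer MLP with ReLU that regresses normalized (cx, cy, w, h) coordinates.
        self.bbox_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),
        )
        # Linear layer classifying each predicted box as empty / non-empty.
        self.empty_classifier = nn.Linear(hidden_dim, 2)

    def forward(self, cell_features: torch.Tensor):
        # cell_features: (num_cells, hidden_dim), one row per '<td>' / '<' hidden state.
        boxes = self.bbox_mlp(cell_features).sigmoid()   # normalized to [0, 1]
        logits = self.empty_classifier(cell_features)    # empty vs. non-empty
        return boxes, logits
```
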
Loss Functions. We formulate a multi-task loss Eq. 2 to train our network. The Cross-Entropy loss (denoted as l$_{s}$) is used to train the Structure Decoder which predicts the structure tokens. As for the Cell BBox Decoder, it is trained with a combination of losses denoted as l$_{box}$. l$_{box}$ consists of the generally used l$_{1}$ loss for object detection and the IoU loss (l$_{iou}$) to be scale invariant as explained in [25]. In comparison to DETR, we do not use the Hungarian algorithm [15] to match the predicted bounding boxes with the ground-truth boxes, as we have already achieved a one-to-one match through two steps: 1) Our token input sequence is naturally ordered, therefore the hidden states of the table data cells are also in order when they are provided as input to the Cell BBox Decoder, and 2) Our bounding boxes generation mechanism (see Sec. 3) ensures a one-to-one mapping between the cell content and its bounding box for all post-processed datasets.

The loss used to train the TableFormer can be defined as follows:

$$ l_{box} = \lambda_{iou}\, l_{iou} + \lambda_{l_{1}}\, l_{1}, \qquad l = \lambda\, l_{s} + (1 - \lambda)\, l_{box} \qquad (1) $$

where λ ∈ [0, 1], and λ$_{iou}$, λ$_{l_1}$ ∈ ℝ are hyper-parameters.

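For illustration, the weighted multi-task objective of Eq. (1) could be assembled as below with generic PyTorch losses; the use of the generalized IoU loss as the IoU term and all weight values are assumptions for the example, not the authors' exact implementation.

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def tableformer_style_loss(tag_logits, tag_targets, pred_boxes, gt_boxes,
                           lam=0.5, lam_iou=2.0, lam_l1=5.0):
    # Structure loss l_s: cross-entropy over the predicted structure tokens.
    l_s = F.cross_entropy(tag_logits, tag_targets)
    # Box losses: l1 plus an IoU-based term (GIoU used here as a stand-in);
    # boxes are assumed to be in (x1, y1, x2, y2) format.
    l_1 = F.l1_loss(pred_boxes, gt_boxes)
    l_iou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    l_box = lam_iou * l_iou + lam_l1 * l_1
    # l = λ · l_s + (1 − λ) · l_box
    return lam * l_s + (1 - lam) * l_box
```
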
## 5. Experimental Results

@ -155,7 +155,7 @@ where λ ∈ [0, 1], and λ$\_{iou}$, λ$\_{l}$$\_{1}$ ∈$\_{R}$ are hyper-para

TableFormer uses ResNet-18 as the CNN Backbone Network. The input images are resized to 448×448 pixels and the feature map has a dimension of 28×28. Additionally, we enforce the following input constraints:

$$ \text{Image width and height} \leq 1024 \text{ pixels}, \qquad \text{Structural tags length} \leq 512 \text{ tokens}. \qquad (2) $$

Although input constraints are used also by other methods, such as EDD, ours are less restrictive due to the improved

@ -177,9 +177,9 @@ We also share our baseline results on the challenging SynthTabNet dataset. Throu

The Tree-Edit-Distance-Based Similarity (TEDS) metric was introduced in [37]. It represents both the prediction and the ground truth as a tree structure of HTML tags. This similarity is calculated as:

$$ \mathrm{TEDS}(T_{a}, T_{b}) = 1 - \frac{\mathrm{EditDist}(T_{a}, T_{b})}{\max(|T_{a}|, |T_{b}|)} \qquad (3) $$

where T$_{a}$ and T$_{b}$ represent tables in tree-structured HTML format. EditDist denotes the tree-edit distance, and |T| represents the number of nodes in T.

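To illustrate how Eq. (3) is applied, the sketch below normalizes a tree-edit distance by the larger tree size; `tree_edit_distance` is an assumed placeholder (an APTED-style routine could be plugged in), so this is not the reference TEDS implementation.

```python
def count_nodes(tree) -> int:
    """Number of nodes in a tree given as (label, [children])."""
    _, children = tree
    return 1 + sum(count_nodes(child) for child in children)

def teds(tree_a, tree_b, tree_edit_distance) -> float:
    """Tree-Edit-Distance-based Similarity as in Eq. (3).

    `tree_edit_distance` is an assumed callable returning the edit distance
    between the two HTML tag trees.
    """
    dist = tree_edit_distance(tree_a, tree_b)
    return 1.0 - dist / max(count_nodes(tree_a), count_nodes(tree_b))
```
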
## 5.4. Quantitative Analysis

@ -277,7 +277,7 @@ Figure 6: An example of TableFormer predictions (bounding boxes and structure) f

We showcase several visualizations for the different components of our network on various "complex" tables within datasets presented in this work in Fig. 5 and Fig. 6. As shown, our model is able to predict bounding boxes for all table cells, even for the empty ones. Additionally, our post-processing techniques can extract the cell content by matching the predicted bounding boxes to the PDF cells based on their overlap and spatial proximity. The left part of Fig. 5 also demonstrates the adaptability of our method to any language, as it can successfully extract Japanese text, although the training set contains only English content. We provide more visualizations, including the intermediate steps, in the supplementary material. Overall, these illustrations demonstrate the versatility of our method across a diverse range of table appearances and content types.

## 6. Future Work & Conclusion

In this paper, we presented TableFormer, an end-to-end transformer-based approach to predict table structures and bounding boxes of cells from an image. This approach enables us to recreate the table structure and extract the cell content from PDF or OCR by using bounding boxes. Additionally, it provides the versatility required in real-world scenarios when dealing with various types of PDF documents and languages. Furthermore, our method outperforms all state-of-the-art methods by a wide margin. Finally, we introduce "SynthTabNet", a challenging synthetically generated dataset that reinforces missing characteristics from other datasets.

@ -377,9 +377,9 @@ Here is a step-by-step description of the prediction postprocessing:

- 3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.

- 4. Find the best-fitting content alignment for the predicted cells with good IOU for each column. The alignment of the column can be identified by the following formula (a short code sketch of this step follows the list):

$$ \text{alignment} = \arg\min_{c} \{ D_{c} \}, \qquad D_{c} = \max\{x_{c}\} - \min\{x_{c}\} \qquad (4) $$

where c is one of {left, centroid, right} and x$_{c}$ is the x-coordinate for the corresponding point.

- 5. Use the alignment computed in step 4 to compute the median x-coordinate for all table columns and the me-

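The sketch below illustrates step 4: it computes the spread D$_{c}$ for each candidate alignment of a column and keeps the one with the smallest spread, following Eq. (4). The box format (x0, y0, x1, y1) and names are assumptions for the example.

```python
def best_column_alignment(boxes):
    """Pick the alignment (left, centroid, right) with the smallest x-spread.

    `boxes` is a list of (x0, y0, x1, y1) cell bounding boxes in one column.
    """
    candidates = {
        "left": [x0 for x0, _, _, _ in boxes],
        "centroid": [(x0 + x1) / 2 for x0, _, x1, _ in boxes],
        "right": [x1 for _, _, x1, _ in boxes],
    }
    # D_c = max{x_c} - min{x_c}; the best alignment minimizes this spread.
    spreads = {name: max(xs) - min(xs) for name, xs in candidates.items()}
    return min(spreads, key=spreads.get)
```
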
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@ -55,7 +55,7 @@ In this paper, we present the DocLayNet dataset. It provides pageby-page layout
This enables experimentation with annotation uncertainty and quality control analysis.
- (5) Pre-defined Train-, Test- & Validation-set : Like DocBank, we provide fixed train-, test- & validation-sets to ensure proportional representation of the class-labels. Further, we prevent leakage of unique layouts across sets, which has a large effect on model accuracy scores (a minimal sketch of such a document-wise split follows below).

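As a minimal sketch of such a document-wise split (as opposed to a naive page-wise split), one could group pages by their source document before splitting, e.g. with scikit-learn; the `doc_id` field name is an assumption, not part of the actual DocLayNet tooling.

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_document(pages, test_size=0.1, seed=42):
    """Split page records so that all pages of a document land in the same subset.

    `pages` is a list of dicts with a 'doc_id' key (assumed field name).
    """
    groups = [page["doc_id"] for page in pages]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(pages, groups=groups))
    return [pages[i] for i in train_idx], [pages[i] for i in test_idx]
```
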
All aspects outlined above are detailed in Section 3. In Section 4, we will elaborate on how we designed and executed this large-scale human annotation campaign. We will also share key insights and lessons learned that might prove helpful for other parties planning to set up annotation campaigns.
@ -77,9 +77,9 @@ Figure 2: Distribution of DocLayNet pages across document categories.
<!-- image -->
to a minimum, since they introduce difficulties in annotation (see Section 4). As a second condition, we focussed on medium to large documents ( > 10 pages) with technical content, dense in complex tables, figures, plots and captions. Such documents carry a lot of information value, but are often hard to analyse with high accuracy due to their challenging layouts. Counterexamples of documents not included in the dataset are receipts, invoices, hand-written documents or photographs showing "text in the wild".
The pages in DocLayNet can be grouped into six distinct categories, namely Financial Reports , Manuals , Scientific Articles , Laws & Regulations , Patents and Government Tenders . Each document category was sourced from various repositories. For example, Financial Reports contain both free-style format annual reports 2 which expose company-specific, artistic layouts as well as the more formal SEC filings. The two largest categories ( Financial Reports and Manuals ) contain a large amount of free-style layouts in order to obtain maximum variability. In the other four categories, we boosted the variability by mixing documents from independent providers, such as different government websites or publishers. In Figure 2, we show the document categories contained in DocLayNet with their respective sizes.
We did not control the document selection with regard to language. The vast majority of documents contained in DocLayNet (close to 95%) are published in English language. However, DocLayNet also contains a number of documents in other languages such as German (2.5%), French (1.0%) and Japanese (1.0%). While the document language has negligible impact on the performance of computer vision methods such as object detection and segmentation models, it might prove challenging for layout analysis methods which exploit textual features.
@ -192,7 +192,7 @@ In Table 2, we present baseline experiments (given in mAP) on Mask R-CNN [12], F
Table 3: Performance of a Mask R-CNN R50 network in mAP@0.5-0.95 scores trained on DocLayNet with different class label sets. The reduced label sets were obtained by either down-mapping or dropping labels.
Table 4: Performance of a Mask R-CNN R50 network with document-wise and page-wise split for different label sets. Naive page-wise split will result in ~10% point improvement.

| Class-count | 11 | 6 | 5 | 4 |
|----------------|------|---------|---------|---------|

@ -243,7 +243,7 @@ Many documents in DocLayNet have a unique styling. In order to avoid overfitting
Throughout this paper, we claim that DocLayNet's wider variety of document layouts leads to more robust layout detection models. In Table 5, we provide evidence for that. We trained models on each of the available datasets (PubLayNet, DocBank and DocLayNet) and evaluated them on the test sets of the other datasets. Due to the different label sets and annotation styles, a direct comparison is not possible. Hence, we focussed on the common labels among the datasets. Between PubLayNet and DocLayNet, these are Picture ,
Table 5: Prediction Performance (mAP@0.5-0.95) of a Mask R-CNN R50 network across the PubLayNet, DocBank & DocLayNet data-sets. By evaluating on common label classes of each dataset, we observe that the DocLayNet-trained model has much less pronounced variations in performance across all datasets.
| | | Testing on | Testing on | Testing on |
|-----------------|------------|--------------|--------------|--------------|

File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@ -38,7 +38,7 @@ Approaches to formalize the logical structure and layout of tables in electronic
Other work [20] aims at predicting a grid for each table and deciding which cells must be merged using an attention network. Im2Seq methods cast the problem as a sequence generation task [4,5,9,22], and therefore need an internal table-structure representation language, which is often implemented with standard markup languages (e.g. HTML, LaTeX, Markdown). In theory, Im2Seq methods have a natural advantage over the OD and GNN methods by virtue of directly predicting the table-structure. As such, no post-processing or rules are needed in order to obtain the table-structure, which is necessary with OD and GNN approaches. In practice, this is not entirely true, because a predicted sequence of table-structure markup does not necessarily have to be syntactically correct. Hence, depending on the quality of the predicted sequence, some post-processing needs to be performed to ensure a syntactically valid (let alone correct) sequence.

Within the Im2Seq method, we find several popular models, namely the encoder-dual-decoder model (EDD) [22], TableFormer [9], TabSplitter [2] and Ye et al. [19]. EDD uses two consecutive long short-term memory (LSTM) decoders to predict a table in HTML representation. The tag decoder predicts a sequence of HTML tags. For each decoded table cell ( <td> ), the attention is passed to the cell decoder to predict the content with an embedded OCR approach. The latter makes it susceptible to transcription errors in the cell content of the table. TableFormer addresses this reliance on OCR and uses two transformer decoders for HTML structure and cell bounding box prediction in an end-to-end architecture. The predicted cell bounding box is then used to extract text tokens from an originating (digital) PDF page, circumventing any need for OCR. TabSplitter [2] proposes a compact double-matrix representation of table rows and columns to do error detection and error correction of HTML structure sequences based on predictions from [19]. This compact double-matrix representation cannot be used directly by the Im2Seq model training, so the model uses HTML as an intermediate form. Chi et al. [4] introduce a data set and a baseline method using bidirectional LSTMs to predict LaTeX code. Kayal [5] introduces Gated ResNet transformers to predict LaTeX code, and a separate OCR module to extract content.

Im2Seq approaches have been shown to be well-suited for the TSR task and allow a full end-to-end network design that can output the final table structure without pre- or post-processing logic. Furthermore, Im2Seq models have been demonstrated to deliver state-of-the-art prediction accuracy [9]. This motivated the authors to investigate whether the performance (both in accuracy and inference time) can be further improved by optimising the table structure representation language. We believe this is a necessary step before further improving neural network architectures for this task.

@ -46,13 +46,13 @@ Im2Seq approaches have shown to be well-suited for the TSR task and allow a full
All known Im2Seq based models for TSR fundamentally work in similar ways. Given an image of a table, the Im2Seq model predicts the structure of the table by generating a sequence of tokens. These tokens originate from a finite vocab-
ulary and can be interpreted as a table structure. For example, with the HTML tokens <table> , </table> , <tr> , </tr> , <td> and </td> , one can construct simple table structures without any spanning cells. In reality though, one needs at least 28 HTML tokens to describe the most common complex tables observed in real-world documents [21,22], due to a variety of spanning cell definitions in the HTML token vocabulary.

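As a small illustration of this token view, the snippet below serializes a span-free grid into exactly the six HTML structure tokens listed above; the function is illustrative and not part of any of the cited models.

```python
def grid_to_html_tokens(num_rows: int, num_cols: int) -> list[str]:
    """Serialize a simple table grid (no spanning cells) into HTML structure tokens."""
    tokens = ["<table>"]
    for _ in range(num_rows):
        tokens.append("<tr>")
        tokens.extend(["<td>", "</td>"] * num_cols)
        tokens.append("</tr>")
    tokens.append("</table>")
    return tokens

# A 2x3 table already needs 1 + 2 * (1 + 2 * 3 + 1) + 1 = 18 structure tokens.
print(grid_to_html_tokens(2, 3))
```
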
Fig. 2. Frequency of tokens in HTML and OTSL as they appear in PubTabNet.
<!-- image -->
Obviously, HTML and other general-purpose markup languages were not designed for Im2Seq models. As such, they have some serious drawbacks. First, the token vocabulary needs to be artificially large in order to describe all plausible tabular structures. Since most Im2Seq models use an autoregressive approach, they generate the sequence token by token. Therefore, to reduce inference time, a shorter sequence length is critical. Every table-cell is represented by at least two tokens ( <td> and </td> ). Furthermore, when tokenizing the HTML structure, one needs to explicitly enumerate possible column-spans and row-spans as words. In practice, this ends up requiring 28 different HTML tokens (when including column- and row-spans up to 10 cells) just to describe every table in the PubTabNet dataset. Clearly, not every token is equally represented, as is depicted in Figure 2. This skewed distribution of tokens in combination with variable token row-length makes it challenging for models to learn the HTML structure.
Additionally, it would be desirable if the representation would easily allow an early detection of invalid sequences on-the-go, before the prediction of the entire table structure is completed. HTML is not well-suited for this purpose as the verification of incomplete sequences is non-trivial or even impossible.
@ -194,7 +194,7 @@ Secondly, OTSL has more inherent structure and a significantly restricted vocabu
- 12. Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). vol. 1, pp. 1162-1167. IEEE (2017)
- 13. Siddiqui, S.A., Fateh, I.A., Rizvi, S.T.R., Dengel, A., Ahmed, S.: Deeptabstr: Deep learning based table structure recognition. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1403-1409 (2019). https://doi.org/10.1109/ICDAR.2019.00226
- 14. Smock, B., Pesala, R., Abraham, R.: PubTables-1M: Towards comprehensive table extraction from unstructured documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4634-4642 (June 2022)
- 15. Staar, P.W.J., Dolfi, M., Auer, C., Bekas, C.: Corpus conversion service: A machine learning platform to ingest documents at scale. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 774-782. KDD '18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3219819.3219834
- 16. Wang, X.: Tabular Abstraction, Editing, and Formatting. Ph.D. thesis, CAN (1996), aAINN09397
- 17. Xue, W., Li, Q., Tao, D.: Res2tim: Reconstruct syntactic structures from table images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 749-755. IEEE (2019)

File diff suppressed because one or more lines are too long
@ -0,0 +1,23 @@
<document>
<text><location><page_1><loc_12><loc_88><loc_53><loc_94></location>pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.</text>
<text><location><page_1><loc_12><loc_77><loc_53><loc_86></location>The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.</text>
<section_header_level_1><location><page_1><loc_12><loc_73><loc_28><loc_75></location>Boots Self-Locking Nut</section_header_level_1>
<text><location><page_1><loc_12><loc_64><loc_54><loc_73></location>The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.</text>
<text><location><page_1><loc_12><loc_52><loc_53><loc_62></location>The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.</text>
<text><location><page_1><loc_12><loc_38><loc_54><loc_50></location>The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.</text>
<text><location><page_1><loc_12><loc_33><loc_53><loc_36></location>Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is</text>
<figure>
<location><page_1><loc_12><loc_10><loc_52><loc_31></location>
<caption>Figure 7-26. Self-locking nuts.</caption>
</figure>
<text><location><page_1><loc_54><loc_85><loc_95><loc_94></location>the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.</text>
<text><location><page_1><loc_54><loc_83><loc_55><loc_85></location>.</text>
<section_header_level_1><location><page_1><loc_54><loc_82><loc_76><loc_83></location>Stainless Steel Self-Locking Nut</section_header_level_1>
<text><location><page_1><loc_54><loc_54><loc_96><loc_81></location>The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.</text>
<section_header_level_1><location><page_1><loc_54><loc_51><loc_65><loc_52></location>Elastic Stop Nut</section_header_level_1>
<text><location><page_1><loc_54><loc_47><loc_93><loc_50></location>The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This</text>
<figure>
<location><page_1><loc_54><loc_11><loc_94><loc_46></location>
<caption>Figure 7-27. Stainless steel self-locking nut.</caption>
</figure>
</document>
File diff suppressed because one or more lines are too long
33
tests/data/groundtruth/docling_v2/amt_handbook_sample.md
Normal file
@ -0,0 +1,33 @@
pulleys, provided the inner race of the bearing is clamped to the supporting structure by the nut and bolt. Plates must be attached to the structure in a positive manner to eliminate rotation or misalignment when tightening the bolts or screws.
The two general types of self-locking nuts currently in use are the all-metal type and the fiber lock type. For the sake of simplicity, only three typical kinds of self-locking nuts are considered in this handbook: the Boots self-locking and the stainless steel self-locking nuts, representing the all-metal types; and the elastic stop nut, representing the fiber insert type.
## Boots Self-Locking Nut
The Boots self-locking nut is of one piece, all-metal construction designed to hold tight despite severe vibration. Note in Figure 7-26 that it has two sections and is essentially two nuts in one: a locking nut and a load-carrying nut. The two sections are connected with a spring, which is an integral part of the nut.
The spring keeps the locking and load-carrying sections such a distance apart that the two sets of threads are out of phase or spaced so that a bolt, which has been screwed through the load-carrying section, must push the locking section outward against the force of the spring to engage the threads of the locking section properly.
The spring, through the medium of the locking section, exerts a constant locking force on the bolt in the same direction as a force that would tighten the nut. In this nut, the load-carrying section has the thread strength of a standard nut of comparable size, while the locking section presses against the threads of the bolt and locks the nut firmly in position. Only a wrench applied to the nut loosens it. The nut can be removed and reused without impairing its efficiency.
Boots self-locking nuts are made with three different spring styles and in various shapes and sizes. The wing type that is
Figure 7-26. Self-locking nuts.
<!-- image -->
the most common ranges in size for No. 6 up to 1 / 4 inch, the Rol-top ranges from 1 / 4 inch to 1 / 6 inch, and the bellows type ranges in size from No. 8 up to 3 / 8 inch. Wing-type nuts are made of anodized aluminum alloy, cadmium-plated carbon steel, or stainless steel. The Rol-top nut is cadmium-plated steel, and the bellows type is made of aluminum alloy only.
.
## Stainless Steel Self-Locking Nut
The stainless steel self-locking nut may be spun on and off by hand as its locking action takes places only when the nut is seated against a solid surface and tightened. The nut consists of two parts: a case with a beveled locking shoulder and key and a thread insert with a locking shoulder and slotted keyway. Until the nut is tightened, it spins on the bolt easily, because the threaded insert is the proper size for the bolt. However, when the nut is seated against a solid surface and tightened, the locking shoulder of the insert is pulled downward and wedged against the locking shoulder of the case. This action compresses the threaded insert and causes it to clench the bolt tightly. The cross-sectional view in Figure 7-27 shows how the key of the case fits into the slotted keyway of the insert so that when the case is turned, the threaded insert is turned with it. Note that the slot is wider than the key. This permits the slot to be narrowed and the insert to be compressed when the nut is tightened.
## Elastic Stop Nut
The elastic stop nut is a standard nut with the height increased to accommodate a fiber locking collar. This
Figure 7-27. Stainless steel self-locking nut.
<!-- image -->
File diff suppressed because one or more lines are too long
33
tests/data/groundtruth/docling_v2/blocks.md.md
Normal file
@ -0,0 +1,33 @@

Unordered list:

- foo

Empty unordered list:

Ordered list:

- bar

Empty ordered list:

Heading:

# my heading

Empty heading:

Indented code block:

```
print("Hi!")
```

Empty indented code block:

Fenced code block:

```
print("Hello world!")
```

Empty fenced code block:
@ -0,0 +1,14 @@
<document>
<section_header_level_1><location><page_1><loc_22><loc_83><loc_45><loc_84></location>Java Code Example</section_header_level_1>
<text><location><page_1><loc_22><loc_63><loc_78><loc_81></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
<paragraph><location><page_1><loc_39><loc_61><loc_61><loc_62></location>Listing 1: Simple Java Program</paragraph>
<code><location><page_1><loc_22><loc_56><loc_55><loc_60></location>public static void print() { System.out.println( "Java Code" ); }</code>
<text><location><page_1><loc_22><loc_37><loc_78><loc_55></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
<section_header_level_1><location><page_2><loc_22><loc_84><loc_32><loc_85></location>Formula</section_header_level_1>
<text><location><page_2><loc_22><loc_65><loc_80><loc_82></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
<text><location><page_2><loc_22><loc_58><loc_80><loc_65></location>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt.</text>
<formula><location><page_2><loc_47><loc_56><loc_56><loc_57></location></formula>
<text><location><page_2><loc_22><loc_38><loc_80><loc_55></location>Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.</text>
<text><location><page_2><loc_22><loc_29><loc_80><loc_38></location>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.</text>
<text><location><page_2><loc_22><loc_21><loc_80><loc_29></location>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.</text>
</document>
Some files were not shown because too many files have changed in this diff.