merged with main

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Peter Staar 2024-11-17 06:03:12 +01:00
commit b0e5154d87
47 changed files with 834 additions and 258 deletions

26
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file

@ -0,0 +1,26 @@
---
name: Bug report
about: Report an issue to help improve Docling
title: ''
labels: bug
assignees: ''
---
### Bug
<!-- Describe the buggy behavior you have observed. -->
...
### Steps to reproduce
<!-- Describe the sequence of steps for reproducing the bug. -->
...
### Docling version
<!-- Copy the output of `docling --version`. -->
...
### Python version
<!-- Copy the output of `python --version`. -->
...
<!-- ⚠️ ATTENTION: When sharing screenshots, attachments, or other data make sure not to include any sensitive information. -->

1
.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@ -0,0 +1 @@
blank_issues_enabled: false


@ -0,0 +1,18 @@
---
name: Feature request
about: Suggest an idea for enhancing Docling
title: ''
labels: enhancement
assignees: ''
---
### Requested feature
<!-- Describe the feature you have in mind and the user need it addresses. -->
...
### Alternatives
<!-- Describe any alternatives you have considered. -->
...
<!-- ⚠️ ATTENTION: When sharing screenshots, attachments, or other data make sure not to include any sensitive information. -->

14
.github/ISSUE_TEMPLATE/question.md vendored Normal file

@ -0,0 +1,14 @@
---
name: Question
about: Ask a question
title: ''
labels: question
assignees: ''
---
### Question
<!-- Describe what you would like to achieve and which part you need help with. -->
...
<!-- ⚠️ ATTENTION: When sharing screenshots, attachments, or other data make sure not to include any sensitive information. -->


@ -3,7 +3,8 @@
<!-- STEPS TO FOLLOW:
1. Add a description of the changes (frequently the same as the commit description)
2. Enter the issue number next to "Resolves #" below (if there is no tracking issue resolved, **remove that section**)
3. Follow the steps in the checklist below, starting with the **Commit Message Formatting**.
3. Make sure the PR title follows the **Commit Message Formatting**: https://www.conventionalcommits.org/en/v1.0.0/#summary.
4. Follow the steps in the checklist below, starting with the **Commit Message Formatting**.
-->
<!-- Uncomment this section with the issue number if an issue is being resolved
@ -13,8 +14,6 @@ Resolves #
**Checklist:**
- [ ] **Commit Message Formatting**: Commit titles and messages follow guidelines in the
[conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary).
- [ ] Documentation has been updated, if necessary.
- [ ] Examples have been added, if necessary.
- [ ] Tests have been added, if necessary.

18
.github/mergify.yml vendored Normal file

@ -0,0 +1,18 @@
merge_protections:
- name: Enforce conventional commit
description: Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
if:
- base = main
success_conditions:
- "title ~=
^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\\(.+\
\\))?:"
- name: Require two reviewers for test updates
description: When test data is updated, we require two reviewers
if:
- base = main
- or:
- files ~= ^tests/data
- files ~= ^tests/data_scanned
success_conditions:
- "#approved-reviews-by >= 2"
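
The title check in the first rule can be tried locally. The following is a minimal sketch, assuming the `~=` operator applies the pattern as a regular-expression search on the PR title; the pattern is transcribed from the rule above with the YAML escaping removed, and the helper name `is_conventional_title` is hypothetical.

```python
import re

# Pattern transcribed from the merge protection rule (YAML escapes removed).
TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit type,
    an optional parenthesised scope, and a colon."""
    return TITLE_RE.search(title) is not None


for title in ("fix(OCR): handle empty pages", "docs: add CLI reference", "merged with main"):
    print(f"{title!r}: {is_conventional_title(title)}")
```

Under this reading, a title such as `merged with main` would not satisfy the rule, while `docs: add CLI reference` would.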


@ -11,15 +11,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install poetry
run: pipx install poetry==1.8.3
shell: bash
- uses: actions/setup-python@v5
with:
cache: 'poetry'
- name: Install dependencies
run: poetry install --only docs
shell: bash
- uses: ./.github/actions/setup-poetry
- name: Build docs
run: poetry run mkdocs build --verbose --clean
- name: Build and push docs


@ -1,3 +1,55 @@
## [v2.5.2](https://github.com/DS4SD/docling/releases/tag/v2.5.2) - 2024-11-13
### Fix
* Skip glm model downloads ([#322](https://github.com/DS4SD/docling/issues/322)) ([`c9341bf`](https://github.com/DS4SD/docling/commit/c9341bf22e08920284cbc14821c190eaf6abf8a6))
## [v2.5.1](https://github.com/DS4SD/docling/releases/tag/v2.5.1) - 2024-11-12
### Fix
* Handling of single-cell tables in DOCX backend ([#314](https://github.com/DS4SD/docling/issues/314)) ([`fb8ba86`](https://github.com/DS4SD/docling/commit/fb8ba861e28eda0079daa44fb1ea3ed17745f1d2))
### Documentation
* Hybrid RAG with Qdrant ([#312](https://github.com/DS4SD/docling/issues/312)) ([`7f5d35e`](https://github.com/DS4SD/docling/commit/7f5d35ea3c225ce1ce7328825842f98755c0104f))
* Add Data Prep Kit integration ([#316](https://github.com/DS4SD/docling/issues/316)) ([`93fc1be`](https://github.com/DS4SD/docling/commit/93fc1be61abfe0669daf26c0984a51ec8675bf62))
## [v2.5.0](https://github.com/DS4SD/docling/releases/tag/v2.5.0) - 2024-11-12
### Feature
* **OCR:** Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning ([#290](https://github.com/DS4SD/docling/issues/290)) ([`c6b3763`](https://github.com/DS4SD/docling/commit/c6b3763ecb6ef862840a30978ee177b907f86505))
### Fix
* Configure env prefix for docling settings ([#315](https://github.com/DS4SD/docling/issues/315)) ([`5d4a10b`](https://github.com/DS4SD/docling/commit/5d4a10b121317fa481208dacbee47032b08ff928))
* Added handling of grouped elements in pptx backend ([#307](https://github.com/DS4SD/docling/issues/307)) ([`81c8243`](https://github.com/DS4SD/docling/commit/81c8243a8bf177feed8f87ea283b5bb6836350cb))
* Allow mps usage for easyocr ([#286](https://github.com/DS4SD/docling/issues/286)) ([`97f214e`](https://github.com/DS4SD/docling/commit/97f214efddcf66f0734a95c17c08936f6111d113))
### Documentation
* Add navigation indices ([#305](https://github.com/DS4SD/docling/issues/305)) ([`1239ade`](https://github.com/DS4SD/docling/commit/1239ade2750349d13d4e865d88449b232bbad944))
## [v2.4.2](https://github.com/DS4SD/docling/releases/tag/v2.4.2) - 2024-11-08
### Fix
* **EasyOcrModel:** Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr ([#282](https://github.com/DS4SD/docling/issues/282)) ([`0eb065e`](https://github.com/DS4SD/docling/commit/0eb065e9b6e4619d4c412ed98bc7408915ca3f95))
## [v2.4.1](https://github.com/DS4SD/docling/releases/tag/v2.4.1) - 2024-11-08
### Fix
* **tesserocr:** Raise Exception if tesserocr has not loaded any languages ([#279](https://github.com/DS4SD/docling/issues/279)) ([`704d792`](https://github.com/DS4SD/docling/commit/704d792a7997c4ca34f9f9045ed4ae02b4f5df47))
* Dockerfile example copy command ([#234](https://github.com/DS4SD/docling/issues/234)) ([`90836db`](https://github.com/DS4SD/docling/commit/90836db90accf4a66c9c20544c98452840e3a308))
### Documentation
* Update badges & credits ([#248](https://github.com/DS4SD/docling/issues/248)) ([`a84ec27`](https://github.com/DS4SD/docling/commit/a84ec276b0997c4ba9b32e18e911a966124dc3bc))
* Add coming-soon section ([#235](https://github.com/DS4SD/docling/issues/235)) ([`5ce02c5`](https://github.com/DS4SD/docling/commit/5ce02c5c598a2efa615ad15f0ead8d752d3ad2ea))
* Add artifacts-path param to CLI ([#233](https://github.com/DS4SD/docling/issues/233)) ([`d5e65ae`](https://github.com/DS4SD/docling/commit/d5e65aedac23d6849c805a0e88dd06f2a285eb18))
## [v2.4.0](https://github.com/DS4SD/docling/releases/tag/v2.4.0) - 2024-11-04
### Feature


@ -14,7 +14,7 @@ RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorc
ENV HF_HOME=/tmp/
ENV TORCH_HOME=/tmp/
COPY examples/minimal.py /root/minimal.py
COPY docs/examples/minimal.py /root/minimal.py
RUN python -c 'from deepsearch_glm.utils.load_pretrained_models import load_pretrained_nlp_models; load_pretrained_nlp_models(verbose=True);'
RUN python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf(force=True);'


@ -6,6 +6,10 @@
# Docling
<p align="center">
<a href="https://trendshift.io/repositories/12132" target="_blank"><img src="https://trendshift.io/api/badge/repositories/12132" alt="DS4SD%2Fdocling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>
[![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ds4sd.github.io/docling/)
[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
@ -19,19 +23,22 @@
Docling parses documents and exports them to the desired format with ease and speed.
## Features
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
* 📑 Advanced PDF document understanding including page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
* 📝 Metadata extraction, including title, authors, references & language
* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
* 🤖 Easy integration with LlamaIndex 🦙 & LangChain 🦜🔗 for powerful RAG / QA applications
* 🔍 OCR support for scanned PDFs
* 💻 Simple and convenient CLI
Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty of examples and unlock the full power of Docling!
### Coming soon
* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language
* 🦜🔗 Native LangChain extension
## Installation
@ -57,16 +64,13 @@ result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
```
Check out [Getting started](https://ds4sd.github.io/docling/).
You will find lots of tuning options to leverage all the advanced capabilities.
## Get help and support
Please feel free to connect with us using the [discussion section](https://github.com/DS4SD/docling/discussions).
## Technical report
For more details on Docling's inner workings, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
@ -75,7 +79,6 @@ For more details on Docling's inner workings, check out the [Docling Technical R
Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.
## References
If you use Docling in your projects, please consider citing the following:
@ -95,5 +98,9 @@ If you use Docling in your projects, please consider citing the following:
## License
The Docling codebase is under MIT license.
The Docling codebase is under MIT license.
For individual model usage, please refer to the model licenses found in the original packages.
## IBM ❤️ Open Source AI
Docling has been brought to you by IBM.


@ -29,7 +29,7 @@ class DoclingParsePageBackend(PdfPageBackend):
self._dpage = parsed_page["pages"][0]
else:
_log.info(
f"An error occured when loading page {page_no} of document {document_hash}."
f"An error occurred when loading page {page_no} of document {document_hash}."
)
def is_valid(self) -> bool:


@ -31,7 +31,7 @@ class DoclingParseV2PageBackend(PdfPageBackend):
self._dpage = parsed_page["pages"][0]
else:
_log.info(
f"An error occured when loading page {page_no} of document {document_hash}."
f"An error occurred when loading page {page_no} of document {document_hash}."
)
def is_valid(self) -> bool:


@ -120,6 +120,8 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
self.handle_header(element, idx, doc)
elif element.name in ["p"]:
self.handle_paragraph(element, idx, doc)
elif element.name in ["pre"]:
self.handle_code(element, idx, doc)
elif element.name in ["ul", "ol"]:
self.handle_list(element, idx, doc)
elif element.name in ["li"]:
@ -205,6 +207,16 @@ class HTMLDocumentBackend(DeclarativeDocumentBackend):
level=hlevel,
)
def handle_code(self, element, idx, doc):
"""Handles monospace code snippets (pre)."""
if element.text is None:
return
text = element.text.strip()
label = DocItemLabel.CODE
if len(text) == 0:
return
doc.add_text(parent=self.parents[self.level], label=label, text=text)
def handle_paragraph(self, element, idx, doc):
"""Handles paragraph tags (p)."""
if element.text is None:
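
The new `handle_code` branch keeps the stripped text of each `<pre>` element and skips empty ones. A rough standalone illustration of that behavior follows. It swaps in the standard library's `html.parser` for the backend's own element objects, and the class name `PreCollector` is hypothetical; it is not the backend's implementation.

```python
from html.parser import HTMLParser


class PreCollector(HTMLParser):
    """Collects the stripped text of every non-empty <pre> block."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # nesting depth inside <pre>
        self._buf = []       # text fragments of the current <pre>
        self.snippets = []   # collected code snippets

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buf).strip()
                self._buf.clear()
                if text:  # skip empty snippets, as handle_code does
                    self.snippets.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


parser = PreCollector()
parser.feed("<p>intro</p><pre>\nprint('hi')\n</pre><pre>   </pre>")
parser.close()
print(parser.snippets)
```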


@ -358,41 +358,36 @@ class MsPowerpointDocumentBackend(DeclarativeDocumentBackend, PaginatedDocumentB
size = Size(width=slide_width, height=slide_height)
parent_page = doc.add_page(page_no=slide_ind + 1, size=size)
# parent_page = doc.add_page(page_no=slide_ind, size=size, hash=hash)
# Loop through each shape in the slide
for shape in slide.shapes:
def handle_shapes(shape, parent_slide, slide_ind, doc):
handle_groups(shape, parent_slide, slide_ind, doc)
if shape.has_table:
# Handle Tables
self.handle_tables(shape, parent_slide, slide_ind, doc)
if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
# Handle Tables
# Handle Pictures
self.handle_pictures(shape, parent_slide, slide_ind, doc)
# If shape doesn't have any text, move on to the next shape
if not hasattr(shape, "text"):
continue
return
if shape.text is None:
continue
return
if len(shape.text.strip()) == 0:
continue
return
if not shape.has_text_frame:
_log.warn("Warning: shape has text but not text_frame")
continue
# if shape.is_placeholder:
# Handle Titles (Headers) and Subtitles
# Check if the shape is a placeholder (titles are placeholders)
# self.handle_title(shape, parent_slide, slide_ind, doc)
# self.handle_text_elements(shape, parent_slide, slide_ind, doc)
# else:
_log.warning("Warning: shape has text but not text_frame")
return
# Handle other text elements, including lists (bullet lists, numbered lists)
self.handle_text_elements(shape, parent_slide, slide_ind, doc)
return
# figures...
# doc.add_figure(data=BaseFigureData(), parent=self.parents[self.level], caption=None)
def handle_groups(shape, parent_slide, slide_ind, doc):
if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
for groupedshape in shape.shapes:
handle_shapes(groupedshape, parent_slide, slide_ind, doc)
# Loop through each shape in the slide
for shape in slide.shapes:
handle_shapes(shape, parent_slide, slide_ind, doc)
return doc
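
The refactoring above turns the flat loop over `slide.shapes` into two mutually recursive helpers so that shapes nested inside groups are also visited. The traversal order can be sketched as follows, with a hypothetical `Shape` dataclass standing in for python-pptx shapes (a non-empty `children` list plays the role of `MSO_SHAPE_TYPE.GROUP`).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape; not the python-pptx class."""
    text: str = ""
    children: List["Shape"] = field(default_factory=list)  # non-empty => group


def collect_texts(shape: Shape, out: List[str]) -> None:
    # Descend into grouped shapes first, mirroring handle_groups.
    for child in shape.children:
        collect_texts(child, out)
    # Then handle the shape's own text, skipping blank text as handle_shapes does.
    if shape.text.strip():
        out.append(shape.text.strip())


slide_shapes = [
    Shape("Title"),
    Shape(children=[Shape("a"), Shape(children=[Shape("b")])]),
    Shape("   "),
]
texts: List[str] = []
for shape in slide_shapes:
    collect_texts(shape, texts)
print(texts)
```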


@ -9,10 +9,12 @@ from docling_core.types.doc import (
DoclingDocument,
DocumentOrigin,
GroupLabel,
ImageRef,
TableCell,
TableData,
)
from lxml import etree
from PIL import Image
from docling.backend.abstract_backend import DeclarativeDocumentBackend
from docling.datamodel.base_models import InputFormat
@ -130,14 +132,8 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
def walk_linear(self, body, docx_obj, doc) -> DoclingDocument:
for element in body:
tag_name = etree.QName(element).localname
# Check for Inline Images (drawings or blip elements)
found_drawing = etree.ElementBase.xpath(
element, ".//w:drawing", namespaces=self.xml_namespaces
)
found_pict = etree.ElementBase.xpath(
element, ".//w:pict", namespaces=self.xml_namespaces
)
# Check for Inline Images (blip elements)
drawing_blip = element.xpath(".//a:blip")
# Check for Tables
if element.tag.endswith("tbl"):
@ -146,8 +142,8 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
except Exception:
_log.debug("could not parse a table, broken docx table")
elif found_drawing or found_pict:
self.handle_pictures(element, docx_obj, doc)
elif drawing_blip:
self.handle_pictures(element, docx_obj, drawing_blip, doc)
# Check for Text
elif tag_name in ["p"]:
self.handle_text_elements(element, docx_obj, doc)
@ -201,7 +197,6 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
label_str = ""
label_level = 0
if parts[0] == "Heading":
# print("{} - {}".format(parts[0], parts[1]))
label_str = parts[0]
label_level = self.str_to_int(parts[1], default=None)
if parts[1] == "Heading":
@ -217,19 +212,16 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if paragraph.text is None:
# _log.warn(f"paragraph has text==None")
return
text = paragraph.text.strip()
# if len(text)==0 # keep empty paragraphs, they separate adjacent lists!
# Common styles for bullet and numbered lists.
# "List Bullet", "List Number", "List Paragraph"
# TODO: reliably identify wether list is a numbered list or not
# Identify wether list is a numbered list or not
# is_numbered = "List Bullet" not in paragraph.style.name
is_numbered = False
p_style_name, p_level = self.get_label_and_level(paragraph)
numid, ilevel = self.get_numId_and_ilvl(paragraph)
# print("numid: {}, ilevel: {}, text: {}".format(numid, ilevel, text))
if numid == 0:
numid = None
@ -450,8 +442,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
for row in table.rows:
# Calculate the max number of columns
num_cols = max(num_cols, sum(get_colspan(cell) for cell in row.cells))
# if row.cells:
# num_cols = max(num_cols, len(row.cells))
if num_rows == 1 and num_cols == 1:
cell_element = table.rows[0].cells[0]
# In case we have a table of only 1 cell, we consider it furniture
# And proceed processing the content of the cell as though it's in the document body
self.walk_linear(cell_element._element, docx_obj, doc)
return
# Initialize the table grid
table_grid = [[None for _ in range(num_cols)] for _ in range(num_rows)]
@ -491,6 +488,24 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
doc.add_table(data=data, parent=self.parents[level - 1])
return
def handle_pictures(self, element, docx_obj, doc):
doc.add_picture(parent=self.parents[self.level], caption=None)
def handle_pictures(self, element, docx_obj, drawing_blip, doc):
def get_docx_image(element, drawing_blip):
rId = drawing_blip[0].get(
"{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"
)
if rId in docx_obj.part.rels:
# Access the image part using the relationship ID
image_part = docx_obj.part.rels[rId].target_part
image_data = image_part.blob # Get the binary image data
return image_data
image_data = get_docx_image(element, drawing_blip)
image_bytes = BytesIO(image_data)
# Open the BytesIO object with PIL to create an Image
pil_image = Image.open(image_bytes)
doc.add_picture(
parent=self.parents[self.level],
image=ImageRef.from_pil(image=pil_image, dpi=72),
caption=None,
)
return


@ -29,7 +29,7 @@ class PyPdfiumPageBackend(PdfPageBackend):
self._ppage: pdfium.PdfPage = pdfium_doc[page_no]
except PdfiumError as e:
_log.info(
f"An exception occured when loading page {page_no} of document {document_hash}.",
f"An exception occurred when loading page {page_no} of document {document_hash}.",
exc_info=True,
)
self.valid = False


@ -153,6 +153,13 @@ def convert(
..., help="If enabled, the bitmap content will be processed using OCR."
),
] = True,
force_ocr: Annotated[
bool,
typer.Option(
...,
help="Replace any existing text with OCR generated text over the full content.",
),
] = False,
ocr_engine: Annotated[
OcrEngine, typer.Option(..., help="The OCR engine to use.")
] = OcrEngine.EASYOCR,
@ -178,6 +185,15 @@ def convert(
output: Annotated[
Path, typer.Option(..., help="Output directory where results are saved.")
] = Path("."),
verbose: Annotated[
int,
typer.Option(
"--verbose",
"-v",
count=True,
help="Set the verbosity level. -v for info logging, -vv for debug logging.",
),
] = 0,
version: Annotated[
Optional[bool],
typer.Option(
@ -188,7 +204,12 @@ def convert(
),
] = None,
):
logging.basicConfig(level=logging.INFO)
if verbose == 0:
logging.basicConfig(level=logging.WARNING)
elif verbose == 1:
logging.basicConfig(level=logging.INFO)
elif verbose == 2:
logging.basicConfig(level=logging.DEBUG)
if from_formats is None:
from_formats = [e for e in InputFormat]
@ -219,11 +240,11 @@ def convert(
match ocr_engine:
case OcrEngine.EASYOCR:
ocr_options: OcrOptions = EasyOcrOptions()
ocr_options: OcrOptions = EasyOcrOptions(force_full_page_ocr=force_ocr)
case OcrEngine.TESSERACT_CLI:
ocr_options = TesseractCliOcrOptions()
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=force_ocr)
case OcrEngine.TESSERACT:
ocr_options = TesseractOcrOptions()
ocr_options = TesseractOcrOptions(force_full_page_ocr=force_ocr)
case _:
raise RuntimeError(f"Unexpected OCR engine type {ocr_engine}")
@ -280,5 +301,7 @@ def convert(
_log.info(f"All documents were converted in {end_time:.2f} seconds.")
click_app = typer.main.get_command(app)
if __name__ == "__main__":
app()
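
The `--verbose` counter maps `-v` to info logging and `-vv` to debug logging, with warnings as the default. A compact sketch of that mapping is shown below. The function name `level_for_verbosity` is hypothetical, and, unlike the branch chain in the diff, it also treats counts above 2 as debug, which is an assumption rather than the committed behavior.

```python
import logging


def level_for_verbosity(verbose: int) -> int:
    """Map a -v count to a logging level (0 -> WARNING, 1 -> INFO, 2+ -> DEBUG)."""
    if verbose <= 0:
        return logging.WARNING
    if verbose == 1:
        return logging.INFO
    return logging.DEBUG  # assumption: -vvv and beyond behave like -vv


for count in range(4):
    print(count, logging.getLevelName(level_for_verbosity(count)))
```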


@ -22,6 +22,7 @@ class TableStructureOptions(BaseModel):
class OcrOptions(BaseModel):
kind: str
force_full_page_ocr: bool = False # If enabled a full page OCR is always applied
bitmap_area_threshold: float = (
0.05 # percentage of the area for a bitmap to be processed with OCR
)


@ -2,7 +2,7 @@ import sys
from pathlib import Path
from pydantic import BaseModel
from pydantic_settings import BaseSettings
from pydantic_settings import BaseSettings, SettingsConfigDict
class DocumentLimits(BaseModel):
@ -40,6 +40,8 @@ class DebugSettings(BaseModel):
class AppSettings(BaseSettings):
model_config = SettingsConfigDict(env_prefix="DOCLING_", env_nested_delimiter="_")
perf: BatchConcurrencySettings
debug: DebugSettings


@ -10,7 +10,7 @@ from PIL import Image, ImageDraw
from rtree import index
from scipy.ndimage import find_objects, label
from docling.datamodel.base_models import OcrCell, Page
from docling.datamodel.base_models import Cell, OcrCell, Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import OcrOptions
from docling.datamodel.settings import settings
@ -73,7 +73,9 @@ class BaseOcrModel(BasePageModel):
coverage, ocr_rects = find_ocr_rects(page.size, bitmap_rects)
# return full-page rectangle if sufficiently covered with bitmaps
if coverage > max(BITMAP_COVERAGE_TRESHOLD, self.options.bitmap_area_threshold):
if self.options.force_full_page_ocr or coverage > max(
BITMAP_COVERAGE_TRESHOLD, self.options.bitmap_area_threshold
):
return [
BoundingBox(
l=0,
@ -96,7 +98,7 @@ class BaseOcrModel(BasePageModel):
return ocr_rects
# Filters OCR cells by dropping any OCR cell that intersects with an existing programmatic cell.
def filter_ocr_cells(self, ocr_cells, programmatic_cells):
def _filter_ocr_cells(self, ocr_cells, programmatic_cells):
# Create R-tree index for programmatic cells
p = index.Property()
p.dimension = 2
@ -117,6 +119,23 @@ class BaseOcrModel(BasePageModel):
]
return filtered_ocr_cells
def post_process_cells(self, ocr_cells, programmatic_cells):
r"""
Post-process the OCR and programmatic cells and return the final list of cells
"""
if self.options.force_full_page_ocr:
# If a full page OCR is forced, use only the OCR cells
cells = [
Cell(id=c_ocr.id, text=c_ocr.text, bbox=c_ocr.bbox)
for c_ocr in ocr_cells
]
return cells
## Remove OCR cells which overlap with programmatic cells.
filtered_ocr_cells = self._filter_ocr_cells(ocr_cells, programmatic_cells)
programmatic_cells.extend(filtered_ocr_cells)
return programmatic_cells
def draw_ocr_rects_and_cells(self, conv_res, page, ocr_rects, show: bool = False):
image = copy.deepcopy(page.image)
draw = ImageDraw.Draw(image, "RGBA")
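
The new `post_process_cells` either returns only the OCR cells (when full-page OCR is forced) or appends to the programmatic cells those OCR cells that do not intersect any of them. The sketch below illustrates that decision with simplified, hypothetical `Box` and `Cell` types. It replaces the R-tree index used in `_filter_ocr_cells` with a plain linear scan, so it shows the logic only, not the performance characteristics.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box (left, top, right, bottom)."""
    l: float
    t: float
    r: float
    b: float

    def intersects(self, other: "Box") -> bool:
        return self.l < other.r and other.l < self.r and self.t < other.b and other.t < self.b


@dataclass
class Cell:
    """Hypothetical text cell; not the docling Cell model."""
    text: str
    bbox: Box


def merge_cells(ocr_cells: List[Cell], programmatic_cells: List[Cell], force_full_page_ocr: bool) -> List[Cell]:
    if force_full_page_ocr:
        # Forced full-page OCR: the OCR cells replace the programmatic ones.
        return list(ocr_cells)
    # Otherwise drop OCR cells that overlap any programmatic cell (linear scan).
    kept = [
        c for c in ocr_cells
        if not any(c.bbox.intersects(p.bbox) for p in programmatic_cells)
    ]
    return list(programmatic_cells) + kept


prog = [Cell("p", Box(0, 0, 10, 10))]
ocr = [Cell("o1", Box(5, 5, 15, 15)), Cell("o2", Box(20, 20, 30, 30))]
print([c.text for c in merge_cells(ocr, prog, force_full_page_ocr=False)])
print([c.text for c in merge_cells(ocr, prog, force_full_page_ocr=True)])
```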


@ -43,7 +43,8 @@ class GlmModel:
def __init__(self, options: GlmOptions):
self.options = options
load_pretrained_nlp_models()
if self.options.model_names != "":
load_pretrained_nlp_models()
self.model = init_nlp_model(model_names=self.options.model_names)
def _to_legacy_document(self, conv_res) -> DsDocument:


@ -2,9 +2,10 @@ import logging
from typing import Iterable
import numpy
import torch
from docling_core.types.doc import BoundingBox, CoordOrigin
from docling.datamodel.base_models import OcrCell, Page
from docling.datamodel.base_models import Cell, OcrCell, Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import EasyOcrOptions
from docling.datamodel.settings import settings
@ -32,6 +33,7 @@ class EasyOcrModel(BaseOcrModel):
self.reader = easyocr.Reader(
lang_list=self.options.lang,
gpu=self.options.use_gpu,
model_storage_directory=self.options.model_storage_directory,
download_enabled=self.options.download_enabled,
)
@ -86,12 +88,8 @@ class EasyOcrModel(BaseOcrModel):
]
all_ocr_cells.extend(cells)
## Remove OCR cells which overlap with programmatic cells.
filtered_ocr_cells = self.filter_ocr_cells(
all_ocr_cells, page.cells
)
page.cells.extend(filtered_ocr_cells)
# Post-process the cells
page.cells = self.post_process_cells(all_ocr_cells, page.cells)
# DEBUG code:
if settings.debug.visualize_ocr:


@ -7,7 +7,7 @@ from typing import Iterable, Optional, Tuple
import pandas as pd
from docling_core.types.doc import BoundingBox, CoordOrigin
from docling.datamodel.base_models import OcrCell, Page
from docling.datamodel.base_models import Cell, OcrCell, Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import TesseractCliOcrOptions
from docling.datamodel.settings import settings
@ -170,12 +170,8 @@ class TesseractOcrCliModel(BaseOcrModel):
)
all_ocr_cells.append(cell)
## Remove OCR cells which overlap with programmatic cells.
filtered_ocr_cells = self.filter_ocr_cells(
all_ocr_cells, page.cells
)
page.cells.extend(filtered_ocr_cells)
# Post-process the cells
page.cells = self.post_process_cells(all_ocr_cells, page.cells)
# DEBUG code:
if settings.debug.visualize_ocr:


@ -3,7 +3,7 @@ from typing import Iterable
from docling_core.types.doc import BoundingBox, CoordOrigin
from docling.datamodel.base_models import OcrCell, Page
from docling.datamodel.base_models import Cell, OcrCell, Page
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import TesseractOcrOptions
from docling.datamodel.settings import settings
@ -22,25 +22,37 @@ class TesseractOcrModel(BaseOcrModel):
self.reader = None
if self.enabled:
setup_errmsg = (
install_errmsg = (
"tesserocr is not correctly installed. "
"Please install it via `pip install tesserocr` to use this OCR engine. "
"Note that tesserocr might have to be manually compiled for working with"
"Note that tesserocr might have to be manually compiled for working with "
"your Tesseract installation. The Docling documentation provides examples for it. "
"Alternatively, Docling has support for other OCR engines. See the documentation."
"Alternatively, Docling has support for other OCR engines. See the documentation: "
"https://ds4sd.github.io/docling/installation/"
)
missing_langs_errmsg = (
"tesserocr is not correctly configured. No language models have been detected. "
"Please ensure that the TESSDATA_PREFIX envvar points to tesseract languages dir. "
"You can find more information how to setup other OCR engines in Docling "
"documentation: "
"https://ds4sd.github.io/docling/installation/"
)
try:
import tesserocr
except ImportError:
raise ImportError(setup_errmsg)
raise ImportError(install_errmsg)
try:
tesseract_version = tesserocr.tesseract_version()
_log.debug("Initializing TesserOCR: %s", tesseract_version)
except:
raise ImportError(setup_errmsg)
raise ImportError(install_errmsg)
_, tesserocr_languages = tesserocr.get_languages()
if not tesserocr_languages:
raise ImportError(missing_langs_errmsg)
# Initialize the tesseractAPI
_log.debug("Initializing TesserOCR: %s", tesseract_version)
lang = "+".join(self.options.lang)
if self.options.path is not None:
self.reader = tesserocr.PyTessBaseAPI(
@ -128,12 +140,8 @@ class TesseractOcrModel(BaseOcrModel):
# del high_res_image
all_ocr_cells.extend(cells)
## Remove OCR cells which overlap with programmatic cells.
filtered_ocr_cells = self.filter_ocr_cells(
all_ocr_cells, page.cells
)
page.cells.extend(filtered_ocr_cells)
# Post-process the cells
page.cells = self.post_process_cells(all_ocr_cells, page.cells)
# DEBUG code:
if settings.debug.visualize_ocr:

Binary file not shown.


Binary file not shown.

9
docs/cli.md Normal file

@ -0,0 +1,9 @@
# CLI Reference
This page provides documentation for our command line tools.
::: mkdocs-click
:module: docling.cli.main
:command: click_app
:prog_name: docling
:style: table


@ -0,0 +1,19 @@
![docling_architecture](../assets/docling_arch.png)
In a nutshell, Docling's architecture is outlined in the diagram above.
For each document format, the *document converter* knows which format-specific *backend* to employ for parsing the document and which *pipeline* to use for orchestrating the execution, along with any relevant *options*.
!!! tip
While the document converter holds a default mapping, this configuration is parametrizable, so e.g. for the PDF format, different backends and different pipeline options can be used — see [Usage](../usage.md#adjust-pipeline-features).
The *conversion result* contains the [*Docling document*](./docling_document.md), Docling's fundamental document representation.
Some typical scenarios for using a Docling document include directly calling its *export methods*, such as for markdown, dictionary etc., or having it chunked by a *chunker*.
For more details on Docling's architecture, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
!!! note
The components illustrated with dashed outline indicate base classes that can be subclassed for specialized implementations.

1
docs/concepts/index.md Normal file

@ -0,0 +1 @@
Use the navigation on the left to browse some core Docling concepts.


@ -80,6 +80,20 @@ def main():
}
)
# Docling Parse with EasyOCR (CPU only)
# ----------------------
# pipeline_options = PdfPipelineOptions()
# pipeline_options.do_ocr = True
# pipeline_options.ocr_options.use_gpu = False # <-- set this.
# pipeline_options.do_table_structure = True
# pipeline_options.table_structure_options.do_cell_matching = True
# doc_converter = DocumentConverter(
# format_options={
# InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
# }
# )
# Docling Parse with Tesseract
# ----------------------
# pipeline_options = PdfPipelineOptions()


@ -0,0 +1,42 @@
from pathlib import Path
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
EasyOcrOptions,
PdfPipelineOptions,
TesseractCliOcrOptions,
TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
def main():
input_doc = Path("./tests/data/2206.01062.pdf")
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
# Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions
# ocr_options = EasyOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options
converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_options=pipeline_options,
)
}
)
doc = converter.convert(input_doc).document
md = doc.export_to_markdown()
print(md)
if __name__ == "__main__":
main()


@ -0,0 +1,288 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/hybrid_rag_qdrant\n",
".ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hybrid RAG with Qdrant"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This example demonstrates using Docling with [Qdrant](https://qdrant.tech/) to perform a hybrid search across your documents using dense and sparse vectors.\n",
"\n",
"We'll chunk the documents using Docling before adding them to a Qdrant collection. By limiting the length of the chunks, we can preserve the meaning in each vector embedding."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- 👉 The Qdrant client uses [FastEmbed](https://github.com/qdrant/fastembed) to generate vector embeddings. You can install the `fastembed-gpu` package if you have the hardware to support it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"[notice] A new release of pip is available: 24.2 -> 24.3.1\n",
"[notice] To update, run: pip install --upgrade pip\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --no-warn-conflicts -q qdrant-client docling docling-core fastembed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's import all the classes we'll be working with."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from docling_core.transforms.chunker import HierarchicalChunker\n",
"from qdrant_client import QdrantClient\n",
"\n",
"from docling.datamodel.base_models import InputFormat\n",
"from docling.document_converter import DocumentConverter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- For Docling, we'll set the allowed formats to HTML, since we'll only be working with web pages in this tutorial.\n",
"- If we set a sparse model, the Qdrant client will fuse the dense and sparse results using Reciprocal Rank Fusion (RRF). [Reference](https://qdrant.tech/documentation/tutorials/hybrid-search-fastembed/)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c1077c6634d9434584c41cc12f9107c9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 5 files: 0%| | 0/5 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "67069c07b73448d491944452159d10bc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 29 files: 0%| | 0/29 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"COLLECTION_NAME = \"docling\"\n",
"\n",
"doc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML])\n",
"client = QdrantClient(location=\":memory:\")\n",
"# The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI.\n",
"# For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant\n",
"# client = QdrantClient(location=\"http://localhost:6333\")\n",
"\n",
"client.set_model(\"sentence-transformers/all-MiniLM-L6-v2\")\n",
"client.set_sparse_model(\"Qdrant/bm25\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now download and chunk the document using Docling. For demonstration, we'll use an article about chunking strategies :)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"result = doc_converter.convert(\n",
" \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\"\n",
")\n",
"documents, metadatas = [], []\n",
"for chunk in HierarchicalChunker().chunk(result.document):\n",
" documents.append(chunk.text)\n",
" metadatas.append(chunk.meta.export_json_dict())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now upload the documents to Qdrant.\n",
"\n",
"- The `add()` method batches the documents and uses FastEmbed to generate vector embeddings on our machine."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['e74ae15be5eb4805858307846318e784',\n",
" 'f83f6125b0fa4a0595ae6a0777c9d90d',\n",
" '9cf63c7f30764715bf3804a19db36d7d',\n",
" '007dbe6d355b4b49af3b736cbd63a4d8',\n",
" 'e5e31f21f2e84aa68beca0dfc532cbe9',\n",
" '69c10816af204bb28630a1f957d8dd3e',\n",
" 'b63546b9b1744063bdb076b234d883ca',\n",
" '90ad15ba8fa6494489e1d3221e30bfcf',\n",
" '13517debb483452ea40fc7aa04c08c50',\n",
" '84ccab5cfab74e27a55acef1c63e3fad',\n",
" 'e8aa2ef46d234c5a8a9da64b701d60b4',\n",
" '190bea5ba43c45e792197c50898d1d90',\n",
" 'a730319ea65645ca81e735ace0bcc72e',\n",
" '415e7f6f15864e30b836e23ae8d71b43',\n",
" '5569bce4e65541868c762d149c6f491e',\n",
" '74d9b234e9c04ebeb8e4e1ca625789ac',\n",
" '308b1c5006a94a679f4c8d6f2396993c',\n",
" 'aaa5ec6d385a418388e660c425bf1dbe',\n",
" '630be8e43e4e4472a9cdb9af9462a43a',\n",
" '643b316224de4770a5349bf69cf93471',\n",
" 'da9265e6f6c2485493d15223eefdf411',\n",
" 'a916e447d52c4084b5ce81a0c5a65b07',\n",
" '2883c620858e4e728b88e127155a4f2c',\n",
" '2a998f0e9c124af99027060b94027874',\n",
" 'be551fbd2b9e42f48ebae0cbf1f481bc',\n",
" '95b7f7608e974ca6847097ee4590fba1',\n",
" '309db4f3863b4e3aaf16d5f346c309f3',\n",
" 'c818383267f64fd68b2237b024bd724e',\n",
" '1f16e78338c94238892171b400051cd4',\n",
" '25c680c3e064462cab071ea9bf1bad8c',\n",
" 'f41ab7e480a248c6bb87019341c7ca74',\n",
" 'd440128bed6d4dcb987152b48ecd9a8a',\n",
" 'c110d5dfdc5849808851788c2404dd15']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client.add(COLLECTION_NAME, documents=documents, metadata=metadatas, batch_size=64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Query Documents"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<=== Retrieved documents ===>\n",
"Document Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.\n",
"Document Specific Chunking can handle a variety of document formats, such as:\n",
"Consequently, there are also splitters available for this purpose.\n",
"1. We start at the top of the document, treating the first part as a chunk.\n",
"   2. We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.\n",
"    3. We keep this up until we reach the end of the document.\n",
"Have you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:\n",
"The goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.\n",
"To put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.\n",
"Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.\n",
"You can see here that with a chunk size of 105, the Markdown structure of the document is taken into account, and the chunks thus preserve the semantics of the text!\n",
"And there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.\n"
]
}
],
"source": [
"points = client.query(COLLECTION_NAME, query_text=\"Can I split documents?\", limit=10)\n",
"\n",
"print(\"<=== Retrieved documents ===>\")\n",
"for point in points:\n",
" print(point.document)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
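
The notebook above notes that, once a sparse model is set, the Qdrant client fuses the dense and sparse results using RRF. As a rough illustration of that idea only, here is a minimal sketch of reciprocal rank fusion in plain Python. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the sample document ids are assumptions made for this example; they are not taken from the notebook or from the Qdrant client's internals.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. The constant k is an assumed value here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense and a sparse search.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still keeps a place in the fused ranking.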

1
docs/examples/index.md Normal file
View File

@ -0,0 +1 @@
Use the navigation on the left to browse through examples covering a range of possible workflows and use cases.

View File

@ -2,9 +2,9 @@
<p align="center">
<img loading="lazy" alt="Docling" src="assets/docling_processing.png" width="100%" />
<a href="https://trendshift.io/repositories/12132" target="_blank"><img src="https://trendshift.io/api/badge/repositories/12132" alt="DS4SD%2Fdocling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>
[![arXiv](https://img.shields.io/badge/arXiv-2408.09869-b31b1b.svg)](https://arxiv.org/abs/2408.09869)
[![PyPI version](https://img.shields.io/pypi/v/docling)](https://pypi.org/project/docling/)
![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
@ -22,7 +22,16 @@ Docling parses documents and exports them to the desired format with ease and sp
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
* 📝 Metadata extraction, including title, authors, references & language
* 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
* 🤖 Easy integration with LlamaIndex 🦙 & LangChain 🦜🔗 for powerful RAG / QA applications
* 🔍 OCR support for scanned PDFs
* 💻 Simple and convenient CLI
### Coming soon
* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language
* 🦜🔗 Native LangChain extension
## IBM ❤️ Open Source AI
Docling has been brought to you by IBM.

View File

@ -0,0 +1,13 @@
## Get started
Docling is used by the [Data Prep Kit \[↗\]](https://ibm.github.io/data-prep-kit/), an open-source toolkit for preparing unstructured data for LLM application development at scales ranging from laptop to datacenter.
Below you will find the Data Prep Kit modules powered by Docling.
## PDF ingestion to Parquet
- 💻 [GitHub \[↗\]](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet)
- 📖 [API docs \[↗\]](https://ibm.github.io/data-prep-kit/transforms/language/pdf2parquet/python/)
## Document chunking
- 💻 [GitHub \[↗\]](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_chunk)
- 📖 [API docs \[↗\]](https://ibm.github.io/data-prep-kit/transforms/language/doc_chunk/python/)

View File

@ -0,0 +1 @@
Use the navigation on the left to browse through Docling integrations with popular frameworks and tools.

View File

@ -1,6 +1,6 @@
## Get started
Docling is available as an official LlamaIndex extension!
Docling is available as an official [LlamaIndex \[↗\]](https://docs.llamaindex.ai/) extension.
To get started, check out the [step-by-step guide in LlamaIndex \[↗\]](https://docs.llamaindex.ai/en/stable/examples/data_connectors/DoclingReaderDemo/)<!--{target="_blank"}-->.

View File

@ -22,50 +22,7 @@ A simple example would look like this:
docling https://arxiv.org/pdf/2206.01062
```
To see all available options (export formats etc.) run `docling --help`.
<details>
<summary><b>CLI reference</b></summary>
Here are the available options as of this writing (for an up-to-date listing, run `docling --help`):
```console
$ docling --help
Usage: docling [OPTIONS] source
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] │
│ [required] │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from [docx|pptx|html|image|pdf|asciidoc|md] Specify input formats to convert from. │
│ Defaults to all formats. │
│ [default: None] │
│ --to [md|json|text|doctags] Specify output formats. Defaults to │
│ Markdown. │
│ [default: None] │
│ --ocr --no-ocr If enabled, the bitmap content will be │
│ processed using OCR. │
│ [default: ocr] │
│ --ocr-engine [easyocr|tesseract_cli|tesseract] The OCR engine to use. │
│ [default: easyocr] │
│ --pdf-backend [pypdfium2|dlparse_v1|dlparse_v2] The PDF backend to use. │
│ [default: dlparse_v1] │
│ --table-mode [fast|accurate] The mode to use in the table structure │
│ model. │
│ [default: fast] │
│ --abort-on-error --no-abort-on-error If enabled, the bitmap content will be │
│ processed using OCR. │
│ [default: no-abort-on-error] │
│ --output PATH Output directory where results are │
│ saved. │
│ [default: .] │
│ --version Show version information. │
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
</details>
To see all available options (export formats etc.), run `docling --help`. For more details, see the [CLI reference page](./cli.md).
@ -161,7 +118,7 @@ from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter
buf = BytesIO(your_binary_stream)
source = DocumentStream(filename="my_doc.pdf", stream=buf)
source = DocumentStream(name="my_doc.pdf", stream=buf)
converter = DocumentConverter()
result = converter.convert(source)
```

View File

@ -39,7 +39,7 @@ theme:
- content.code.copy
- announce.dismiss
- navigation.tabs
# - navigation.indexes # <= if set, each "section" can have its own page, if index.md is used
- navigation.indexes # <= if set, each "section" can have its own page, if index.md is used
- navigation.instant
- navigation.instant.prefetch
# - navigation.instant.preview
@ -55,11 +55,15 @@ nav:
- Home: index.md
- Installation: installation.md
- Usage: usage.md
- CLI: cli.md
- Docling v2: v2.md
- Concepts:
- Concepts: concepts/index.md
- Architecture: concepts/architecture.md
- Docling Document: concepts/docling_document.md
# - Chunking: concepts/chunking.md
- Examples:
- Examples: examples/index.md
- Conversion:
- "Simple conversion": examples/minimal.py
- "Custom conversion": examples/custom_convert.py
@ -69,16 +73,20 @@ nav:
- "Figure enrichment": examples/develop_picture_enrichment.py
- "Table export": examples/export_tables.py
- "Multimodal export": examples/export_multimodal.py
- "Force full page OCR": examples/full_page_ocr.py
- RAG / QA:
- "RAG with LlamaIndex 🦙": examples/rag_llamaindex.ipynb
- "RAG with LangChain 🦜🔗": examples/rag_langchain.ipynb
- "Hybrid RAG with Qdrant": examples/hybrid_rag_qdrant.ipynb
# - Chunking:
# - Chunking: examples/chunking.md
# - CLI:
# - CLI: examples/cli.md
- Integrations:
- "LlamaIndex 🦙 extension": integrations/llamaindex.md
# - "LangChain 🦜🔗 extension": integrations/langchain.md
- Integrations: integrations/index.md
- "Data Prep Kit": integrations/data_prep_kit.md
- "LlamaIndex 🦙": integrations/llamaindex.md
# - "LangChain 🦜🔗": integrations/langchain.md
# - API reference:
# - API reference: api_reference/index.md
@ -92,9 +100,16 @@ markdown_extensions:
- admonition
- pymdownx.details
- attr_list
- mkdocs-click
plugins:
- search
- mkdocs-jupyter
# - mkdocstrings:
# default_handler: python
# options:
# preload_modules:
# - docling
# - docling_core
extra_css:
- stylesheets/extra.css

17
poetry.lock generated
View File

@ -2594,6 +2594,21 @@ watchdog = ">=2.0"
i18n = ["babel (>=2.9.0)"]
min-versions = ["babel (==2.9.0)", "click (==7.0)", "colorama (==0.4)", "ghp-import (==1.0)", "importlib-metadata (==4.4)", "jinja2 (==2.11.1)", "markdown (==3.3.6)", "markupsafe (==2.0.1)", "mergedeep (==1.3.4)", "mkdocs-get-deps (==0.2.0)", "packaging (==20.5)", "pathspec (==0.11.1)", "pyyaml (==5.1)", "pyyaml-env-tag (==0.1)", "watchdog (==2.0)"]
[[package]]
name = "mkdocs-click"
version = "0.8.1"
description = "An MkDocs extension to generate documentation for Click command line applications"
optional = false
python-versions = ">=3.7"
files = [
{file = "mkdocs_click-0.8.1-py3-none-any.whl", hash = "sha256:a100ff938be63911f86465a1c21d29a669a7c51932b700fdb3daa90d13b61ee4"},
{file = "mkdocs_click-0.8.1.tar.gz", hash = "sha256:0a88cce04870c5d70ff63138e2418219c3c4119cc928a59c66b76eb5214edba6"},
]
[package.dependencies]
click = ">=8.1"
markdown = ">=3.3"
[[package]]
name = "mkdocs-get-deps"
version = "0.2.0"
@ -7176,4 +7191,4 @@ tesserocr = ["tesserocr"]
[metadata]
lock-version = "2.0"
python-versions = "^3.10"
content-hash = "95357a52d305fc7dda3da7e397f20d6fe0d4050a90d904c1714536c5a005ea34"
content-hash = "9a7b0fe34d218e02da79cf62f27f7d2763dcebc92c2e791bc2814cf5d4de8cc2"

View File

@ -1,6 +1,6 @@
[tool.poetry]
name = "docling"
version = "2.4.0" # DO NOT EDIT, updated automatically
version = "2.5.2" # DO NOT EDIT, updated automatically
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Panos Vagenas <pva@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
license = "MIT"
@ -71,6 +71,7 @@ nbqa = "^1.9.0"
[tool.poetry.group.docs.dependencies]
mkdocs-material = "^9.5.40"
mkdocs-jupyter = "^0.25.0"
mkdocs-click = "^0.8.1"
[tool.poetry.group.examples.dependencies]
datasets = "^2.21.0"

Binary file not shown.

View File

@ -2,7 +2,7 @@ item-0 at level 0: unspecified: group _root_
item-1 at level 1: paragraph: Summer activities
item-2 at level 1: title: Swimming in the lake
item-3 at level 2: paragraph: Duck
item-4 at level 2: paragraph:
item-4 at level 2: picture
item-5 at level 2: paragraph: Figure 1: This is a cute duckling
item-6 at level 2: section_header: Lets swim!
item-7 at level 3: paragraph: To get started with swimming, fi ... down in a water and try not to drown:

File diff suppressed because one or more lines are too long

View File

@ -4,6 +4,8 @@ Summer activities
Duck
<!-- image -->
Figure 1: This is a cute duckling
## Lets swim!

View File

@ -15,34 +15,8 @@ from docling.document_converter import DocumentConverter, PdfFormatOption
from .verify_utils import verify_conversion_result_v1, verify_conversion_result_v2
GENERATE = False
# Debug
def save_output(pdf_path: Path, doc_result: ConversionResult, engine: str):
r""" """
import json
import os
parent = pdf_path.parent
eng = "" if engine is None else f".{engine}"
dict_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.json")
with open(dict_fn, "w") as fd:
json.dump(doc_result.legacy_document.export_to_dict(), fd)
pages_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.pages.json")
pages = [p.model_dump() for p in doc_result.pages]
with open(pages_fn, "w") as fd:
json.dump(pages, fd)
doctags_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.doctags.txt")
with open(doctags_fn, "w") as fd:
fd.write(doc_result.legacy_document.export_to_doctags())
md_fn = os.path.join(parent, f"{pdf_path.stem}{eng}.md")
with open(md_fn, "w") as fd:
fd.write(doc_result.legacy_document.export_to_markdown())
GENERATE_V1 = False
GENERATE_V2 = False
def get_pdf_paths():
@ -74,13 +48,15 @@ def get_converter(ocr_options: OcrOptions):
def test_e2e_conversions():
pdf_paths = get_pdf_paths()
engines: List[OcrOptions] = [
EasyOcrOptions(),
TesseractOcrOptions(),
TesseractCliOcrOptions(),
EasyOcrOptions(force_full_page_ocr=True),
TesseractOcrOptions(force_full_page_ocr=True),
TesseractCliOcrOptions(force_full_page_ocr=True),
]
for ocr_options in engines:
@ -91,20 +67,16 @@ def test_e2e_conversions():
doc_result: ConversionResult = converter.convert(pdf_path)
# Save conversions
# save_output(pdf_path, doc_result, None)
# Debug
verify_conversion_result_v1(
input_path=pdf_path,
doc_result=doc_result,
generate=GENERATE,
generate=GENERATE_V1,
fuzzy=True,
)
verify_conversion_result_v2(
input_path=pdf_path,
doc_result=doc_result,
generate=GENERATE,
generate=GENERATE_V2,
fuzzy=True,
)

View File

@ -256,15 +256,19 @@ def verify_conversion_result_v1(
    dt_path = gt_subpath.with_suffix(f"{engine_suffix}.doctags.txt")
    if generate:  # only used when re-generating truth
        pages_path.parent.mkdir(parents=True, exist_ok=True)
        with open(pages_path, "w") as fw:
            fw.write(json.dumps(doc_pred_pages, default=pydantic_encoder))
        json_path.parent.mkdir(parents=True, exist_ok=True)
        with open(json_path, "w") as fw:
            fw.write(json.dumps(doc_pred, default=pydantic_encoder))
        md_path.parent.mkdir(parents=True, exist_ok=True)
        with open(md_path, "w") as fw:
            fw.write(doc_pred_md)
        dt_path.parent.mkdir(parents=True, exist_ok=True)
        with open(dt_path, "w") as fw:
            fw.write(doc_pred_dt)
    else:  # default branch in test
@ -328,15 +332,19 @@ def verify_conversion_result_v2(
    dt_path = gt_subpath.with_suffix(f"{engine_suffix}.doctags.txt")
    if generate:  # only used when re-generating truth
        pages_path.parent.mkdir(parents=True, exist_ok=True)
        with open(pages_path, "w") as fw:
            fw.write(json.dumps(doc_pred_pages, default=pydantic_encoder))
        json_path.parent.mkdir(parents=True, exist_ok=True)
        with open(json_path, "w") as fw:
            fw.write(json.dumps(doc_pred, default=pydantic_encoder))
        md_path.parent.mkdir(parents=True, exist_ok=True)
        with open(md_path, "w") as fw:
            fw.write(doc_pred_md)
        dt_path.parent.mkdir(parents=True, exist_ok=True)
        with open(dt_path, "w") as fw:
            fw.write(doc_pred_dt)
    else:  # default branch in test