Merge branch 'release_v3' into nli/performance

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Christoph Auer 2024-12-09 16:52:54 +01:00
commit ce82e23b66
32 changed files with 902 additions and 157 deletions

.github/mergify.yml vendored

@ -6,7 +6,7 @@ merge_protections:
success_conditions:
- "title ~=
^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\\(.+\
\\))?:"
\\))?(!)?:"
- name: Require two reviewers for test updates
description: When test data is updated, we require two reviewers
if:


@ -1,3 +1,23 @@
## [v2.9.0](https://github.com/DS4SD/docling/releases/tag/v2.9.0) - 2024-12-09
### Feature
* Expose new hybrid chunker, update docs ([#384](https://github.com/DS4SD/docling/issues/384)) ([`c8ecdd9`](https://github.com/DS4SD/docling/commit/c8ecdd987e80227db3850ea729ecb36d2b609040))
* **MS Word backend:** Make detection of headers and other styles localization agnostic ([#534](https://github.com/DS4SD/docling/issues/534)) ([`3e073df`](https://github.com/DS4SD/docling/commit/3e073dfbebbc65f995d4df946c1650699a26782c))
### Fix
* Correcting DefaultText ID for MS Word backend ([#537](https://github.com/DS4SD/docling/issues/537)) ([`eb7ffcd`](https://github.com/DS4SD/docling/commit/eb7ffcdd1cda1caa8ec8ba2fc313ff1e7d9acd4f))
* Add `py.typed` marker file ([#531](https://github.com/DS4SD/docling/issues/531)) ([`9102fe1`](https://github.com/DS4SD/docling/commit/9102fe1adcd43432e5fb3f35af704b7442c5d633))
* Enable HTML export in CLI and add options for image mode ([#513](https://github.com/DS4SD/docling/issues/513)) ([`0d11e30`](https://github.com/DS4SD/docling/commit/0d11e30dd813020c0189de849cd7b2e285d08694))
* Missing text in docx (t tag) when embedded in a table ([#528](https://github.com/DS4SD/docling/issues/528)) ([`b730b2d`](https://github.com/DS4SD/docling/commit/b730b2d7a04a8773a00ed88889d28b0c476ba052))
* Restore pydantic version pin after fixes ([#512](https://github.com/DS4SD/docling/issues/512)) ([`c830b92`](https://github.com/DS4SD/docling/commit/c830b92b2e043ea63d216f65b3f9d88d2a8c33f7))
* Folder input in cli ([#511](https://github.com/DS4SD/docling/issues/511)) ([`8ada0bc`](https://github.com/DS4SD/docling/commit/8ada0bccc744df94f755adf71cf8b163e6304375))
### Documentation
* Document new integrations ([#532](https://github.com/DS4SD/docling/issues/532)) ([`e780333`](https://github.com/DS4SD/docling/commit/e7803334409a343a59c536c529a03d6f5cdbfe15))
## [v2.8.3](https://github.com/DS4SD/docling/releases/tag/v2.8.3) - 2024-12-03
### Fix


@ -4,7 +4,7 @@
</a>
</p>
# 🦆 Docling
# Docling
<p align="center">
<a href="https://trendshift.io/repositories/12132" target="_blank"><img src="https://trendshift.io/api/badge/repositories/12132" alt="DS4SD%2Fdocling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
@ -26,7 +26,7 @@ Docling parses documents and exports them to the desired format with ease and sp
## Features
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to Markdown and JSON
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
* 📑 Advanced PDF document understanding including page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
* 🤖 Easy integration with 🦙 LlamaIndex & 🦜🔗 LangChain for powerful RAG / QA applications


@ -210,12 +210,14 @@ class DoclingParseV2DocumentBackend(PdfDocumentBackend):
self.parser = pdf_parser_v2("fatal")
success = False
if isinstance(path_or_stream, BytesIO):
if isinstance(self.path_or_stream, BytesIO):
success = self.parser.load_document_from_bytesio(
self.document_hash, path_or_stream
self.document_hash, self.path_or_stream
)
elif isinstance(self.path_or_stream, Path):
success = self.parser.load_document(
self.document_hash, str(self.path_or_stream)
)
elif isinstance(path_or_stream, Path):
success = self.parser.load_document(self.document_hash, str(path_or_stream))
if not success:
raise RuntimeError(


@ -1,4 +1,5 @@
import logging
import re
from io import BytesIO
from pathlib import Path
from typing import Set, Union
@ -133,7 +134,6 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
def walk_linear(self, body, docx_obj, doc) -> DoclingDocument:
for element in body:
tag_name = etree.QName(element).localname
# Check for Inline Images (blip elements)
namespaces = {
"a": "http://schemas.openxmlformats.org/drawingml/2006/main",
@ -153,6 +153,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
self.handle_pictures(element, docx_obj, drawing_blip, doc)
# Check for Text
elif tag_name in ["p"]:
# "tcPr", "sectPr"
self.handle_text_elements(element, docx_obj, doc)
else:
_log.debug(f"Ignoring element in DOCX with tag: {tag_name}")
@ -166,6 +167,14 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
except ValueError:
return default
def split_text_and_number(self, input_string):
match = re.match(r"(\D+)(\d+)$|^(\d+)(\D+)", input_string)
if match:
parts = list(filter(None, match.groups()))
return parts
else:
return [input_string]
def get_numId_and_ilvl(self, paragraph):
# Access the XML element of the paragraph
numPr = paragraph._element.find(
@ -188,7 +197,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
def get_label_and_level(self, paragraph):
if paragraph.style is None:
return "Normal", None
label = paragraph.style.name
label = paragraph.style.style_id
if label is None:
return "Normal", None
if ":" in label:
@ -197,7 +206,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if len(parts) == 2:
return parts[0], int(parts[1])
parts = label.split(" ")
parts = self.split_text_and_number(label)
if "Heading" in label and len(parts) == 2:
parts.sort()
@ -219,14 +228,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
if paragraph.text is None:
return
text = paragraph.text.strip()
# if len(text)==0 # keep empty paragraphs, they separate adjacent lists!
# Common styles for bullet and numbered lists.
# "List Bullet", "List Number", "List Paragraph"
# Identify whether list is a numbered list or not
# is_numbered = "List Bullet" not in paragraph.style.name
is_numbered = False
p_style_name, p_level = self.get_label_and_level(paragraph)
p_style_id, p_level = self.get_label_and_level(paragraph)
numid, ilevel = self.get_numId_and_ilvl(paragraph)
if numid == 0:
@ -238,14 +246,14 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
element,
docx_obj,
doc,
p_style_name,
p_style_id,
p_level,
numid,
ilevel,
text,
is_numbered,
)
self.update_history(p_style_name, p_level, numid, ilevel)
self.update_history(p_style_id, p_level, numid, ilevel)
return
elif numid is None and self.prev_numid() is not None: # Close list
for key, val in self.parents.items():
@ -253,23 +261,23 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
self.parents[key] = None
self.level = self.level_at_new_list - 1
self.level_at_new_list = None
if p_style_name in ["Title"]:
if p_style_id in ["Title"]:
for key, val in self.parents.items():
self.parents[key] = None
self.parents[0] = doc.add_text(
parent=None, label=DocItemLabel.TITLE, text=text
)
elif "Heading" in p_style_name:
self.add_header(element, docx_obj, doc, p_style_name, p_level, text)
elif "Heading" in p_style_id:
self.add_header(element, docx_obj, doc, p_style_id, p_level, text)
elif p_style_name in [
elif p_style_id in [
"Paragraph",
"Normal",
"Subtitle",
"Author",
"Default Text",
"List Paragraph",
"List Bullet",
"DefaultText",
"ListParagraph",
"ListBullet",
"Quote",
]:
level = self.get_level()
@ -285,15 +293,13 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
label=DocItemLabel.PARAGRAPH, parent=self.parents[level - 1], text=text
)
self.update_history(p_style_name, p_level, numid, ilevel)
self.update_history(p_style_id, p_level, numid, ilevel)
return
def add_header(self, element, docx_obj, doc, curr_name, curr_level, text: str):
level = self.get_level()
if isinstance(curr_level, int):
if curr_level > level:
# add invisible group
for i in range(level, curr_level):
self.parents[i] = doc.add_group(
@ -301,9 +307,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
label=GroupLabel.SECTION,
name=f"header-{i}",
)
elif curr_level < level:
# remove the tail
for key, val in self.parents.items():
if key >= curr_level:
@ -314,7 +318,6 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
text=text,
level=curr_level,
)
else:
self.parents[self.level] = doc.add_heading(
parent=self.parents[self.level - 1],
@ -328,7 +331,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
element,
docx_obj,
doc,
p_style_name,
p_style_id,
p_level,
numid,
ilevel,
@ -346,7 +349,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
label=GroupLabel.LIST, name="list", parent=self.parents[level - 1]
)
# TODO: Set marker and enumerated arguments if this is an enumeration element.
# Set marker and enumerated arguments if this is an enumeration element.
self.listIter += 1
if is_numbered:
enum_marker = str(self.listIter) + "."
@ -365,8 +368,8 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
self.level_at_new_list + self.prev_indent() + 1,
self.level_at_new_list + ilevel + 1,
):
# TODO: determine if this is an unordered list or an ordered list.
# Set GroupLabel.ORDERED_LIST when it fits.
# Determine if this is an unordered list or an ordered list.
# Set GroupLabel.ORDERED_LIST when it fits.
self.listIter = 0
if is_numbered:
self.parents[i] = doc.add_group(
@ -467,6 +470,19 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
row_span = get_rowspan(cell)
col_span = get_colspan(cell)
cell_text = cell.text
# In case cell doesn't return text via docx library:
if len(cell_text) == 0:
cell_xml = cell._element
texts = [""]
for elem in cell_xml.iter():
if elem.tag.endswith("t"): # <w:t> tags that contain text
if elem.text:
texts.append(elem.text)
# Join the collected text
cell_text = " ".join(texts).strip()
# Find the next available column in the grid
while table_grid[row_idx][col_idx] is not None:
col_idx += 1
@ -477,15 +493,15 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
table_grid[row_idx + i][col_idx + j] = ""
cell = TableCell(
text=cell.text,
text=cell_text,
row_span=row_span,
col_span=col_span,
start_row_offset_idx=row_idx,
end_row_offset_idx=row_idx + row_span,
start_col_offset_idx=col_idx,
end_col_offset_idx=col_idx + col_span,
col_header=False, # col_header,
row_header=False, # ((not col_header) and html_cell.name=='th')
col_header=False,
row_header=False,
)
data.table_cells.append(cell)


@ -0,0 +1,12 @@
#
# Copyright IBM Corp. 2024 - 2024
# SPDX-License-Identifier: MIT
#
from docling_core.transforms.chunker.base import BaseChunk, BaseChunker, BaseMeta
from docling_core.transforms.chunker.hierarchical_chunker import (
DocChunk,
DocMeta,
HierarchicalChunker,
)
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker


@ -10,7 +10,9 @@ from pathlib import Path
from typing import Annotated, Dict, Iterable, List, Optional, Type
import typer
from docling_core.types.doc import ImageRefMode
from docling_core.utils.file import resolve_source_to_path
from pydantic import TypeAdapter, ValidationError
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
@ -88,9 +90,11 @@ def export_documents(
conv_results: Iterable[ConversionResult],
output_dir: Path,
export_json: bool,
export_html: bool,
export_md: bool,
export_txt: bool,
export_doctags: bool,
image_export_mode: ImageRefMode,
):
success_count = 0
@ -101,33 +105,45 @@ def export_documents(
success_count += 1
doc_filename = conv_res.input.file.stem
# Export Deep Search document JSON format:
# Export JSON format:
if export_json:
fname = output_dir / f"{doc_filename}.json"
with fname.open("w", encoding="utf8") as fp:
_log.info(f"writing JSON output to {fname}")
fp.write(json.dumps(conv_res.document.export_to_dict()))
_log.info(f"writing JSON output to {fname}")
conv_res.document.save_as_json(
filename=fname, image_mode=image_export_mode
)
# Export HTML format:
if export_html:
fname = output_dir / f"{doc_filename}.html"
_log.info(f"writing HTML output to {fname}")
conv_res.document.save_as_html(
filename=fname, image_mode=image_export_mode
)
# Export Text format:
if export_txt:
fname = output_dir / f"{doc_filename}.txt"
with fname.open("w", encoding="utf8") as fp:
_log.info(f"writing Text output to {fname}")
fp.write(conv_res.document.export_to_markdown(strict_text=True))
_log.info(f"writing TXT output to {fname}")
conv_res.document.save_as_markdown(
filename=fname,
strict_text=True,
image_mode=ImageRefMode.PLACEHOLDER,
)
# Export Markdown format:
if export_md:
fname = output_dir / f"{doc_filename}.md"
with fname.open("w", encoding="utf8") as fp:
_log.info(f"writing Markdown output to {fname}")
fp.write(conv_res.document.export_to_markdown())
_log.info(f"writing Markdown output to {fname}")
conv_res.document.save_as_markdown(
filename=fname, image_mode=image_export_mode
)
# Export Document Tags format:
if export_doctags:
fname = output_dir / f"{doc_filename}.doctags"
with fname.open("w", encoding="utf8") as fp:
_log.info(f"writing Doc Tags output to {fname}")
fp.write(conv_res.document.export_to_document_tokens())
_log.info(f"writing Doc Tags output to {fname}")
conv_res.document.save_as_document_tokens(filename=fname)
else:
_log.warning(f"Document {conv_res.input.file} failed to convert.")
@ -162,6 +178,13 @@ def convert(
to_formats: List[OutputFormat] = typer.Option(
None, "--to", help="Specify output formats. Defaults to Markdown."
),
image_export_mode: Annotated[
ImageRefMode,
typer.Option(
...,
help="Image export mode for the document (only in case of JSON, Markdown or HTML). With `placeholder`, only the position of the image is marked in the output. In `embedded` mode, the image is embedded as base64 encoded string. In `referenced` mode, the image is exported in PNG format and referenced from the main exported document.",
),
] = ImageRefMode.EMBEDDED,
ocr: Annotated[
bool,
typer.Option(
@ -187,7 +210,7 @@ def convert(
] = None,
pdf_backend: Annotated[
PdfBackend, typer.Option(..., help="The PDF backend to use.")
] = PdfBackend.DLPARSE_V1,
] = PdfBackend.DLPARSE_V2,
table_mode: Annotated[
TableFormerMode,
typer.Option(..., help="The mode to use in the table structure model."),
@ -266,24 +289,45 @@ def convert(
with tempfile.TemporaryDirectory() as tempdir:
input_doc_paths: List[Path] = []
for src in input_sources:
source = resolve_source_to_path(source=src, workdir=Path(tempdir))
if not source.exists():
try:
# check if we can fetch some remote url
source = resolve_source_to_path(source=src, workdir=Path(tempdir))
input_doc_paths.append(source)
except FileNotFoundError:
err_console.print(
f"[red]Error: The input file {source} does not exist.[/red]"
f"[red]Error: The input file {src} does not exist.[/red]"
)
raise typer.Abort()
elif source.is_dir():
for fmt in from_formats:
for ext in FormatToExtensions[fmt]:
input_doc_paths.extend(list(source.glob(f"**/*.{ext}")))
input_doc_paths.extend(list(source.glob(f"**/*.{ext.upper()}")))
else:
input_doc_paths.append(source)
except IsADirectoryError:
# if the input matches to a file or a folder
try:
local_path = TypeAdapter(Path).validate_python(src)
if local_path.exists() and local_path.is_dir():
for fmt in from_formats:
for ext in FormatToExtensions[fmt]:
input_doc_paths.extend(
list(local_path.glob(f"**/*.{ext}"))
)
input_doc_paths.extend(
list(local_path.glob(f"**/*.{ext.upper()}"))
)
elif local_path.exists():
input_doc_paths.append(local_path)
else:
err_console.print(
f"[red]Error: The input file {src} does not exist.[/red]"
)
raise typer.Abort()
except Exception as err:
err_console.print(f"[red]Error: Cannot read the input {src}.[/red]")
_log.info(err) # will print more details if verbose is activated
raise typer.Abort()
if to_formats is None:
to_formats = [OutputFormat.MARKDOWN]
export_json = OutputFormat.JSON in to_formats
export_html = OutputFormat.HTML in to_formats
export_md = OutputFormat.MARKDOWN in to_formats
export_txt = OutputFormat.TEXT in to_formats
export_doctags = OutputFormat.DOCTAGS in to_formats
@ -317,6 +361,13 @@ def convert(
)
pipeline_options.table_structure_options.mode = table_mode
if image_export_mode != ImageRefMode.PLACEHOLDER:
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = (
True # FIXME: to be deprecated in version 3
)
pipeline_options.images_scale = 2
if artifacts_path is not None:
pipeline_options.artifacts_path = artifacts_path
@ -329,11 +380,13 @@ def convert(
else:
raise RuntimeError(f"Unexpected PDF backend type {pdf_backend}")
pdf_format_option = PdfFormatOption(
pipeline_options=pipeline_options,
backend=backend, # pdf_backend
)
format_options: Dict[InputFormat, FormatOption] = {
InputFormat.PDF: PdfFormatOption(
pipeline_options=pipeline_options,
backend=backend, # pdf_backend
)
InputFormat.PDF: pdf_format_option,
InputFormat.IMAGE: pdf_format_option,
}
doc_converter = DocumentConverter(
allowed_formats=from_formats,
@ -351,9 +404,11 @@ def convert(
conv_results,
output_dir=output,
export_json=export_json,
export_html=export_html,
export_md=export_md,
export_txt=export_txt,
export_doctags=export_doctags,
image_export_mode=image_export_mode,
)
end_time = time.time() - start_time


@ -41,6 +41,7 @@ class InputFormat(str, Enum):
class OutputFormat(str, Enum):
MARKDOWN = "md"
JSON = "json"
HTML = "html"
TEXT = "text"
DOCTAGS = "doctags"


@ -208,7 +208,11 @@ class PdfPipelineOptions(PipelineOptions):
table_structure_options: TableStructureOptions = TableStructureOptions()
ocr_options: Union[
EasyOcrOptions, TesseractCliOcrOptions, TesseractOcrOptions, OcrMacOptions
EasyOcrOptions,
TesseractCliOcrOptions,
TesseractOcrOptions,
OcrMacOptions,
RapidOcrOptions,
] = Field(EasyOcrOptions(), discriminator="kind")
images_scale: float = 1.0


@ -9,7 +9,6 @@ from pydantic import BaseModel, ConfigDict, model_validator, validate_call
from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.backend.asciidoc_backend import AsciiDocBackend
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.backend.html_backend import HTMLDocumentBackend
from docling.backend.md_backend import MarkdownDocumentBackend
@ -90,7 +89,7 @@ class PdfFormatOption(FormatOption):
class ImageFormatOption(FormatOption):
pipeline_cls: Type = StandardPdfPipeline
backend: Type[AbstractDocumentBackend] = DoclingParseDocumentBackend
backend: Type[AbstractDocumentBackend] = DoclingParseV2DocumentBackend
def _get_default_option(format: InputFormat) -> FormatOption:
@ -114,10 +113,10 @@ def _get_default_option(format: InputFormat) -> FormatOption:
pipeline_cls=SimplePipeline, backend=HTMLDocumentBackend
),
InputFormat.IMAGE: FormatOption(
pipeline_cls=StandardPdfPipeline, backend=DoclingParseDocumentBackend
pipeline_cls=StandardPdfPipeline, backend=DoclingParseV2DocumentBackend
),
InputFormat.PDF: FormatOption(
pipeline_cls=StandardPdfPipeline, backend=DoclingParseDocumentBackend
pipeline_cls=StandardPdfPipeline, backend=DoclingParseV2DocumentBackend
),
}
if (options := format_to_default_options.get(format)) is not None:


@ -99,7 +99,9 @@ class StandardPdfPipeline(PaginatedPipeline):
local_dir: Optional[Path] = None, force: bool = False
) -> Path:
from huggingface_hub import snapshot_download
from huggingface_hub.utils import disable_progress_bars
disable_progress_bars()
download_path = snapshot_download(
repo_id="ds4sd/docling-models",
force_download=force,

docling/py.typed Normal file

@ -0,0 +1 @@

Binary file not shown (before: 443 KiB, after: 456 KiB).

Binary file not shown.

docs/concepts/chunking.md Normal file

@ -0,0 +1,65 @@
## Introduction
A *chunker* is a Docling abstraction that, given a
[`DoclingDocument`](./docling_document.md), returns a stream of chunks, each of which
captures some part of the document as a string accompanied by respective metadata.
To enable both flexibility for downstream applications and out-of-the-box utility,
Docling defines a chunker class hierarchy, providing a base type, `BaseChunker`, as well
as specific subclasses.
Docling integration with gen AI frameworks like LlamaIndex is done using the
`BaseChunker` interface, so users can easily plug in any built-in, self-defined, or
third-party `BaseChunker` implementation.
## Base Chunker
The `BaseChunker` base class defines the interface that any chunker should provide (a minimal usage sketch follows this list):
- `def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]`:
Returning the chunks for the provided document.
- `def serialize(self, chunk: BaseChunk) -> str`:
Returning the potentially metadata-enriched serialization of the chunk, typically
used to feed an embedding model (or generation model).
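A minimal sketch of how this contract is typically driven; the helper name `embed_ready_texts` is hypothetical, and `chunker` can be any `BaseChunker` implementation:

```python
from typing import Iterator

from docling_core.transforms.chunker.base import BaseChunker
from docling_core.types.doc import DoclingDocument


def embed_ready_texts(chunker: BaseChunker, dl_doc: DoclingDocument) -> Iterator[str]:
    # chunk() streams BaseChunk objects for the document; serialize() returns the
    # potentially metadata-enriched string form typically fed to an embedding model.
    for chunk in chunker.chunk(dl_doc=dl_doc):
        yield chunker.serialize(chunk=chunk)
```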
## Hybrid Chunker
!!! note "To access `HybridChunker`"
- If you are using the `docling` package, you can import as follows:
```python
from docling.chunking import HybridChunker
```
- If you are only using the `docling-core` package, make sure to install
the `chunking` extra, e.g.
```shell
pip install 'docling-core[chunking]'
```
and then you can import as follows:
```python
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
```
The `HybridChunker` implementation uses a hybrid approach, applying tokenization-aware
refinements on top of document-based [hierarchical](#hierarchical-chunker) chunking.
More precisely:
- it starts from the result of the hierarchical chunker and, based on the user-provided
tokenizer (typically to be aligned to the embedding model tokenizer), it:
- does one pass where it splits chunks only when needed (i.e. oversized w.r.t.
tokens), &
- another pass where it merges chunks only when possible (i.e. undersized successive
chunks with same headings & captions) — users can opt out of this step via param
`merge_peers` (by default `True`)
👉 Example: see [here](../../examples/hybrid_chunking).
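As a compact inline sketch complementing the linked example; the model ID, token budget, and document path below are placeholder sample values, not library defaults:

```python
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert(source="path/to/document.pdf").document

chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",  # a model name or a pre-loaded tokenizer instance
    max_tokens=64,      # optional, by default derived from the tokenizer
    merge_peers=True,   # optional, defaults to True
)

for chunk in chunker.chunk(dl_doc=doc):
    print(repr(chunk.text))                # raw chunk text
    print(chunker.serialize(chunk=chunk))  # metadata-enriched form, e.g. for embedding
```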
## Hierarchical Chunker
The `HierarchicalChunker` implementation uses the document structure information from
the [`DoclingDocument`](../docling_document) to create one chunk for each individual
detected document element, by default only merging together list items (can be opted out
via param `merge_list_items`). It also takes care of attaching all relevant document
metadata, including headers and captions.
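A minimal standalone sketch, assuming the `merge_list_items` flag mentioned above is a constructor argument and using a placeholder document path:

```python
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker

doc = DocumentConverter().convert(source="path/to/document.pdf").document

# One chunk per detected document element; successive list items are merged by default.
chunker = HierarchicalChunker(merge_list_items=True)

for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text)
    print(chunk.meta.headings, chunk.meta.captions)  # attached document metadata
```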


@ -0,0 +1,439 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hybrid Chunking"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -qU 'docling-core[chunking]' sentence-transformers transformers lancedb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conversion"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from docling.document_converter import DocumentConverter\n",
"\n",
"DOC_SOURCE = \"../../tests/data/md/wiki.md\"\n",
"\n",
"doc = DocumentConverter().convert(source=DOC_SOURCE).document"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how `tokenizer` and `embed_model` further below are single-sourced from `EMBED_MODEL_ID`.\n",
"\n",
"This is important for making sure the chunker and the embedding model are using the same tokenizer."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"from docling.chunking import HybridChunker\n",
"\n",
"EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
"MAX_TOKENS = 64\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)\n",
"\n",
"chunker = HybridChunker(\n",
" tokenizer=tokenizer, # can also just pass model name instead of tokenizer instance\n",
" max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer`\n",
" # merge_peers=True, # optional, defaults to True\n",
")\n",
"chunk_iter = chunker.chunk(dl_doc=doc)\n",
"chunks = list(chunk_iter)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Points to notice:\n",
"- Where possible, we fit the limit of 64 tokens for the metadata-enriched serialization form (see chunk 2)\n",
"- Where neeeded, we stop before the limit, e.g. see cases of 63 as it would otherwise run into a comma (see chunk 6)\n",
"- Where possible, we merge undersized peer chunks (see chunk 0)\n",
"- \"Tail\" chunks trailing right after merges may still be undersized (see chunk 8)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== 0 ===\n",
"chunk.text (55 tokens):\n",
"'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\n",
"chunker.serialize(chunk) (56 tokens):\n",
"'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\n",
"\n",
"=== 1 ===\n",
"chunk.text (45 tokens):\n",
"'IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n",
"chunker.serialize(chunk) (46 tokens):\n",
"'IBM\\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n",
"\n",
"=== 2 ===\n",
"chunk.text (63 tokens):\n",
"'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n",
"chunker.serialize(chunk) (64 tokens):\n",
"'IBM\\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n",
"\n",
"=== 3 ===\n",
"chunk.text (44 tokens):\n",
"\"IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n",
"chunker.serialize(chunk) (45 tokens):\n",
"\"IBM\\nIBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n",
"\n",
"=== 4 ===\n",
"chunk.text (63 tokens):\n",
"'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n",
"chunker.serialize(chunk) (64 tokens):\n",
"'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n",
"\n",
"=== 5 ===\n",
"chunk.text (61 tokens):\n",
"'IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n",
"chunker.serialize(chunk) (62 tokens):\n",
"'IBM\\nIBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n",
"\n",
"=== 6 ===\n",
"chunk.text (62 tokens):\n",
"\"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n",
"chunker.serialize(chunk) (63 tokens):\n",
"\"IBM\\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n",
"\n",
"=== 7 ===\n",
"chunk.text (63 tokens):\n",
"'language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n",
"chunker.serialize(chunk) (64 tokens):\n",
"'IBM\\nlanguage, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n",
"\n",
"=== 8 ===\n",
"chunk.text (5 tokens):\n",
"'Awards.[16]'\n",
"chunker.serialize(chunk) (6 tokens):\n",
"'IBM\\nAwards.[16]'\n",
"\n",
"=== 9 ===\n",
"chunk.text (56 tokens):\n",
"'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n",
"chunker.serialize(chunk) (60 tokens):\n",
"'IBM\\n1910s1950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n",
"\n",
"=== 10 ===\n",
"chunk.text (60 tokens):\n",
"\"(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n",
"chunker.serialize(chunk) (64 tokens):\n",
"\"IBM\\n1910s1950s\\n(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n",
"\n",
"=== 11 ===\n",
"chunk.text (59 tokens):\n",
"'Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n",
"chunker.serialize(chunk) (63 tokens):\n",
"'IBM\\n1910s1950s\\nComputing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n",
"\n",
"=== 12 ===\n",
"chunk.text (13 tokens):\n",
"'D.C.; and Toronto, Canada.[22]'\n",
"chunker.serialize(chunk) (17 tokens):\n",
"'IBM\\n1910s1950s\\nD.C.; and Toronto, Canada.[22]'\n",
"\n",
"=== 13 ===\n",
"chunk.text (60 tokens):\n",
"'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n",
"chunker.serialize(chunk) (64 tokens):\n",
"'IBM\\n1910s1950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n",
"\n",
"=== 14 ===\n",
"chunk.text (59 tokens):\n",
"\"on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n",
"chunker.serialize(chunk) (63 tokens):\n",
"\"IBM\\n1910s1950s\\non Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n",
"\n",
"=== 15 ===\n",
"chunk.text (23 tokens):\n",
"\"practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n",
"chunker.serialize(chunk) (27 tokens):\n",
"\"IBM\\n1910s1950s\\npractices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n",
"\n",
"=== 16 ===\n",
"chunk.text (59 tokens):\n",
"'He implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n",
"chunker.serialize(chunk) (63 tokens):\n",
"'IBM\\n1910s1950s\\nHe implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n",
"\n",
"=== 17 ===\n",
"chunk.text (60 tokens):\n",
"'\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n",
"chunker.serialize(chunk) (64 tokens):\n",
"'IBM\\n1910s1950s\\n\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n",
"\n",
"=== 18 ===\n",
"chunk.text (57 tokens):\n",
"'clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n",
"chunker.serialize(chunk) (61 tokens):\n",
"'IBM\\n1910s1950s\\nclumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n",
"\n",
"=== 19 ===\n",
"chunk.text (21 tokens):\n",
"'1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n",
"chunker.serialize(chunk) (25 tokens):\n",
"'IBM\\n1910s1950s\\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n",
"\n",
"=== 20 ===\n",
"chunk.text (22 tokens):\n",
"'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\n",
"chunker.serialize(chunk) (26 tokens):\n",
"'IBM\\n1960s1980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\n",
"\n"
]
}
],
"source": [
"for i, chunk in enumerate(chunks):\n",
" print(f\"=== {i} ===\")\n",
" txt_tokens = len(tokenizer.tokenize(chunk.text, max_length=None))\n",
" print(f\"chunk.text ({txt_tokens} tokens):\\n{repr(chunk.text)}\")\n",
"\n",
" ser_txt = chunker.serialize(chunk=chunk)\n",
" ser_tokens = len(tokenizer.tokenize(ser_txt, max_length=None))\n",
" print(f\"chunker.serialize(chunk) ({ser_tokens} tokens):\\n{repr(ser_txt)}\")\n",
"\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vector Retrieval"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
"To disable this warning, you can either:\n",
"\t- Avoid using `tokenizers` before the fork if possible\n",
"\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
]
}
],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"embed_model = SentenceTransformer(EMBED_MODEL_ID)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>vector</th>\n",
" <th>text</th>\n",
" <th>headings</th>\n",
" <th>captions</th>\n",
" <th>_distance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>[-0.1269039, -0.01948185, -0.07718097, -0.1116...</td>\n",
" <td>language, and the UPC barcode. The company has...</td>\n",
" <td>[IBM]</td>\n",
" <td>None</td>\n",
" <td>1.164613</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>[-0.10198064, 0.0055981805, -0.05095279, -0.13...</td>\n",
" <td>IBM originated with several technological inno...</td>\n",
" <td>[IBM, 1910s1950s]</td>\n",
" <td>None</td>\n",
" <td>1.245144</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>[-0.057121325, -0.034115084, -0.018113216, -0....</td>\n",
" <td>As one of the world's oldest and largest techn...</td>\n",
" <td>[IBM]</td>\n",
" <td>None</td>\n",
" <td>1.355586</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>[-0.04429054, -0.058111433, -0.009330196, -0.0...</td>\n",
" <td>IBM is the largest industrial research organiz...</td>\n",
" <td>[IBM]</td>\n",
" <td>None</td>\n",
" <td>1.398617</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>[-0.11920792, 0.053496413, -0.042391937, -0.03...</td>\n",
" <td>Awards.[16]</td>\n",
" <td>[IBM]</td>\n",
" <td>None</td>\n",
" <td>1.446295</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" vector \\\n",
"0 [-0.1269039, -0.01948185, -0.07718097, -0.1116... \n",
"1 [-0.10198064, 0.0055981805, -0.05095279, -0.13... \n",
"2 [-0.057121325, -0.034115084, -0.018113216, -0.... \n",
"3 [-0.04429054, -0.058111433, -0.009330196, -0.0... \n",
"4 [-0.11920792, 0.053496413, -0.042391937, -0.03... \n",
"\n",
" text headings \\\n",
"0 language, and the UPC barcode. The company has... [IBM] \n",
"1 IBM originated with several technological inno... [IBM, 1910s1950s] \n",
"2 As one of the world's oldest and largest techn... [IBM] \n",
"3 IBM is the largest industrial research organiz... [IBM] \n",
"4 Awards.[16] [IBM] \n",
"\n",
" captions _distance \n",
"0 None 1.164613 \n",
"1 None 1.245144 \n",
"2 None 1.355586 \n",
"3 None 1.398617 \n",
"4 None 1.446295 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pathlib import Path\n",
"from tempfile import mkdtemp\n",
"\n",
"import lancedb\n",
"\n",
"\n",
"def make_lancedb_index(db_uri, index_name, chunks, embedding_model):\n",
" db = lancedb.connect(db_uri)\n",
" data = []\n",
" for chunk in chunks:\n",
" embeddings = embedding_model.encode(chunker.serialize(chunk=chunk))\n",
" data_item = {\n",
" \"vector\": embeddings,\n",
" \"text\": chunk.text,\n",
" \"headings\": chunk.meta.headings,\n",
" \"captions\": chunk.meta.captions,\n",
" }\n",
" data.append(data_item)\n",
" tbl = db.create_table(index_name, data=data, exist_ok=True)\n",
" return tbl\n",
"\n",
"\n",
"db_uri = str(Path(mkdtemp()) / \"docling.db\")\n",
"index = make_lancedb_index(db_uri, doc.name, chunks, embed_model)\n",
"\n",
"sample_query = \"invent\"\n",
"sample_embedding = embed_model.encode(sample_query)\n",
"results = index.search(sample_embedding).limit(5)\n",
"\n",
"results.to_pandas()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -18,7 +18,7 @@ Docling parses documents and exports them to the desired format with ease and sp
## Features
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to Markdown and JSON
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
* 🤖 Easy integration with 🦙 LlamaIndex & 🦜🔗 LangChain for powerful RAG / QA applications


@ -1,7 +1,7 @@
Docling is available as an extraction backend in the [Bee][github] framework.
- 💻 [Bee GitHub][github]
- 📖 [Bee Docs][docs]
- 📖 [Bee docs][docs]
- 📦 [Bee NPM][package]
[github]: https://github.com/i-am-bee


@ -0,0 +1,6 @@
Docling is available in [Cloudera](https://www.cloudera.com/) through the *RAG Studio*
Accelerator for Machine Learning Projects (AMP).
- 💻 [RAG Studio AMP GitHub][github]
[github]: https://github.com/cloudera/CML_AMP_RAG_Studio


@ -1,13 +1,10 @@
## Get started
Docling is used by the [Data Prep Kit](https://ibm.github.io/data-prep-kit/) open-source toolkit for preparing unstructured data for LLM application development ranging from laptop scale to datacenter scale.
Below you find the Data Prep Kit modules powered by Docling.
## PDF ingestion to Parquet
## Components
### PDF ingestion to Parquet
- 💻 [PDF-to-Parquet GitHub](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet)
- 📖 [PDF-to-Parquet Docs](https://ibm.github.io/data-prep-kit/transforms/language/pdf2parquet/python/)
- 📖 [PDF-to-Parquet docs](https://ibm.github.io/data-prep-kit/transforms/language/pdf2parquet/python/)
## Document chunking
### Document chunking
- 💻 [Doc Chunking GitHub](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_chunk)
- 📖 [Doc Chunking Docs](https://ibm.github.io/data-prep-kit/transforms/language/doc_chunk/python/)
- 📖 [Doc Chunking docs](https://ibm.github.io/data-prep-kit/transforms/language/doc_chunk/python/)


@ -1,7 +1,7 @@
Docling is available as a file conversion method in [DocETL](https://github.com/ucbepic/docetl):
- 💻 [DocETL GitHub][github]
- 📖 [DocETL Docs][docs]
- 📖 [DocETL docs][docs]
- 📦 [DocETL PyPI][pypi]
[github]: https://github.com/ucbepic/docetl


@ -1,14 +1,13 @@
Docling is powering document processing in [InstructLab](https://instructlab.ai/),
Docling is powering document processing in [InstructLab][home],
enabling users to unlock the knowledge hidden in documents and present it to
InstructLab's fine-tuning for aligning AI models to the user's specific data.
More details can be found in this [blog post][blog].
- 🏠 [InstructLab Home][home]
- 🏠 [InstructLab home][home]
- 💻 [InstructLab GitHub][github]
- 🧑🏻‍💻 [InstructLab UI][ui]
- 📖 [InstructLab Docs][docs]
<!-- - 📝 [Blog post]() -->
- 📖 [InstructLab docs][docs]
[home]: https://instructlab.ai
[github]: https://github.com/instructlab


@ -1,8 +1,8 @@
Docling is available in [Kotaemon](https://cinnamon.github.io/kotaemon/) as the `DoclingReader` loader:
- 💻 [Kotaemon GitHub][github]
- 📖 [DoclingReader Docs][docs]
- ⚙️ [Docling Setup in Kotaemon][setup]
- 📖 [DoclingReader docs][docs]
- ⚙️ [Docling setup in Kotaemon][setup]
[github]: https://github.com/Cinnamon/kotaemon
[docs]: https://cinnamon.github.io/kotaemon/reference/loaders/docling_loader/


@ -1,5 +1,3 @@
## Get started
Docling is available as an official [LlamaIndex](https://docs.llamaindex.ai/) extension.
To get started, check out the [step-by-step guide in LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/data_connectors/DoclingReaderDemo/).
@ -11,7 +9,7 @@ To get started, check out the [step-by-step guide in LlamaIndex](https://docs.ll
Reads document files and uses Docling to populate LlamaIndex `Document` objects — either serializing Docling's data model (losslessly, e.g. as JSON) or exporting to a simplified format (lossily, e.g. as Markdown).
- 💻 [Docling Reader GitHub](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-docling)
- 📖 [Docling Reader Docs](https://docs.llamaindex.ai/en/stable/api_reference/readers/docling/)
- 📖 [Docling Reader docs](https://docs.llamaindex.ai/en/stable/api_reference/readers/docling/)
- 📦 [Docling Reader PyPI](https://pypi.org/project/llama-index-readers-docling/)
### Docling Node Parser
@ -19,5 +17,5 @@ Reads document files and uses Docling to populate LlamaIndex `Document` objects
Reads LlamaIndex `Document` objects populated in Docling's format by Docling Reader and, using its knowledge of the Docling format, parses them to LlamaIndex `Node` objects for downstream usage in LlamaIndex applications, e.g. as chunks for embedding.
- 💻 [Docling Node Parser GitHub](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/node_parser/llama-index-node-parser-docling)
- 📖 [Docling Node Parser Docs](https://docs.llamaindex.ai/en/stable/api_reference/node_parser/docling/)
- 📖 [Docling Node Parser docs](https://docs.llamaindex.ai/en/stable/api_reference/node_parser/docling/)
- 📦 [Docling Node Parser PyPI](https://pypi.org/project/llama-index-node-parser-docling/)


@ -1,9 +1,12 @@
Docling is available in [Prodigy][home] as a [Prodigy-PDF plugin][plugin] recipe.
- 🌐 [Prodigy Home][home]
- 🔌 [Prodigy-PDF Plugin][plugin]
- 🧑🏽‍🍳 [pdf-spans.manual Recipe][recipe]
More details can be found in this [blog post][blog].
- 🌐 [Prodigy home][home]
- 🔌 [Prodigy-PDF plugin][plugin]
- 🧑🏽‍🍳 [pdf-spans.manual recipe][recipe]
[home]: https://prodi.gy/
[plugin]: https://prodi.gy/docs/plugins#pdf
[recipe]: https://prodi.gy/docs/plugins#pdf-spans.manual
[blog]: https://explosion.ai/blog/pdfs-nlp-structured-data


@ -0,0 +1,10 @@
Docling is powering document processing in [Red Hat Enterprise Linux AI][home] (RHEL AI),
enabling users to unlock the knowledge hidden in documents and present it to
InstructLab's fine-tuning for aligning AI models to the user's specific data.
More details can be found in this [blog post][blog].
- 🏠 [RHEL AI home][home]
[home]: https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux/ai
[blog]: https://www.redhat.com/en/blog/docling-missing-document-processing-companion-generative-ai


@ -1,11 +1,12 @@
# spaCy
Docling is available in [spaCy](https://spacy.io/) as the *spaCy Layout* plugin.
Docling is available in [spaCy](https://spacy.io/) as the "SpaCy Layout" plugin:
More details can be found in this [blog post][blog].
- 💻 [SpacyLayout GitHub][github]
- 📖 [SpacyLayout Docs][docs]
- 📖 [SpacyLayout docs][docs]
- 📦 [SpacyLayout PyPI][pypi]
[github]: https://github.com/explosion/spacy-layout
[docs]: https://github.com/explosion/spacy-layout?tab=readme-ov-file#readme
[pypi]: https://pypi.org/project/spacy-layout/
[blog]: https://explosion.ai/blog/pdfs-nlp-structured-data


@ -0,0 +1,9 @@
Docling is available as a text extraction backend for [txtai](https://neuml.github.io/txtai/).
- 💻 [txtai GitHub][github]
- 📖 [txtai docs][docs]
- 📖 [txtai Docling backend][integration_docs]
[github]: https://github.com/neuml/txtai
[docs]: https://neuml.github.io/txtai
[integration_docs]: https://neuml.github.io/txtai/pipeline/data/filetohtml/#docling


@ -53,7 +53,7 @@ theme:
- toc.follow
nav:
- Home:
- "🦆 Docling": index.md
- "Docling": index.md
- Installation: installation.md
- Usage: usage.md
- CLI: cli.md
@ -63,7 +63,7 @@ nav:
- Concepts: concepts/index.md
- Architecture: concepts/architecture.md
- Docling Document: concepts/docling_document.md
# - Chunking: concepts/chunking.md
- Chunking: concepts/chunking.md
- Examples:
- Examples: examples/index.md
- Conversion:
@ -81,20 +81,24 @@ nav:
- "RAG with LlamaIndex 🦙": examples/rag_llamaindex.ipynb
- "RAG with LangChain 🦜🔗": examples/rag_langchain.ipynb
- "Hybrid RAG with Qdrant": examples/hybrid_rag_qdrant.ipynb
# - Chunking:
- Chunking:
- "Hybrid chunking": examples/hybrid_chunking.ipynb
# - Chunking: examples/chunking.md
# - CLI:
# - CLI: examples/cli.md
- Integrations:
- Integrations: integrations/index.md
- "🐝 Bee": integrations/bee.md
- "Cloudera": integrations/cloudera.md
- "Data Prep Kit": integrations/data_prep_kit.md
- "DocETL": integrations/docetl.md
- "🐶 InstructLab": integrations/instructlab.md
- "Kotaemon": integrations/kotaemon.md
- "🦙 LlamaIndex": integrations/llamaindex.md
- "Prodigy": integrations/prodigy.md
- "Red Hat Enterprise Linux AI": integrations/rhel_ai.md
- "spaCy": integrations/spacy.md
- "txtai": integrations/txtai.md
# - "LangChain 🦜🔗": integrations/langchain.md
- API reference:
- Document Converter: api_reference/document_converter.md

poetry.lock generated

@ -836,34 +836,49 @@ files = [
[[package]]
name = "deepsearch-glm"
version = "0.26.2"
version = "1.0.0"
description = "Graph Language Models"
optional = false
python-versions = "^3.9"
files = []
develop = false
python-versions = "<4.0,>=3.9"
files = [
{file = "deepsearch_glm-1.0.0-cp310-cp310-macosx_13_0_x86_64.whl", hash = "sha256:94792b57df7a1c4ba8b47ebd8f36ea0a090d4f27a4fba39bd7b166b6b537260a"},
{file = "deepsearch_glm-1.0.0-cp310-cp310-macosx_14_0_arm64.whl", hash = "sha256:ff46e352e96a2f56ce7ae4fdf04b271ee841c29ff159b1dec0e5ecaaadba8d4d"},
{file = "deepsearch_glm-1.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9d77d3d94d49641888aa15f3ad23e81158e791aa9d9608dd8168dc71788e56f3"},
{file = "deepsearch_glm-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:143de0fd111a570be12935d8799a2715fe1775d4dc4e256337860b429cee5d36"},
{file = "deepsearch_glm-1.0.0-cp310-cp310-win_amd64.whl", hash = "sha256:9f2872dd573cd2206ce7f9e2e6016c38b66d9ecbd983283ff5e8c6023813c311"},
{file = "deepsearch_glm-1.0.0-cp311-cp311-macosx_13_0_x86_64.whl", hash = "sha256:e64d94ff5209f0a11e8c75c6b28b033ef27b95a22c2fbcbd945e7fe8cc421545"},
{file = "deepsearch_glm-1.0.0-cp311-cp311-macosx_14_0_arm64.whl", hash = "sha256:a5702205677b768b51f881d15d933370f6ef3c826dfac3b9aa0b904d2e6c495a"},
{file = "deepsearch_glm-1.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0417a2ae998e1709f03458cfb9adb55423bb1328224eb055300796baa757879f"},
{file = "deepsearch_glm-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6f0e1efe9af0d28e9b473fe599246deb3a0be7c3d546a478da284747144d086a"},
{file = "deepsearch_glm-1.0.0-cp311-cp311-win_amd64.whl", hash = "sha256:807faf13eb0deea55a1951d479a85d5e20de0ff8b2e0b57b2f7939552759a426"},
{file = "deepsearch_glm-1.0.0-cp312-cp312-macosx_13_0_x86_64.whl", hash = "sha256:56d9575df9eceb8c2ae33e3d15e133924cc195714c3d268599b6f8414c1f6bb8"},
{file = "deepsearch_glm-1.0.0-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:51f5c6522f60ba73eb12eeb7217bd98d871ba7c078337a4059d05878d8baf2d6"},
{file = "deepsearch_glm-1.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c6211eaf497ad7cfcb68f80f9b5387940be0204fe149a9fc03988a95145f410a"},
{file = "deepsearch_glm-1.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1b003bf457fce61ea4de79e2d7d0228a1ae349f677eb6570e745f79d4429804f"},
{file = "deepsearch_glm-1.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:9d61f66048e6ab60fe9f84c823fd593bf8517755833bd9efb59156d77a2b42d0"},
{file = "deepsearch_glm-1.0.0-cp313-cp313-macosx_13_0_x86_64.whl", hash = "sha256:7d558e8b365c27ee665d0589165fd074fb252c73715f9cc6aeb4304a63683f37"},
{file = "deepsearch_glm-1.0.0-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:3199093a9472e5756214b9b6563f827c19c001c7dd8ae00e03eed1140c12930d"},
{file = "deepsearch_glm-1.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7f18d1ee68a0479592e0c714e6cbf9e2d0fa8edd692d580da64431c84cbef5c2"},
{file = "deepsearch_glm-1.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:62c1c0ea0a544219da15c017632f9e0be116ecdc335b865c6c5760429557fe23"},
{file = "deepsearch_glm-1.0.0-cp313-cp313-win_amd64.whl", hash = "sha256:962f393dcec2204de1a5cb0f635c65258bde2424ad2d4e0f5df770139c3958de"},
{file = "deepsearch_glm-1.0.0-cp39-cp39-macosx_13_0_x86_64.whl", hash = "sha256:4d328336950975c583d318a70e3511075d1ac1c599c2090a2a7928a4662fe8f2"},
{file = "deepsearch_glm-1.0.0-cp39-cp39-macosx_14_0_arm64.whl", hash = "sha256:748d077a4cacd714ff23a095c873549c176fa5ffe1a656be1bd11873148e58db"},
{file = "deepsearch_glm-1.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1c0953d1983e902327f0cc152ff8267056ec2699106eefc70a41eec6eebdbe1b"},
{file = "deepsearch_glm-1.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:105c50b2e5b8f9a6ea5fb0b755a9cd38a1fb12ecb07f1a13d1290ad3cdfeaa90"},
{file = "deepsearch_glm-1.0.0-cp39-cp39-win_amd64.whl", hash = "sha256:25bb899317f6af062083daa578f343c93a2b12755c174549fb58596de0bc7b9d"},
{file = "deepsearch_glm-1.0.0-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:e2315cc4ffe7032dada294a0cd72a47dbc6c0121fd07d4b5719f9a9e9519d091"},
{file = "deepsearch_glm-1.0.0-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:707b92f51bacbd0f799ee3351474766bf916ef82f97c1bcc0e7696532ba03535"},
{file = "deepsearch_glm-1.0.0.tar.gz", hash = "sha256:e8dce88ac519a693c260f28bd3c4ec409811e65ade84fb508f6c6e37ca065e62"},
]
[package.dependencies]
docling-core = "^2.0"
docutils = "!=0.21"
numpy = ">=1.24.4,<3.0.0"
pandas = ">=1.5.1,<3.0.0"
python-dotenv = "^1.0.0"
pywin32 = {version = "^307", markers = "sys_platform == \"win32\""}
requests = "^2.32.3"
rich = "^13.7.0"
tabulate = ">=0.8.9"
tqdm = "^4.64.0"
pywin32 = {version = ">=307,<308", markers = "sys_platform == \"win32\""}
[package.extras]
docling = ["docling-core (>=2.0,<3.0)", "pandas (>=1.5.1,<3.0.0)"]
pyplot = ["matplotlib (>=3.7.1,<4.0.0)"]
toolkit = ["deepsearch-toolkit (>=1.1.0,<2.0.0)"]
[package.source]
type = "git"
url = "ssh://git@github.com/DS4SD/deepsearch-glm.git"
reference = "cau/layout-processing-children-payloads"
resolved_reference = "8fac776c07fb7541d17ebc9db48c9900074f25b1"
toolkit = ["deepsearch-toolkit (>=1.1.0,<2.0.0)", "python-dotenv (>=1.0.0,<2.0.0)"]
utils = ["pandas (>=1.5.1,<3.0.0)", "python-dotenv (>=1.0.0,<2.0.0)", "requests (>=2.32.3,<3.0.0)", "rich (>=13.7.0,<14.0.0)", "tabulate (>=0.8.9)", "tqdm (>=4.64.0,<5.0.0)"]
[[package]]
name = "defusedxml"
@ -904,28 +919,29 @@ files = [
[[package]]
name = "docling-core"
version = "2.6.1"
version = "2.9.0"
description = "A python library to define and validate data types in Docling."
optional = false
python-versions = "^3.9"
files = []
develop = false
python-versions = "<4.0,>=3.9"
files = [
{file = "docling_core-2.9.0-py3-none-any.whl", hash = "sha256:b44b077db5d2ac8a900f30a15abe329c165b1f2eb7f1c90d1275c423c1c3d668"},
{file = "docling_core-2.9.0.tar.gz", hash = "sha256:1bf12fe67ee4852330e9bac33fe62b45598ff885481e03a88fa8e1bf48252424"},
]
[package.dependencies]
jsonref = "^1.1.0"
jsonschema = "^4.16.0"
pandas = "^2.1.4"
pillow = "^10.3.0"
pydantic = ">=2.6.0,<2.10"
jsonref = ">=1.1.0,<2.0.0"
jsonschema = ">=4.16.0,<5.0.0"
pandas = ">=2.1.4,<3.0.0"
pillow = ">=10.3.0,<11.0.0"
pydantic = ">=2.6.0,<2.10.0 || >2.10.0,<2.10.1 || >2.10.1,<2.10.2 || >2.10.2,<3.0.0"
pyyaml = ">=5.1,<7.0.0"
tabulate = "^0.9.0"
typing-extensions = "^4.12.2"
semchunk = {version = ">=2.2.0,<3.0.0", optional = true, markers = "extra == \"chunking\""}
tabulate = ">=0.9.0,<0.10.0"
transformers = {version = ">=4.34.0,<5.0.0", optional = true, markers = "extra == \"chunking\""}
typing-extensions = ">=4.12.2,<5.0.0"
[package.source]
type = "git"
url = "ssh://git@github.com/DS4SD/docling-core.git"
reference = "feat-add-legacy-convert"
resolved_reference = "4434b1073dc15fefb75f28c37299abd32d9c532f"
[package.extras]
chunking = ["semchunk (>=2.2.0,<3.0.0)", "transformers (>=4.34.0,<5.0.0)"]
[[package]]
name = "docling-ibm-models"
@ -956,25 +972,47 @@ resolved_reference = "c1bed7d5451ee16b7fb5b0bc5e847f599ed93aa7"
[[package]]
name = "docling-parse"
version = "2.1.2"
version = "3.0.0"
description = "Simple package to extract text with coordinates from programmatic PDFs"
optional = false
python-versions = "^3.9"
files = []
develop = false
python-versions = "<4.0,>=3.9"
files = [
{file = "docling_parse-3.0.0-cp310-cp310-macosx_13_0_x86_64.whl", hash = "sha256:8de583f9562549379b8878f4054c17a715ac492999187855a6178c258388d1c6"},
{file = "docling_parse-3.0.0-cp310-cp310-macosx_14_0_arm64.whl", hash = "sha256:0a504152836b52119c84ce6f2124006b2297eca9576c1e961745f774b8f55f59"},
{file = "docling_parse-3.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e73836d75127b168073e76a4170ec615ee49d6d46ac37d1a3f9d5c585b2c4363"},
{file = "docling_parse-3.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1fdff7e14e50c0f66350346082f1fdf6cbc0584bef809532075593fa0c2a2ab2"},
{file = "docling_parse-3.0.0-cp310-cp310-win_amd64.whl", hash = "sha256:f56ae44328f7242e7420330d3d737d5284ec256af8ecd0b02fe6e34719b3040a"},
{file = "docling_parse-3.0.0-cp311-cp311-macosx_13_0_x86_64.whl", hash = "sha256:f228587e0d3a8f46fec46934e324d74be90d7f1ad96579c775644b130f28acdb"},
{file = "docling_parse-3.0.0-cp311-cp311-macosx_14_0_arm64.whl", hash = "sha256:25da7fa46449386956906f04cad5e9bec87816c00146caaef1112c8cdda6b79c"},
{file = "docling_parse-3.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:787c200081af2fb2d267d8f404a1b57464ee2fbcda4abd8d7bab99244c1716cb"},
{file = "docling_parse-3.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:be7a28e7a3ae6e198722dbb29341956c565ab9d8fdbddaee91f81dc21d870dde"},
{file = "docling_parse-3.0.0-cp311-cp311-win_amd64.whl", hash = "sha256:4251888da7c0ff946ce77ea8f14a0896ffe24b79422155db5871b7ee1b9fbc0a"},
{file = "docling_parse-3.0.0-cp312-cp312-macosx_13_0_x86_64.whl", hash = "sha256:642e47bdf090b89766e035b74cc849abffe0df520f2907ff4dede5c819b31d4a"},
{file = "docling_parse-3.0.0-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:731de22e279af1505f962dc10102b6405bcaac3d855657bf3542048e7182b440"},
{file = "docling_parse-3.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:afd553a715e6282fc5aadd3bfd402faab4e43b77f4952bd065e3941218118f39"},
{file = "docling_parse-3.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6cfb02830a918958a47144ca13ce985f09578a353c97da941935591e8917f432"},
{file = "docling_parse-3.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:85ca7610e5debcfc37e7b6311f4fc7c62c9d0eeea11b8bf2b33a760e65dd64fe"},
{file = "docling_parse-3.0.0-cp313-cp313-macosx_13_0_x86_64.whl", hash = "sha256:9171180b509a41856d1e32e1486934eaf1460575a5d86fa3a8941cb01e2955ac"},
{file = "docling_parse-3.0.0-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:12c5fbeb41f491b75d77e055304fc931b723d28fab29e4c4cb2a113201a86918"},
{file = "docling_parse-3.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:83744522c1994ef2fe888865876515e28627ddfce396a119db3cb196a1a99a75"},
{file = "docling_parse-3.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9857d8982bb7a7b51e7cefdd01613a7979e66c9c3ed40ea151e979b0fc2fc5e3"},
{file = "docling_parse-3.0.0-cp313-cp313-win_amd64.whl", hash = "sha256:1ff51e5153d164b957bf6284987d805ff1b43559a0244265d1788c0034cb899a"},
{file = "docling_parse-3.0.0-cp39-cp39-macosx_13_0_x86_64.whl", hash = "sha256:a15efbef123b100a58425fa7073121e7bf0cb8433814bac200df416c4eb9e599"},
{file = "docling_parse-3.0.0-cp39-cp39-macosx_14_0_arm64.whl", hash = "sha256:1155d6ca8310e046e18c6a6dc7b7f57e0ed6c89791d3757db2a039f7f69694a6"},
{file = "docling_parse-3.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:159c12370d6dfbe3e572f43a6a2804ee81d7f073d0bd7e5ca08d9acd1876aa83"},
{file = "docling_parse-3.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:351f4d718485f44686d41d04b26867a429898dbb6ccfe43454adaae3a434d919"},
{file = "docling_parse-3.0.0-cp39-cp39-win_amd64.whl", hash = "sha256:9172c98615c85303a231b800dfb2e4c1e539b04e383dfc5d7f0dc5f708ea50fd"},
{file = "docling_parse-3.0.0-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:1ba1c3469a38b404123bb615e220c046496d5d47e161cc5af7ae749e8cf181ab"},
{file = "docling_parse-3.0.0-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:eb315b0af70757f2cba654b1629272ccb35a1a416facf552ff72fd89abe98967"},
{file = "docling_parse-3.0.0.tar.gz", hash = "sha256:62a50d0fc4bb437ba840fb0419a466361d93071f300ae5f0cebe9b842ef0c8d4"},
]
[package.dependencies]
autoflake = "^2.3.1"
pillow = "^10.4.0"
autoflake = ">=2.3.1,<3.0.0"
pillow = ">=10.4.0,<11.0.0"
pywin32 = {version = ">=305", markers = "sys_platform == \"win32\""}
tabulate = ">=0.9.0,<1.0.0"
[package.source]
type = "git"
url = "ssh://git@github.com/DS4SD/docling-parse.git"
reference = "dev/expose-cell-sanitisation-via-python"
resolved_reference = "8ea65ae3080db88f54f8a3f7b622e7b002c9b7f0"
[[package]]
name = "docutils"
version = "0.21.2"
@ -2817,6 +2855,32 @@ files = [
{file = "more_itertools-10.5.0-py3-none-any.whl", hash = "sha256:037b0d3203ce90cca8ab1defbbdac29d5f993fc20131f3664dc8d6acfa872aef"},
]
[[package]]
name = "mpire"
version = "2.10.2"
description = "A Python package for easy multiprocessing, but faster than multiprocessing"
optional = false
python-versions = "*"
files = [
{file = "mpire-2.10.2-py3-none-any.whl", hash = "sha256:d627707f7a8d02aa4c7f7d59de399dec5290945ddf7fbd36cbb1d6ebb37a51fb"},
{file = "mpire-2.10.2.tar.gz", hash = "sha256:f66a321e93fadff34585a4bfa05e95bd946cf714b442f51c529038eb45773d97"},
]
[package.dependencies]
multiprocess = [
{version = "*", optional = true, markers = "python_version < \"3.11\" and extra == \"dill\""},
{version = ">=0.70.15", optional = true, markers = "python_version >= \"3.11\" and extra == \"dill\""},
]
pygments = ">=2.0"
pywin32 = {version = ">=301", markers = "platform_system == \"Windows\""}
tqdm = ">=4.27"
[package.extras]
dashboard = ["flask"]
dill = ["multiprocess", "multiprocess (>=0.70.15)"]
docs = ["docutils (==0.17.1)", "sphinx (==3.2.1)", "sphinx-autodoc-typehints (==1.11.0)", "sphinx-rtd-theme (==0.5.0)", "sphinx-versions (==1.0.1)", "sphinxcontrib-images (==0.9.2)"]
testing = ["ipywidgets", "multiprocess", "multiprocess (>=0.70.15)", "numpy", "pywin32 (>=301)", "rich"]
[[package]]
name = "mpmath"
version = "1.3.0"
@ -3760,10 +3824,10 @@ files = [
numpy = [
{version = ">=1.21.0", markers = "python_version == \"3.9\" and platform_system == \"Darwin\" and platform_machine == \"arm64\""},
{version = ">=1.19.3", markers = "platform_system == \"Linux\" and platform_machine == \"aarch64\" and python_version >= \"3.8\" and python_version < \"3.10\" or python_version > \"3.9\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_system != \"Darwin\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_machine != \"arm64\" and python_version < \"3.10\""},
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
{version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\" and python_version < \"3.11\""},
{version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\" and python_version < \"3.11\""},
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
]
[[package]]
@ -3786,10 +3850,10 @@ files = [
numpy = [
{version = ">=1.21.0", markers = "python_version == \"3.9\" and platform_system == \"Darwin\" and platform_machine == \"arm64\""},
{version = ">=1.19.3", markers = "platform_system == \"Linux\" and platform_machine == \"aarch64\" and python_version >= \"3.8\" and python_version < \"3.10\" or python_version > \"3.9\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_system != \"Darwin\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_machine != \"arm64\" and python_version < \"3.10\""},
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
{version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\" and python_version < \"3.11\""},
{version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\" and python_version < \"3.11\""},
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
]
[[package]]
@ -3970,8 +4034,8 @@ files = [
[package.dependencies]
numpy = [
{version = ">=1.22.4", markers = "python_version < \"3.11\""},
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
{version = ">=1.23.2", markers = "python_version == \"3.11\""},
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
]
python-dateutil = ">=2.8.2"
pytz = ">=2020.1"
@ -6106,6 +6170,21 @@ files = [
cryptography = ">=2.0"
jeepney = ">=0.6"
[[package]]
name = "semchunk"
version = "2.2.0"
description = "A fast and lightweight Python library for splitting text into semantically meaningful chunks."
optional = false
python-versions = ">=3.9"
files = [
{file = "semchunk-2.2.0-py3-none-any.whl", hash = "sha256:7db19ca90ddb48f99265e789e07a7bb111ae25185f9cc3d44b94e1e61b9067fc"},
{file = "semchunk-2.2.0.tar.gz", hash = "sha256:4de761ce614036fa3bea61adbe47e3ade7c96ac9b062f223b3ac353dbfd26743"},
]
[package.dependencies]
mpire = {version = "*", extras = ["dill"]}
tqdm = "*"
[[package]]
name = "semver"
version = "2.13.0"
@ -7644,4 +7723,4 @@ tesserocr = ["tesserocr"]
[metadata]
lock-version = "2.0"
python-versions = "^3.9"
content-hash = "0d9d498f50601c95a8616797441f00597acdea1e6a70d3b9642c17ffacc1bb45"
content-hash = "6917af8d76aa1f85a159f0ab9546478b4bef194ae726c79196bac087c7368fef"

View File

@ -1,6 +1,6 @@
[tool.poetry]
name = "docling"
version = "2.8.3" # DO NOT EDIT, updated automatically
version = "2.9.0" # DO NOT EDIT, updated automatically
description = "SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications."
authors = ["Christoph Auer <cau@zurich.ibm.com>", "Michele Dolfi <dol@zurich.ibm.com>", "Maxim Lysak <mly@zurich.ibm.com>", "Nikos Livathinos <nli@zurich.ibm.com>", "Ahmed Nassar <ahn@zurich.ibm.com>", "Panos Vagenas <pva@zurich.ibm.com>", "Peter Staar <taa@zurich.ibm.com>"]
license = "MIT"
@ -25,11 +25,11 @@ packages = [{include = "docling"}]
# actual dependencies:
######################
python = "^3.9"
pydantic = ">=2.0.0,<2.10"
docling-core = { git = "ssh://git@github.com/DS4SD/docling-core.git", branch = "feat-add-legacy-convert" }
docling-ibm-models = { git = "ssh://git@github.com/DS4SD/docling-ibm-models.git", branch = "nli/performance" }
deepsearch-glm = { git = "ssh://git@github.com/DS4SD/deepsearch-glm.git", branch = "cau/layout-processing-children-payloads" }
docling-parse = { git = "ssh://git@github.com/DS4SD/docling-parse.git", branch = "dev/expose-cell-sanitisation-via-python" }
deepsearch-glm = "^1.0.0"
docling-parse = "^3.0.0"
docling-core = { version = "^2.9.0", extras = ["chunking"] }
pydantic = "^2.0.0"
filetype = "^1.2.0"
pypdfium2 = "^4.30.0"
pydantic-settings = "^2.3.0"
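For context, a minimal usage sketch of the dependency set declared above (docling together with the `docling-core` "chunking" extra). The module paths, the `HybridChunker` defaults, and the sample input path are assumptions for illustration, not part of this commit:

```python
# Minimal sketch (assumption: docling with docling-core[chunking] installed).
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker

# Convert any supported input format into a DoclingDocument.
converter = DocumentConverter()
result = converter.convert("tests/data/md/wiki.md")  # hypothetical sample path

# HybridChunker is provided via the "chunking" extra (semchunk + transformers);
# the default tokenizer settings are assumed here rather than configured.
chunker = HybridChunker()
for chunk in chunker.chunk(result.document):
    print(chunk.text[:80])
```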

23
tests/data/md/wiki.md Normal file
View File

@ -0,0 +1,23 @@
# IBM
International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.
It is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.
IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.
IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed "International Business Machines" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]
IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer (its DOS software provided by Microsoft), which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s, IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.
As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing Awards.[16]
## 1910s–1950s
IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington, D.C.; and Toronto, Canada.[22]
Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:105 He implemented sales conventions, "generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker".[25][26] His favorite slogan, "THINK", became a mantra for each company's employees.[25] During Watson's first four years, revenues reached $9 million ($158 million today) and the company's operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the clumsy hyphenated name "Computing-Tabulating-Recording Company" and chose to replace it with the more expansive title "International Business Machines" which had previously been used as the name of CTR's Canadian Division;[27] the name was changed on February 14, 1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.
## 1960s–1980s
In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.