{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Docling","text":"
Docling simplifies document processing, parsing diverse formats \u2014 including advanced PDF understanding \u2014 and providing seamless integrations with the gen AI ecosystem.
"},{"location":"#features","title":"Features","text":"Do you want to leverage the power of AI and get a live support on Docling? Try out the Chat with Dosu functionalities provided by our friends at Dosu.
"},{"location":"#lf-ai-data","title":"LF AI & Data","text":"Docling is hosted as a project in the LF AI & Data Foundation.
"},{"location":"#ibm-open-source-ai","title":"IBM \u2764\ufe0f Open Source AI","text":"The project was started by the AI for knowledge team at IBM Research Zurich.
"},{"location":"v2/","title":"V2","text":""},{"location":"v2/#whats-new","title":"What's new","text":"Docling v2 introduces several new features:
We updated the command line syntax of Docling v2 to support many formats. Examples are shown below.
# Convert a single file to Markdown (default)\ndocling myfile.pdf\n\n# Convert a single file to Markdown and JSON, without OCR\ndocling myfile.pdf --to json --to md --no-ocr\n\n# Convert PDF files in input directory to Markdown (default)\ndocling ./input/dir --from pdf\n\n# Convert PDF and Word files in input directory to Markdown and JSON\ndocling ./input/dir --from pdf --from docx --to md --to json --output ./scratch\n\n# Convert all supported files in input directory to Markdown, but abort on first error\ndocling ./input/dir --output ./scratch --abort-on-error\n
Notable changes from Docling v1:
The --from
and --to
arguments define the input and output formats, respectively.
--abort-on-error
will abort any batch conversion as soon as an error is encountered.
The --backend
option for PDFs was removed.
DocumentConverter
","text":"To accommodate many input formats, we changed the way you need to set up your DocumentConverter
object. You can now define a list of allowed formats on the DocumentConverter
initialization, and specify custom options per-format if desired. By default, all supported formats are allowed. If you don't provide format_options
, defaults will be used for all allowed_formats
.
Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend. They are provided as format-specific types, such as PdfFormatOption
or WordFormatOption
, as seen below.
from docling.document_converter import DocumentConverter\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import (\n DocumentConverter,\n PdfFormatOption,\n WordFormatOption,\n)\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\n\n## Default initialization still works as before:\n# doc_converter = DocumentConverter()\n\n\n# previous `PipelineOptions` is now `PdfPipelineOptions`\npipeline_options = PdfPipelineOptions()\npipeline_options.do_ocr = False\npipeline_options.do_table_structure = True\n#...\n\n## Custom options are now defined per format.\ndoc_converter = (\n DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[\n InputFormat.PDF,\n InputFormat.IMAGE,\n InputFormat.DOCX,\n InputFormat.HTML,\n InputFormat.PPTX,\n ], # whitelist formats, non-matching files are ignored.\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options, # pipeline options go here.\n backend=PyPdfiumDocumentBackend # optional: pick an alternative backend\n ),\n InputFormat.DOCX: WordFormatOption(\n pipeline_cls=SimplePipeline # default for office formats and HTML\n ),\n },\n )\n)\n
Note: If you work only with defaults, all remains the same as in Docling v1.
More options are shown in the following examples:
We have simplified the way you can feed input to the DocumentConverter
and renamed the conversion methods for better semantics. You can now call the conversion directly with a single file, or a list of input files, or DocumentStream
objects, without constructing a DocumentConversionInput
object first.
DocumentConverter.convert
now converts a single file input (previously DocumentConverter.convert_single
).DocumentConverter.convert_all
now converts many files at once (previously DocumentConverter.convert
)....\nfrom docling.datamodel.document import ConversionResult\n## Convert a single file (from URL or local path)\nconv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Convert several files at once:\n\ninput_files = [\n \"tests/data/html/wiki_duck.html\",\n \"tests/data/docx/word_sample.docx\",\n \"tests/data/docx/lorem_ipsum.docx\",\n \"tests/data/pptx/powerpoint_sample.pptx\",\n \"tests/data/2305.03393v1-pg9-img.png\",\n \"tests/data/pdf/2206.01062.pdf\",\n]\n\n# Directly pass list of files or streams to `convert_all`\nconv_results_iter = doc_converter.convert_all(input_files) # previously `convert`\n
Through the raises_on_error
argument, you can also control if the conversion should raise exceptions when first encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status. By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed). ...\nconv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False) # previously `convert`\n
"},{"location":"v2/#access-document-structures","title":"Access document structures","text":"We have simplified how you can access and export the converted document data, too. Our universal document representation is now available in conversion results as a DoclingDocument
object. DoclingDocument
provides a neat set of APIs to construct, iterate and export content in the document, as shown below.
conv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Inspect the converted document:\nconv_result.document.print_element_tree()\n\n## Iterate the elements in reading order, including hierarchy level:\nfor item, level in conv_result.document.iterate_items():\n if isinstance(item, TextItem):\n print(item.text)\n elif isinstance(item, TableItem):\n table_df: pd.DataFrame = item.export_to_dataframe()\n print(table_df.to_markdown())\n elif ...:\n #...\n
Note: While it is deprecated, you can still work with the Docling v1 document representation; it is available as:
conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type\n
"},{"location":"v2/#export-into-json-markdown-doctags","title":"Export into JSON, Markdown, Doctags","text":"Note: All render_...
methods in ConversionResult
have been removed in Docling v2, and are now available on DoclingDocument
as:
DoclingDocument.export_to_dict
DoclingDocument.export_to_markdown
DoclingDocument.export_to_document_tokens
conv_res: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Export to desired format:\nprint(json.dumps(conv_res.document.export_to_dict()))\nprint(conv_res.document.export_to_markdown())\nprint(conv_res.document.export_to_document_tokens())\n
Note: While it is deprecated, you can still export to the Docling v1 JSON format. This is available through the same methods as on the DoclingDocument
type:
## Export legacy document representation to desired format, for v1 compatibility:\nprint(json.dumps(conv_res.legacy_document.export_to_dict()))\nprint(conv_res.legacy_document.export_to_markdown())\nprint(conv_res.legacy_document.export_to_document_tokens())\n
"},{"location":"v2/#reload-a-doclingdocument-stored-as-json","title":"Reload a DoclingDocument
stored as JSON","text":"You can save and reload a DoclingDocument
to disk in JSON format using the following code:
# Save to disk:\ndoc: DoclingDocument = conv_res.document # produced from conversion result...\n\nwith Path(\"./doc.json\").open(\"w\") as fp:\n fp.write(json.dumps(doc.export_to_dict())) # use `export_to_dict` to ensure consistency\n\n# Load from disk:\nwith Path(\"./doc.json\").open(\"r\") as fp:\n doc_dict = json.loads(fp.read())\n doc = DoclingDocument.model_validate(doc_dict) # use standard pydantic API to populate doc\n
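Alternatively, recent docling-core versions expose a load_from_json convenience method on DoclingDocument (it is used in the chunking example later in these docs); a minimal sketch, assuming the file written above: from docling_core.types.doc.document import DoclingDocument\n\n# Reload the document saved above directly from its JSON file:\ndoc = DoclingDocument.load_from_json(\"./doc.json\")\n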
"},{"location":"v2/#chunking","title":"Chunking","text":"Docling v2 defines new base classes for chunking:
BaseMeta
for chunk metadataBaseChunk
containing the chunk text and metadata, andBaseChunker
for chunkers, producing chunks out of a DoclingDocument
.Additionally, it provides an updated HierarchicalChunker
implementation, which leverages the new DoclingDocument
and provides a new, richer chunk output format, including the respective document items and relevant metadata such as headings and captions.
For an example, check out Chunking usage.
"},{"location":"concepts/","title":"Concepts","text":"Use the navigation on the left to browse through some core Docling concepts.
"},{"location":"concepts/architecture/","title":"Architecture","text":"In a nutshell, Docling's architecture is outlined in the diagram above.
For each document format, the document converter knows which format-specific backend to employ for parsing the document and which pipeline to use for orchestrating the execution, along with any relevant options.
Tip
While the document converter holds a default mapping, this configuration is parametrizable, so e.g. for the PDF format, different backends and different pipeline options can be used \u2014 see Usage.
The conversion result contains the Docling document, Docling's fundamental document representation.
Some typical scenarios for using a Docling document include directly calling its export methods, such as for markdown, dictionary etc., or having it serialized by a serializer or chunked by a chunker.
For more details on Docling's architecture, check out the Docling Technical Report.
Note
The components illustrated with dashed outline indicate base classes that can be subclassed for specialized implementations.
"},{"location":"concepts/chunking/","title":"Chunking","text":""},{"location":"concepts/chunking/#introduction","title":"Introduction","text":"Chunking approaches
Starting from a DoclingDocument
, there are in principle two possible chunking approaches:
DoclingDocument
to Markdown (or similar format) and then performing user-defined chunking as a post-processing step, orDoclingDocument
This page is about the latter, i.e. using native Docling chunkers. For an example of using approach (1) check out e.g. this recipe looking at the Markdown export mode.
A chunker is a Docling abstraction that, given a DoclingDocument
, returns a stream of chunks, each of which captures some part of the document as a string accompanied by respective metadata.
To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a chunker class hierarchy, providing a base type, BaseChunker
, as well as specific subclasses.
Docling integration with gen AI frameworks like LlamaIndex is done using the BaseChunker
interface, so users can easily plug in any built-in, self-defined, or third-party BaseChunker
implementation.
The BaseChunker
base class API defines that any chunker should provide the following:
def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]
: Returning the chunks for the provided document.
def contextualize(self, chunk: BaseChunk) -> str
: Returning the potentially metadata-enriched serialization of the chunk, typically used to feed an embedding model (or generation model).
To access HybridChunker
If you are using the docling
package, you can import as follows: from docling.chunking import HybridChunker\n
If you are only using the docling-core
package, make sure to install the chunking
extra if you want to use HuggingFace tokenizers, e.g. pip install 'docling-core[chunking]'\n
or the chunking-openai
extra if you prefer Open AI tokenizers (tiktoken), e.g. pip install 'docling-core[chunking-openai]'\n
and then you can import as follows: from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\n
The HybridChunker
implementation uses a hybrid approach, applying tokenization-aware refinements on top of document-based hierarchical chunking.
More precisely, starting from the hierarchical chunks, it splits oversized chunks based on the tokenizer limits and merges undersized peer chunks; the merging behavior is controlled via the merge_peers
parameter (by default True).
\ud83d\udc49 Usage examples:
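As a minimal, self-contained sketch (the arXiv URL is the same sample used elsewhere in these docs; any supported input works): from docling.chunking import HybridChunker\nfrom docling.document_converter import DocumentConverter\n\nconv_res = DocumentConverter().convert(\"https://arxiv.org/pdf/2408.09869\")\ndoc = conv_res.document\n\nchunker = HybridChunker() # default tokenizer; a custom one can be passed as in the examples\nfor chunk in chunker.chunk(dl_doc=doc):\n    # contextualize() enriches the chunk text with its headings and captions\n    print(chunker.contextualize(chunk=chunk))\n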
The HierarchicalChunker
implementation uses the document structure information from the DoclingDocument
to create one chunk for each individual detected document element, by default only merging together list items (can be opted out via param merge_list_items
). It also takes care of attaching all relevant document metadata, including headers and captions.
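A minimal sketch of using it directly is shown below (the conversion line is only there to obtain a DoclingDocument; merge_list_items is the parameter mentioned above): from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker\n\nfrom docling.document_converter import DocumentConverter\n\ndoc = DocumentConverter().convert(\"https://arxiv.org/pdf/2408.09869\").document\n\nchunker = HierarchicalChunker(merge_list_items=True) # the default behavior\nfor chunk in chunker.chunk(dl_doc=doc):\n    print(chunker.contextualize(chunk=chunk))\n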
Confidence grades were introduced in v2.34.0 to help users understand how well a conversion performed and guide decisions about post-processing workflows. They are available in the confidence
field of the ConversionResult
object returned by the document converter.
Complex layouts, poor scan quality, or challenging formatting can lead to suboptimal document conversion results that may require additional attention or alternative conversion pipelines.
Confidence scores provide a quantitative assessment of document conversion quality. Each confidence report includes a numerical score (0.0 to 1.0) measuring conversion accuracy, and a quality grade (poor, fair, good, excellent) for quick interpretation.
Focus on quality grades!
Users can and should safely focus on the document-level grade fields \u2014 mean_grade
and low_grade
\u2014 to assess overall conversion quality. Numerical scores are used internally and are for informational purposes only; their computation and weighting may change in the future.
Use cases for confidence grades include:
A confidence report contains scores and grades:
POOR
FAIR
GOOD
EXCELLENT
Each confidence report includes four component scores and grades:
layout_score
: Overall quality of document element recognition ocr_score
: Quality of OCR-extracted contentparse_score
: 10th percentile score of digital text cells (emphasizes problem areas)table_score
: Table extraction quality (not yet implemented)Two aggregate grades provide overall document quality assessment:
mean_grade
: Average of the four component scoreslow_grade
: 5th percentile score (highlights worst-performing areas)
Confidence grades are calculated at two levels: per page, with the page-level reports stored in the pages
field, and for the whole document, with the aggregate values stored in the overall ConfidenceReport.
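A minimal sketch of reading the grades after a conversion (field names as described above; requires Docling v2.34.0 or later): from docling.document_converter import DocumentConverter\n\nconv_res = DocumentConverter().convert(\"https://arxiv.org/pdf/2408.09869\")\n\nconfidence = conv_res.confidence # the document-level confidence report\nprint(f\"mean grade: {confidence.mean_grade}\")\nprint(f\"low grade: {confidence.low_grade}\")\n# per-page reports are stored in the `pages` field of the report\n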
With Docling v2, we introduce a unified document representation format called DoclingDocument
. It is defined as a pydantic datatype, which can express several features common to documents, such as:
The definition of the Pydantic types is implemented in the module docling_core.types.doc
, more details in source code definitions.
It also brings a set of document construction APIs to build up a DoclingDocument
from scratch.
To illustrate the features of the DoclingDocument
format, in the subsections below we consider the DoclingDocument
converted from tests/data/word_sample.docx
and we present some side-by-side comparisons, where the left side shows snippets from the converted document serialized as YAML and the right one shows the corresponding parts of the original MS Word.
A DoclingDocument
exposes top-level fields for the document content, organized in two categories. The first category is the content items, which are stored in these fields:
texts
: All items that have a text representation (paragraph, section heading, equation, ...). Base class is TextItem
.tables
: All tables, type TableItem
. Can carry structure annotations.pictures
: All pictures, type PictureItem
. Can carry structure annotations.key_value_items
: All key-value items.All of the above fields are lists and store items inheriting from the DocItem
type. They can express different data structures depending on their type, and reference parents and children through JSON pointers.
The second category is content structure, which is encapsulated in:
body
: The root node of a tree-structure for the main document bodyfurniture
: The root node of a tree-structure for all items that don't belong into the body (headers, footers, ...)groups
: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter)All of the above fields are only storing NodeItem
instances, which reference children and parents through JSON pointers.
The reading order of the document is encapsulated through the body
tree and the order of children in each item in the tree.
For example, all items on the first page are nested below the title
item (#/texts/1
).
Similarly, all items under the heading \"Let's swim\" (#/texts/5
) are nested as children. The children of \"Let's swim\" are both text items and groups, which contain the list elements. The group items are stored in the top-level groups
field.
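As a small illustrative sketch of these fields (the path is the sample file from this section, relative to the Docling repository, and the cref attribute used to print a child reference's JSON pointer is an assumption about the installed docling-core version): from docling.document_converter import DocumentConverter\n\ndoc = DocumentConverter().convert(\"tests/data/word_sample.docx\").document\n\n# content items live in typed top-level lists\nprint(len(doc.texts), len(doc.tables), len(doc.pictures))\n\n# content structure: the body tree references its children via JSON pointers\nfor child in doc.body.children:\n    print(child.cref) # e.g. \"#/texts/1\"\n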
Docling can be extended with third-party plugins which extend the choice of options available in several steps of the pipeline.
Plugins are loaded via the pluggy system, which allows third-party developers to register new capabilities using a setuptools entrypoint.
The actual entrypoint definition might vary, depending on the packaging system you are using. Here are a few examples:
pyproject.tomlpoetry v1 pyproject.tomlsetup.cfgsetup.py[project.entry-points.\"docling\"]\nyour_plugin_name = \"your_package.module\"\n
[tool.poetry.plugins.\"docling\"]\nyour_plugin_name = \"your_package.module\"\n
[options.entry_points]\ndocling =\n your_plugin_name = your_package.module\n
from setuptools import setup\n\nsetup(\n # ...,\n entry_points = {\n 'docling': [\n 'your_plugin_name = \"your_package.module\"'\n ]\n }\n)\n
your_plugin_name
is the name you choose for your plugin. It must be unique across the broader Docling ecosystem.
your_package.module
is the reference to the module in your package which is responsible for the plugin registration.
The OCR factory allows providing additional OCR engines to Docling users.
The content of your_package.module
registers the OCR engines with code similar to:
# Factory registration\ndef ocr_engines():\n return {\n \"ocr_engines\": [\n YourOcrModel,\n ]\n }\n
where YourOcrModel
must implement the BaseOcrModel
and provide an options class derived from OcrOptions
.
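As a hedged sketch only (the class and module names on the plugin side are illustrative, and the exact abstract methods to implement should be checked against the default Docling plugin referenced below): # your_package/module.py (illustrative)\nfrom docling.datamodel.pipeline_options import OcrOptions\nfrom docling.models.base_ocr_model import BaseOcrModel\n\n\nclass YourOcrOptions(OcrOptions):\n    kind: str = \"your_ocr\" # assumption: engines are distinguished by their `kind` value\n\n\nclass YourOcrModel(BaseOcrModel):\n    # implement the abstract methods of BaseOcrModel here\n    # (see the default Docling plugin for the expected signatures)\n    ...\n\n\n# factory registration, as shown above\ndef ocr_engines():\n    return {\"ocr_engines\": [YourOcrModel]}\n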
If you are looking for an example, the default Docling plugins are a good starting point.
"},{"location":"concepts/plugins/#third-party-plugins","title":"Third-party plugins","text":"When the plugin is not provided by the main docling
package but by a third-party package, it has to be enabled explicitly via the allow_external_plugins
option.
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions()\npipeline_options.allow_external_plugins = True # <-- enabled the external plugins\npipeline_options.ocr_options = YourOptions # <-- your options here\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options\n )\n }\n)\n
"},{"location":"concepts/plugins/#using-the-docling-cli","title":"Using the docling
CLI","text":"Similarly, when using the docling
CLI, users have to enable external plugins before selecting the new one.
# Show the external plugins\ndocling --show-external-plugins\n\n# Run docling with the new plugin\ndocling --allow-external-plugins --ocr-engine=NAME\n
"},{"location":"concepts/serialization/","title":"Serialization","text":""},{"location":"concepts/serialization/#introduction","title":"Introduction","text":"A document serializer (AKA simply serializer) is a Docling abstraction that is initialized with a given DoclingDocument
and returns a textual representation for that document.
Besides the document serializer, Docling defines similar abstractions for several document subcomponents, for example: text serializer, table serializer, picture serializer, list serializer, inline serializer, and more.
Last but not least, a serializer provider is a wrapper that abstracts the document serialization strategy from the document instance.
"},{"location":"concepts/serialization/#base-classes","title":"Base classes","text":"To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a serialization class hierarchy, providing:
BaseDocSerializer
, as well as BaseTextSerializer
, BaseTableSerializer
etc, and BaseSerializerProvider
, andMarkdownDocSerializer
.You can review all methods required to define the above base classes here.
From a client perspective, the most relevant is BaseDocSerializer.serialize()
, which returns the textual representation,\u00a0as well as relevant metadata on which document components contributed to that serialization.
DoclingDocument
export methods","text":"Docling provides predefined serializers for Markdown, HTML, and DocTags.
The respective DoclingDocument
export methods (e.g. export_to_markdown()
) are provided as user shorthands \u2014 internally directly instantiating and delegating to respective serializers.
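For instance, a minimal sketch of using the Markdown serializer directly on a DoclingDocument (equivalent in spirit to calling export_to_markdown()): from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\nfrom docling_core.types.doc.document import DoclingDocument\n\ndoc = DoclingDocument.load_from_json(\"./doc.json\") # any DoclingDocument instance\n\nserializer = MarkdownDocSerializer(doc=doc)\nser_result = serializer.serialize()\nprint(ser_result.text) # the textual representation described above\n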
For an example showcasing how to use serializers, see here.
"},{"location":"examples/","title":"Examples","text":"Use the navigation on the left to browse through examples covering a range of possible workflows and use cases.
"},{"location":"examples/advanced_chunking_and_serialization/","title":"Advanced chunking & serialization","text":"In this notebook we show how to customize the serialization strategies that come into play during chunking.
We will work with a document that contains some picture annotations:
In\u00a0[1]: Copied!from docling_core.types.doc.document import DoclingDocument\n\nSOURCE = \"./data/2408.09869v3_enriched.json\"\n\ndoc = DoclingDocument.load_from_json(SOURCE)\nfrom docling_core.types.doc.document import DoclingDocument SOURCE = \"./data/2408.09869v3_enriched.json\" doc = DoclingDocument.load_from_json(SOURCE)
Below we define the chunker (for more details check out Hybrid Chunking):
In\u00a0[2]: Copied!from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\nfrom docling_core.transforms.chunker.tokenizer.base import BaseTokenizer\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom transformers import AutoTokenizer\n\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n\ntokenizer: BaseTokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n)\nchunker = HybridChunker(tokenizer=tokenizer)\nfrom docling_core.transforms.chunker.hybrid_chunker import HybridChunker from docling_core.transforms.chunker.tokenizer.base import BaseTokenizer from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" tokenizer: BaseTokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), ) chunker = HybridChunker(tokenizer=tokenizer) In\u00a0[3]: Copied!
print(f\"{tokenizer.get_max_tokens()=}\")\nprint(f\"{tokenizer.get_max_tokens()=}\")
tokenizer.get_max_tokens()=512\n
Defining some helper methods:
In\u00a0[4]: Copied!from typing import Iterable, Optional\n\nfrom docling_core.transforms.chunker.base import BaseChunk\nfrom docling_core.transforms.chunker.hierarchical_chunker import DocChunk\nfrom docling_core.types.doc.labels import DocItemLabel\nfrom rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console(\n width=200, # for getting Markdown tables rendered nicely\n)\n\n\ndef find_n_th_chunk_with_label(\n iter: Iterable[BaseChunk], n: int, label: DocItemLabel\n) -> Optional[DocChunk]:\n num_found = -1\n for i, chunk in enumerate(iter):\n doc_chunk = DocChunk.model_validate(chunk)\n for it in doc_chunk.meta.doc_items:\n if it.label == label:\n num_found += 1\n if num_found == n:\n return i, chunk\n return None, None\n\n\ndef print_chunk(chunks, chunk_pos):\n chunk = chunks[chunk_pos]\n ctx_text = chunker.contextualize(chunk=chunk)\n num_tokens = tokenizer.count_tokens(text=ctx_text)\n doc_items_refs = [it.self_ref for it in chunk.meta.doc_items]\n title = f\"{chunk_pos=} {num_tokens=} {doc_items_refs=}\"\n console.print(Panel(ctx_text, title=title))\nfrom typing import Iterable, Optional from docling_core.transforms.chunker.base import BaseChunk from docling_core.transforms.chunker.hierarchical_chunker import DocChunk from docling_core.types.doc.labels import DocItemLabel from rich.console import Console from rich.panel import Panel console = Console( width=200, # for getting Markdown tables rendered nicely ) def find_n_th_chunk_with_label( iter: Iterable[BaseChunk], n: int, label: DocItemLabel ) -> Optional[DocChunk]: num_found = -1 for i, chunk in enumerate(iter): doc_chunk = DocChunk.model_validate(chunk) for it in doc_chunk.meta.doc_items: if it.label == label: num_found += 1 if num_found == n: return i, chunk return None, None def print_chunk(chunks, chunk_pos): chunk = chunks[chunk_pos] ctx_text = chunker.contextualize(chunk=chunk) num_tokens = tokenizer.count_tokens(text=ctx_text) doc_items_refs = [it.self_ref for it in chunk.meta.doc_items] title = f\"{chunk_pos=} {num_tokens=} {doc_items_refs=}\" console.print(Panel(ctx_text, title=title))
Below we inspect the first chunk containing a table \u2014 using the default serialization strategy:
In\u00a0[5]: Copied!chunker = HybridChunker(tokenizer=tokenizer)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nchunker = HybridChunker(tokenizer=tokenizer) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE) print_chunk( chunks=chunks, chunk_pos=i, )
Token indices sequence length is longer than the specified maximum sequence length for this model (652 > 512). Running this sequence through the model will result in indexing errors\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=13 num_tokens=426 doc_items_refs=['#/texts/72', '#/tables/0'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 4 Performance \u2502\n\u2502 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u2502\n\u2502 (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. \u2502\n\u2502 \u2502\n\u2502 Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max, \u2502\n\u2502 pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 \u2502\n\u2502 cores) Intel(R) Xeon E5-2690, native backend.TTS = 375 s 244 s. (16 cores) Intel(R) Xeon E5-2690, native backend.Pages/s = 0.60 0.92. (16 cores) Intel(R) Xeon E5-2690, native backend.Mem = 6.16 \u2502\n\u2502 GB. (16 cores) Intel(R) Xeon E5-2690, pypdfium backend.TTS = 239 s 143 s. (16 cores) Intel(R) Xeon E5-2690, pypdfium backend.Pages/s = 0.94 1.57. (16 cores) Intel(R) Xeon E5-2690, pypdfium \u2502\n\u2502 backend.Mem = 2.42 GB \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nINFO: As you see above, using the
HybridChunker
can sometimes lead to a warning from the transformers library; however, this is a \"false alarm\" \u2014 for details check here. We can configure a different serialization strategy. In the example below, we specify a different table serializer that serializes tables to Markdown instead of the triplet notation used by default:
In\u00a0[6]: Copied!from docling_core.transforms.chunker.hierarchical_chunker import (\n ChunkingDocSerializer,\n ChunkingSerializerProvider,\n)\nfrom docling_core.transforms.serializer.markdown import MarkdownTableSerializer\n\n\nclass MDTableSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n table_serializer=MarkdownTableSerializer(), # configuring a different table serializer\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=MDTableSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nfrom docling_core.transforms.chunker.hierarchical_chunker import ( ChunkingDocSerializer, ChunkingSerializerProvider, ) from docling_core.transforms.serializer.markdown import MarkdownTableSerializer class MDTableSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, table_serializer=MarkdownTableSerializer(), # configuring a different table serializer ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=MDTableSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=13 num_tokens=431 doc_items_refs=['#/texts/72', '#/tables/0'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 4 Performance \u2502\n\u2502 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u2502\n\u2502 (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. \u2502\n\u2502 \u2502\n\u2502 | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | \u2502\n\u2502 |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| \u2502\n\u2502 | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | \u2502\n\u2502 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | \u2502\n\u2502 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Below we inspect the first chunk containing a picture.
Even when using the default strategy, we can modify the relevant parameters, e.g. which placeholder is used for pictures:
In\u00a0[7]: Copied!from docling_core.transforms.serializer.markdown import MarkdownParams\n\n\nclass ImgPlaceholderSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n params=MarkdownParams(\n image_placeholder=\"<!-- image -->\",\n ),\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=ImgPlaceholderSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nfrom docling_core.transforms.serializer.markdown import MarkdownParams class ImgPlaceholderSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, params=MarkdownParams( image_placeholder=\"\", ), ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=ImgPlaceholderSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=0 num_tokens=117 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 <!-- image --> \u2502\n\u2502 Version 1.0 \u2502\n\u2502 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta \u2502\n\u2502 Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar \u2502\n\u2502 AI4K Group, IBM Research R\u00a8 uschlikon, Switzerland \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Below we define and use our custom picture serialization strategy which leverages picture annotations:
In\u00a0[8]: Copied!from typing import Any\n\nfrom docling_core.transforms.serializer.base import (\n BaseDocSerializer,\n SerializationResult,\n)\nfrom docling_core.transforms.serializer.common import create_ser_result\nfrom docling_core.transforms.serializer.markdown import MarkdownPictureSerializer\nfrom docling_core.types.doc.document import (\n PictureClassificationData,\n PictureDescriptionData,\n PictureItem,\n PictureMoleculeData,\n)\nfrom typing_extensions import override\n\n\nclass AnnotationPictureSerializer(MarkdownPictureSerializer):\n @override\n def serialize(\n self,\n *,\n item: PictureItem,\n doc_serializer: BaseDocSerializer,\n doc: DoclingDocument,\n **kwargs: Any,\n ) -> SerializationResult:\n text_parts: list[str] = []\n for annotation in item.annotations:\n if isinstance(annotation, PictureClassificationData):\n predicted_class = (\n annotation.predicted_classes[0].class_name\n if annotation.predicted_classes\n else None\n )\n if predicted_class is not None:\n text_parts.append(f\"Picture type: {predicted_class}\")\n elif isinstance(annotation, PictureMoleculeData):\n text_parts.append(f\"SMILES: {annotation.smi}\")\n elif isinstance(annotation, PictureDescriptionData):\n text_parts.append(f\"Picture description: {annotation.text}\")\n\n text_res = \"\\n\".join(text_parts)\n text_res = doc_serializer.post_process(text=text_res)\n return create_ser_result(text=text_res, span_source=item)\nfrom typing import Any from docling_core.transforms.serializer.base import ( BaseDocSerializer, SerializationResult, ) from docling_core.transforms.serializer.common import create_ser_result from docling_core.transforms.serializer.markdown import MarkdownPictureSerializer from docling_core.types.doc.document import ( PictureClassificationData, PictureDescriptionData, PictureItem, PictureMoleculeData, ) from typing_extensions import override class AnnotationPictureSerializer(MarkdownPictureSerializer): @override def serialize( self, *, item: PictureItem, doc_serializer: BaseDocSerializer, doc: DoclingDocument, **kwargs: Any, ) -> SerializationResult: text_parts: list[str] = [] for annotation in item.annotations: if isinstance(annotation, PictureClassificationData): predicted_class = ( annotation.predicted_classes[0].class_name if annotation.predicted_classes else None ) if predicted_class is not None: text_parts.append(f\"Picture type: {predicted_class}\") elif isinstance(annotation, PictureMoleculeData): text_parts.append(f\"SMILES: {annotation.smi}\") elif isinstance(annotation, PictureDescriptionData): text_parts.append(f\"Picture description: {annotation.text}\") text_res = \"\\n\".join(text_parts) text_res = doc_serializer.post_process(text=text_res) return create_ser_result(text=text_res, span_source=item) In\u00a0[9]: Copied!
class ImgAnnotationSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc: DoclingDocument):\n return ChunkingDocSerializer(\n doc=doc,\n picture_serializer=AnnotationPictureSerializer(), # configuring a different picture serializer\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=ImgAnnotationSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nclass ImgAnnotationSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc: DoclingDocument): return ChunkingDocSerializer( doc=doc, picture_serializer=AnnotationPictureSerializer(), # configuring a different picture serializer ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=ImgAnnotationSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=0 num_tokens=128 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 Picture description: In this image we can see a cartoon image of a duck holding a paper. \u2502\n\u2502 Version 1.0 \u2502\n\u2502 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta \u2502\n\u2502 Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar \u2502\n\u2502 AI4K Group, IBM Research R\u00a8 uschlikon, Switzerland \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/advanced_chunking_and_serialization/#advanced-chunking-serialization","title":"Advanced chunking & serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#overview","title":"Overview\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#setup","title":"Setup\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#table-serialization","title":"Table serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-the-default-strategy","title":"Using the default strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#configuring-a-different-strategy","title":"Configuring a different strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#picture-serialization","title":"Picture serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-the-default-strategy","title":"Using the default strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-a-custom-strategy","title":"Using a custom strategy\u00b6","text":""},{"location":"examples/backend_csv/","title":"Conversion of CSV files","text":"In\u00a0[59]: Copied!
from pathlib import Path\n\nfrom docling.document_converter import DocumentConverter\n\n# Convert CSV to Docling document\nconverter = DocumentConverter()\nresult = converter.convert(Path(\"../../tests/data/csv/csv-comma.csv\"))\noutput = result.document.export_to_markdown()\nfrom pathlib import Path from docling.document_converter import DocumentConverter # Convert CSV to Docling document converter = DocumentConverter() result = converter.convert(Path(\"../../tests/data/csv/csv-comma.csv\")) output = result.document.export_to_markdown()
This code generates the following output:
| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
| 2 | 1Ef7b82A4CAAD10 | Preston | Lozano, Dr | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |
| 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)978-3969x58947 | beckycarr@hogan.com | 2020-03-25 | http://www.lawrence.com/ |
| 4 | 5Cef8BFA16c5e3c | Linda | Olsen | Dominguez, Mcmillan and Donovan | Bensonview | Dominican Republic | 001-808-617-6467x12895 | +1-813-324-8756 | stanleyblackwell@benson.org | 2020-06-02 | http://www.good-lyons.com/ |
| 5 | 053d585Ab6b3159 | Joanna | Bender | Martin, Lang and Andrade | West Priscilla | Slovakia (Slovak Republic) | 001-234-203-0635x76146 | 001-199-446-3860x3486 | colinalvarado@miles.net | 2021-04-17 | https://goodwin-ingram.com/ |
"},{"location":"examples/backend_csv/#conversion-of-csv-files","title":"Conversion of CSV files\u00b6","text":"This example shows how to convert CSV files to a structured Docling Document.
,
;
|
[tab]
This is an example of using Docling for converting structured data (XML) into a unified document representation format, DoclingDocument
, and leveraging its rich structured content for RAG applications.
Data used in this example consist of patents from the United States Patent and Trademark Office (USPTO) and medical articles from PubMed Central\u00ae (PMC).
In this notebook, we accomplish the following:
For more details on document chunking with Docling, refer to the Chunking documentation. For RAG with Docling and LlamaIndex, also check the example RAG with LlamaIndex.
In\u00a0[1]: Copied!from docling.document_converter import DocumentConverter\n\n# a sample PMC article:\nsource = \"../../tests/data/jats/elife-56337.nxml\"\nconverter = DocumentConverter()\nresult = converter.convert(source)\nprint(result.status)\nfrom docling.document_converter import DocumentConverter # a sample PMC article: source = \"../../tests/data/jats/elife-56337.nxml\" converter = DocumentConverter() result = converter.convert(source) print(result.status)
ConversionStatus.SUCCESS\n
Once the document is converted, it can be exported to any format supported by Docling. For instance, to markdown (showing here the first lines only):
In\u00a0[2]: Copied!md_doc = result.document.export_to_markdown()\n\ndelim = \"\\n\"\nprint(delim.join(md_doc.split(delim)[:8]))\nmd_doc = result.document.export_to_markdown() delim = \"\\n\" print(delim.join(md_doc.split(delim)[:8]))
# KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage\n\nGernot Wolf, Alberto de Iaco, Ming-An Sun, Melania Bruno, Matthew Tinkham, Don Hoang, Apratim Mitra, Sherry Ralls, Didier Trono, Todd S Macfarlan\n\nThe Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health, Bethesda, United States; School of Life Sciences, \u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Lausanne, Switzerland\n\n## Abstract\n\n
If the XML file is not supported, a ConversionError
message will be raised.
from io import BytesIO\n\nfrom docling.datamodel.base_models import DocumentStream\nfrom docling.exceptions import ConversionError\n\nxml_content = (\n b'<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE docling_test SYSTEM '\n b'\"test.dtd\"><docling>Random content</docling>'\n)\nstream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content))\ntry:\n result = converter.convert(stream)\nexcept ConversionError as ce:\n print(ce)\nfrom io import BytesIO from docling.datamodel.base_models import DocumentStream from docling.exceptions import ConversionError xml_content = ( b' Random content' ) stream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content)) try: result = converter.convert(stream) except ConversionError as ce: print(ce)
Input document docling_test.xml does not match any allowed format.\n
File format not allowed: docling_test.xml\n
You can always refer to the Usage documentation page for a list of supported formats.
Requirements can be installed as shown below. The --no-warn-conflicts
argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage.
%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv
Note: you may need to restart the kernel to use updated packages.\n
This notebook uses HuggingFace's Inference API. For an increased LLM quota, a token can be provided via the environment variable HF_TOKEN
.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
In\u00a0[5]: Copied!import os\nfrom warnings import filterwarnings\n\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\nfilterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\nimport os from warnings import filterwarnings from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")
We can now define the main parameters:
In\u00a0[6]: Copied!from pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\nEMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\"\nEMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)\nTEMP_DIR = Path(mkdtemp())\nMILVUS_URI = str(TEMP_DIR / \"docling.db\")\nGEN_MODEL = HuggingFaceInferenceAPI(\n token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n)\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\nfrom pathlib import Path from tempfile import mkdtemp from llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI EMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\" EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID) TEMP_DIR = Path(mkdtemp()) MILVUS_URI = str(TEMP_DIR / \"docling.db\") GEN_MODEL = HuggingFaceInferenceAPI( token=_get_env_from_colab_or_os(\"HF_TOKEN\"), model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", ) embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\")) # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"
In this notebook we will use XML data from collections supported by Docling:
.tar.gz
files. Each file contains the full article data in XML format, among other supplementary files like images or spreadsheets.The raw files will be downloaded form the source and saved in a temporary directory.
In\u00a0[7]: Copied!import tarfile\nfrom io import BytesIO\n\nimport requests\n\n# PMC article PMC11703268\nurl: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\"\n\nprint(f\"Downloading {url}...\")\nbuf = BytesIO(requests.get(url).content)\nprint(\"Extracting and storing the XML file containing the article text...\")\nwith tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file:\n for tarinfo in tar_file:\n if tarinfo.isreg():\n file_path = Path(tarinfo.name)\n if file_path.suffix == \".nxml\":\n with open(TEMP_DIR / file_path.name, \"wb\") as file_obj:\n file_obj.write(tar_file.extractfile(tarinfo).read())\n print(f\"Stored XML file {file_path.name}\")\nimport tarfile from io import BytesIO import requests # PMC article PMC11703268 url: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\" print(f\"Downloading {url}...\") buf = BytesIO(requests.get(url).content) print(\"Extracting and storing the XML file containing the article text...\") with tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file: for tarinfo in tar_file: if tarinfo.isreg(): file_path = Path(tarinfo.name) if file_path.suffix == \".nxml\": with open(TEMP_DIR / file_path.name, \"wb\") as file_obj: file_obj.write(tar_file.extractfile(tarinfo).read()) print(f\"Stored XML file {file_path.name}\")
Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz...\nExtracting and storing the XML file containing the article text...\nStored XML file nihpp-2024.12.26.630351v1.nxml\nIn\u00a0[8]: Copied!
import zipfile\n\n# Patent grants from December 17-23, 2024\nurl: str = (\n \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\"\n)\nXML_SPLITTER: str = '<?xml version=\"1.0\"'\ndoc_num: int = 0\n\nprint(f\"Downloading {url}...\")\nbuf = BytesIO(requests.get(url).content)\nprint(\"Parsing zip file, splitting into XML sections, and exporting to files...\")\nwith zipfile.ZipFile(buf) as zf:\n res = zf.testzip()\n if res:\n print(\"Error validating zip file\")\n else:\n with zf.open(zf.namelist()[0]) as xf:\n is_patent = False\n patent_buffer = BytesIO()\n for xf_line in xf:\n decoded_line = xf_line.decode(errors=\"ignore\").rstrip()\n xml_index = decoded_line.find(XML_SPLITTER)\n if xml_index != -1:\n if (\n xml_index > 0\n ): # cases like </sequence-cwu><?xml version=\"1.0\"...\n patent_buffer.write(xf_line[:xml_index])\n patent_buffer.write(b\"\\r\\n\")\n xf_line = xf_line[xml_index:]\n if patent_buffer.getbuffer().nbytes > 0 and is_patent:\n doc_num += 1\n patent_id = f\"ipg241217-{doc_num}\"\n with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj:\n file_obj.write(patent_buffer.getbuffer())\n is_patent = False\n patent_buffer = BytesIO()\n elif decoded_line.startswith(\"<!DOCTYPE\"):\n is_patent = True\n patent_buffer.write(xf_line)\nimport zipfile # Patent grants from December 17-23, 2024 url: str = ( \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\" ) XML_SPLITTER: str = ' 0 ): # cases like 0 and is_patent: doc_num += 1 patent_id = f\"ipg241217-{doc_num}\" with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj: file_obj.write(patent_buffer.getbuffer()) is_patent = False patent_buffer = BytesIO() elif decoded_line.startswith(\"
Downloading https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip...\nParsing zip file, splitting into XML sections, and exporting to files...\nIn\u00a0[9]: Copied!
print(f\"Fetched and exported {doc_num} documents.\")\nprint(f\"Fetched and exported {doc_num} documents.\")
Fetched and exported 4014 documents.\nIn\u00a0[11]: Copied!
from tqdm.notebook import tqdm\n\nfrom docling.backend.xml.jats_backend import JatsDocumentBackend\nfrom docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import InputDocument\n\n# check PMC\nin_doc = InputDocument(\n path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\",\n format=InputFormat.XML_JATS,\n backend=JatsDocumentBackend,\n)\nbackend = JatsDocumentBackend(\n in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"\n)\nprint(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\")\n\n# check USPTO\nin_doc = InputDocument(\n path_or_stream=TEMP_DIR / \"ipg241217-1.xml\",\n format=InputFormat.XML_USPTO,\n backend=PatentUsptoDocumentBackend,\n)\nbackend = PatentUsptoDocumentBackend(\n in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\"\n)\nprint(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\")\n\npatent_valid = 0\npbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num)\nfor in_path in pbar:\n in_doc = InputDocument(\n path_or_stream=in_path,\n format=InputFormat.XML_USPTO,\n backend=PatentUsptoDocumentBackend,\n )\n backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)\n patent_valid += int(backend.is_valid())\n\nprint(f\"Found {patent_valid} patents out of {doc_num} XML files.\")\nfrom tqdm.notebook import tqdm from docling.backend.xml.jats_backend import JatsDocumentBackend from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend from docling.datamodel.base_models import InputFormat from docling.datamodel.document import InputDocument # check PMC in_doc = InputDocument( path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\", format=InputFormat.XML_JATS, backend=JatsDocumentBackend, ) backend = JatsDocumentBackend( in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\" ) print(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\") # check USPTO in_doc = InputDocument( path_or_stream=TEMP_DIR / \"ipg241217-1.xml\", format=InputFormat.XML_USPTO, backend=PatentUsptoDocumentBackend, ) backend = PatentUsptoDocumentBackend( in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\" ) print(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\") patent_valid = 0 pbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num) for in_path in pbar: in_doc = InputDocument( path_or_stream=in_path, format=InputFormat.XML_USPTO, backend=PatentUsptoDocumentBackend, ) backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path) patent_valid += int(backend.is_valid()) print(f\"Found {patent_valid} patents out of {doc_num} XML files.\")
Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True\nDocument ipg241217-1.xml is a valid patent? True\n
0%| | 0/4014 [00:00<?, ?it/s]
Found 3928 patents out of 4014 XML files.\n
Calling the function convert()
will convert the input document into a DoclingDocument
doc = backend.convert()\n\nclaims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\")\nprint(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims')\ndoc = backend.convert() claims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\") print(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims')
Patent \"Semiconductor package\" has 19 claims\n
\u270f\ufe0f Tip: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic DocumentConverter
object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in Simple Conversion is the recommended usage for the supported XML files.
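For instance, a minimal sketch of that recommended usage, reusing one of the patent files exported above (the Markdown preview at the end is only for illustration):
from docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\n# The input format and the matching backend are guessed from the file itself\nresult = converter.convert(TEMP_DIR / \"ipg241217-1.xml\")\nprint(result.document.export_to_markdown()[:200])\n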
The DoclingDocument
format of the converted patents has a rich hierarchical structure, inherited from the original XML document and preserved by the Docling custom backend. In this notebook, we will leverage:
SimpleDirectoryReader
pattern to iterate over the exported XML files created in section Fetch the data. The LlamaIndex extensions DoclingReader
and DoclingNodeParser
, to ingest the patent chunks into a Milvus vector store. The HierarchicalChunker
implementation, which applies a document-based hierarchical chunking to leverage the patent structures, like sections and paragraphs within sections (a standalone sketch of this chunker follows below). Refer to other possible implementations and usage patterns in the Chunking documentation and the RAG with LlamaIndex notebook.
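As referenced above, here is a standalone sketch of the hierarchical chunker applied to the patent document doc converted earlier; the import path follows docling-core, and the chunk attributes used here (text) should be treated as assumptions if your version differs:
from docling_core.transforms.chunker import HierarchicalChunker\n\nchunker = HierarchicalChunker()\n# doc is the DoclingDocument obtained from backend.convert() above\nfor chunk in chunker.chunk(dl_doc=doc):\n    print(chunk.text[:80])\n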
In\u00a0[13]: Copied!from llama_index.core import SimpleDirectoryReader\nfrom llama_index.readers.docling import DoclingReader\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\ndir_reader = SimpleDirectoryReader(\n input_dir=TEMP_DIR,\n exclude=[\"docling.db\", \"*.nxml\"],\n file_extractor={\".xml\": reader},\n filename_as_id=True,\n num_files_limit=100,\n)\nfrom llama_index.core import SimpleDirectoryReader from llama_index.readers.docling import DoclingReader reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) dir_reader = SimpleDirectoryReader( input_dir=TEMP_DIR, exclude=[\"docling.db\", \"*.nxml\"], file_extractor={\".xml\": reader}, filename_as_id=True, num_files_limit=100, ) In\u00a0[14]: Copied!
from llama_index.node_parser.docling import DoclingNodeParser\n\nnode_parser = DoclingNodeParser()\nfrom llama_index.node_parser.docling import DoclingNodeParser node_parser = DoclingNodeParser() In\u00a0[\u00a0]: Copied!
from llama_index.core import StorageContext, VectorStoreIndex\nfrom llama_index.vector_stores.milvus import MilvusVectorStore\n\nvector_store = MilvusVectorStore(\n uri=MILVUS_URI,\n dim=embed_dim,\n overwrite=True,\n)\n\nindex = VectorStoreIndex.from_documents(\n documents=dir_reader.load_data(show_progress=True),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n show_progress=True,\n)\nfrom llama_index.core import StorageContext, VectorStoreIndex from llama_index.vector_stores.milvus import MilvusVectorStore vector_store = MilvusVectorStore( uri=MILVUS_URI, dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=dir_reader.load_data(show_progress=True), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, show_progress=True, )
Finally, add the PMC article to the vector store directly from the reader.
In\u00a0[14]: Copied!index.from_documents(\n documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nindex.from_documents( documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) Out[14]:
<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x373a7f7d0>
The retriever can be used to identify highly relevant documents:
In\u00a0[15]: Copied!retriever = index.as_retriever(similarity_top_k=3)\nresults = retriever.retrieve(\"What patents are related to fitness devices?\")\n\nfor item in results:\n print(item)\nretriever = index.as_retriever(similarity_top_k=3) results = retriever.retrieve(\"What patents are related to fitness devices?\") for item in results: print(item)
Node ID: 5afd36c0-a739-4a88-a51c-6d0f75358db5\nText: The portable fitness monitoring device 102 may be a device such\nas, for example, a mobile phone, a personal digital assistant, a music\nfile player (e.g. and MP3 player), an intelligent article for wearing\n(e.g. a fitness monitoring garment, wrist band, or watch), a dongle\n(e.g. a small hardware device that protects software) that includes a\nfitn...\nScore: 0.772\n\nNode ID: f294b5fd-9089-43cb-8c4e-d1095a634ff1\nText: US Patent Application US 20120071306 entitled \u201cPortable\nMultipurpose Whole Body Exercise Device\u201d discloses a portable\nmultipurpose whole body exercise device which can be used for general\nfitness, Pilates-type, core strengthening, therapeutic, and\nrehabilitative exercises as well as stretching and physical therapy\nand which includes storable acc...\nScore: 0.749\n\nNode ID: 8251c7ef-1165-42e1-8c91-c99c8a711bf7\nText: Program products, methods, and systems for providing fitness\nmonitoring services of the present invention can include any software\napplication executed by one or more computing devices. A computing\ndevice can be any type of computing device having one or more\nprocessors. For example, a computing device can be a workstation,\nmobile device (e.g., ...\nScore: 0.744\n\n
With the query engine, we can run question answering over the set of indexed documents using the RAG pattern.
First, we can prompt the LLM directly:
In\u00a0[16]: Copied!from llama_index.core.base.llms.types import ChatMessage, MessageRole\nfrom rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console()\nquery = \"Do mosquitoes in high altitude expand viruses over large distances?\"\n\nusr_msg = ChatMessage(role=MessageRole.USER, content=query)\nresponse = GEN_MODEL.chat(messages=[usr_msg])\n\nconsole.print(Panel(query, title=\"Prompt\", border_style=\"bold red\"))\nconsole.print(\n Panel(\n response.message.content.strip(),\n title=\"Generated Content\",\n border_style=\"bold green\",\n )\n)\nfrom llama_index.core.base.llms.types import ChatMessage, MessageRole from rich.console import Console from rich.panel import Panel console = Console() query = \"Do mosquitoes in high altitude expand viruses over large distances?\" usr_msg = ChatMessage(role=MessageRole.USER, content=query) response = GEN_MODEL.chat(messages=[usr_msg]) console.print(Panel(query, title=\"Prompt\", border_style=\"bold red\")) console.print( Panel( response.message.content.strip(), title=\"Generated Content\", border_style=\"bold green\", ) )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Do mosquitoes in high altitude expand viruses over large distances? \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not \u2502\n\u2502 primarily dependent on altitude. Mosquitoes are vectors for various diseases, such as malaria, dengue fever, \u2502\n\u2502 and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, \u2502\n\u2502 and environmental conditions that support their survival and reproduction. \u2502\n\u2502 \u2502\n\u2502 At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder \u2502\n\u2502 temperatures, lower humidity, and stronger winds, which can limit their population size and distribution. \u2502\n\u2502 However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases \u2502\n\u2502 in these areas. \u2502\n\u2502 \u2502\n\u2502 It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is \u2502\n\u2502 not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance \u2502\n\u2502 transmission of viruses is more often associated with human travel and transportation, which can rapidly spread \u2502\n\u2502 infected mosquitoes or humans to new areas, leading to the spread of disease. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Now, we can compare the response when the model is prompted with the indexed PMC article as supporting context:
In\u00a0[17]: Copied!from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n\nfilters = MetadataFilters(\n filters=[\n ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"),\n ]\n)\n\nquery_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3)\nresult = query_engine.query(query)\n\nconsole.print(\n Panel(\n result.response.strip(),\n title=\"Generated Content with RAG\",\n border_style=\"bold green\",\n )\n)\nfrom llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters filters = MetadataFilters( filters=[ ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"), ] ) query_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3) result = query_engine.query(query) console.print( Panel( result.response.strip(), title=\"Generated Content with RAG\", border_style=\"bold green\", ) )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content with RAG \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female \u2502\n\u2502 mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with \u2502\n\u2502 arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with \u2502\n\u2502 flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, \u2502\n\u2502 including three arboviruses that affect humans (dengue, West Nile, and M\u2019Poko viruses). The study provides \u2502\n\u2502 compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/backend_xml_rag/#conversion-of-custom-xml","title":"Conversion of custom XML\u00b6","text":""},{"location":"examples/backend_xml_rag/#overview","title":"Overview\u00b6","text":""},{"location":"examples/backend_xml_rag/#simple-conversion","title":"Simple conversion\u00b6","text":"
XML is a file format that defines and stores data in a way that is both human-readable and machine-readable. Because of this flexibility, Docling requires custom backend processors to interpret XML definitions and convert them into DoclingDocument
objects.
Some public data collections in XML format are already supported by Docling (USPTO patents and PMC articles). In these cases, the document conversion is straightforward and the same as with any other supported format, such as PDF or HTML. The execution example in Simple Conversion is the recommended usage of Docling for a single file:
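A minimal sketch of that single-file usage, with the hypothetical local file name article.nxml standing in for any supported JATS or USPTO XML file:
from docling.document_converter import DocumentConverter\n\n# Convert a single XML file and export the result to Markdown\ndoc = DocumentConverter().convert(\"article.nxml\").document\nprint(doc.export_to_markdown())\n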
"},{"location":"examples/backend_xml_rag/#end-to-end-application","title":"End-to-end application\u00b6","text":"This section describes a step-by-step application for processing XML files from supported public collections and use them for question-answering.
"},{"location":"examples/backend_xml_rag/#setup","title":"Setup\u00b6","text":""},{"location":"examples/backend_xml_rag/#fetch-the-data","title":"Fetch the data\u00b6","text":""},{"location":"examples/backend_xml_rag/#pmc-articles","title":"PMC articles\u00b6","text":"The OA file is a manifest file of all the PMC articles, including the URL path to download the source files. In this notebook we will use as example the article Pathogens spread by high-altitude windborne mosquitoes, which is available in the archive file PMC11703268.tar.gz.
"},{"location":"examples/backend_xml_rag/#uspto-patents","title":"USPTO patents\u00b6","text":"Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, split its content in sections, and dumps each section as an XML file. For simplicity, this pipeline is shown here in a sequential manner, but it could be parallelized.
"},{"location":"examples/backend_xml_rag/#using-the-backend-converter-optional","title":"Using the backend converter (optional)\u00b6","text":"PubMedDocumentBackend
and PatentUsptoDocumentBackend
aim at handling the parsing of PMC articles and USPTO patents, respectively. Both backends provide the function is_valid()
to check whether the input document is supported by the corresponding backend. Note that DoclingReader
uses Docling's DocumentConverter
by default and therefore it will recognize the format of the XML files and leverage the PatentUsptoDocumentBackend
automatically.
For demonstration purposes, we limit the scope of the analysis to the first 100 patents.
"},{"location":"examples/backend_xml_rag/#set-the-node-parser","title":"Set the node parser\u00b6","text":"Note that the HierarchicalChunker
is the default chunking implementation of the DoclingNodeParser
.
import json\nimport logging\nimport time\nfrom collections.abc import Iterable\nfrom pathlib import Path\nimport json import logging import time from collections.abc import Iterable from pathlib import Path In\u00a0[\u00a0]: Copied!
import yaml\nfrom docling_core.types.doc import ImageRefMode\nimport yaml from docling_core.types.doc import ImageRefMode In\u00a0[\u00a0]: Copied!
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
USE_V2 = True\nUSE_LEGACY = False\nUSE_V2 = True USE_LEGACY = False In\u00a0[\u00a0]: Copied!
def export_documents(\n conv_results: Iterable[ConversionResult],\n output_dir: Path,\n):\n output_dir.mkdir(parents=True, exist_ok=True)\n\n success_count = 0\n failure_count = 0\n partial_success_count = 0\n\n for conv_res in conv_results:\n if conv_res.status == ConversionStatus.SUCCESS:\n success_count += 1\n doc_filename = conv_res.input.file.stem\n\n if USE_V2:\n conv_res.document.save_as_json(\n output_dir / f\"{doc_filename}.json\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n conv_res.document.save_as_html(\n output_dir / f\"{doc_filename}.html\",\n image_mode=ImageRefMode.EMBEDDED,\n )\n conv_res.document.save_as_document_tokens(\n output_dir / f\"{doc_filename}.doctags.txt\"\n )\n conv_res.document.save_as_markdown(\n output_dir / f\"{doc_filename}.md\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n conv_res.document.save_as_markdown(\n output_dir / f\"{doc_filename}.txt\",\n image_mode=ImageRefMode.PLACEHOLDER,\n strict_text=True,\n )\n\n # Export Docling document format to YAML:\n with (output_dir / f\"{doc_filename}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(conv_res.document.export_to_dict()))\n\n # Export Docling document format to doctags:\n with (output_dir / f\"{doc_filename}.doctags.txt\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_document_tokens())\n\n # Export Docling document format to markdown:\n with (output_dir / f\"{doc_filename}.md\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_markdown())\n\n # Export Docling document format to text:\n with (output_dir / f\"{doc_filename}.txt\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_markdown(strict_text=True))\n\n if USE_LEGACY:\n # Export Deep Search document JSON format:\n with (output_dir / f\"{doc_filename}.legacy.json\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(json.dumps(conv_res.legacy_document.export_to_dict()))\n\n # Export Text format:\n with (output_dir / f\"{doc_filename}.legacy.txt\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(\n conv_res.legacy_document.export_to_markdown(strict_text=True)\n )\n\n # Export Markdown format:\n with (output_dir / f\"{doc_filename}.legacy.md\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(conv_res.legacy_document.export_to_markdown())\n\n # Export Document Tags format:\n with (output_dir / f\"{doc_filename}.legacy.doctags.txt\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(conv_res.legacy_document.export_to_document_tokens())\n\n elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS:\n _log.info(\n f\"Document {conv_res.input.file} was partially converted with the following errors:\"\n )\n for item in conv_res.errors:\n _log.info(f\"\\t{item.error_message}\")\n partial_success_count += 1\n else:\n _log.info(f\"Document {conv_res.input.file} failed to convert.\")\n failure_count += 1\n\n _log.info(\n f\"Processed {success_count + partial_success_count + failure_count} docs, \"\n f\"of which {failure_count} failed \"\n f\"and {partial_success_count} were partially converted.\"\n )\n return success_count, partial_success_count, failure_count\ndef export_documents( conv_results: Iterable[ConversionResult], output_dir: Path, ): output_dir.mkdir(parents=True, exist_ok=True) success_count = 0 failure_count = 0 partial_success_count = 0 for conv_res in conv_results: if conv_res.status == ConversionStatus.SUCCESS: success_count += 1 doc_filename = conv_res.input.file.stem if USE_V2: conv_res.document.save_as_json( output_dir / f\"{doc_filename}.json\", 
image_mode=ImageRefMode.PLACEHOLDER, ) conv_res.document.save_as_html( output_dir / f\"{doc_filename}.html\", image_mode=ImageRefMode.EMBEDDED, ) conv_res.document.save_as_document_tokens( output_dir / f\"{doc_filename}.doctags.txt\" ) conv_res.document.save_as_markdown( output_dir / f\"{doc_filename}.md\", image_mode=ImageRefMode.PLACEHOLDER, ) conv_res.document.save_as_markdown( output_dir / f\"{doc_filename}.txt\", image_mode=ImageRefMode.PLACEHOLDER, strict_text=True, ) # Export Docling document format to YAML: with (output_dir / f\"{doc_filename}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(conv_res.document.export_to_dict())) # Export Docling document format to doctags: with (output_dir / f\"{doc_filename}.doctags.txt\").open(\"w\") as fp: fp.write(conv_res.document.export_to_document_tokens()) # Export Docling document format to markdown: with (output_dir / f\"{doc_filename}.md\").open(\"w\") as fp: fp.write(conv_res.document.export_to_markdown()) # Export Docling document format to text: with (output_dir / f\"{doc_filename}.txt\").open(\"w\") as fp: fp.write(conv_res.document.export_to_markdown(strict_text=True)) if USE_LEGACY: # Export Deep Search document JSON format: with (output_dir / f\"{doc_filename}.legacy.json\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(json.dumps(conv_res.legacy_document.export_to_dict())) # Export Text format: with (output_dir / f\"{doc_filename}.legacy.txt\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write( conv_res.legacy_document.export_to_markdown(strict_text=True) ) # Export Markdown format: with (output_dir / f\"{doc_filename}.legacy.md\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(conv_res.legacy_document.export_to_markdown()) # Export Document Tags format: with (output_dir / f\"{doc_filename}.legacy.doctags.txt\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(conv_res.legacy_document.export_to_document_tokens()) elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS: _log.info( f\"Document {conv_res.input.file} was partially converted with the following errors:\" ) for item in conv_res.errors: _log.info(f\"\\t{item.error_message}\") partial_success_count += 1 else: _log.info(f\"Document {conv_res.input.file} failed to convert.\") failure_count += 1 _log.info( f\"Processed {success_count + partial_success_count + failure_count} docs, \" f\"of which {failure_count} failed \" f\"and {partial_success_count} were partially converted.\" ) return success_count, partial_success_count, failure_count In\u00a0[\u00a0]: Copied!
def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_paths = [\n data_folder / \"pdf/2206.01062.pdf\",\n data_folder / \"pdf/2203.01017v2.pdf\",\n data_folder / \"pdf/2305.03393v1.pdf\",\n data_folder / \"pdf/redp5110_sampled.pdf\",\n ]\n\n # buf = BytesIO((data_folder / \"pdf/2206.01062.pdf\").open(\"rb\").read())\n # docs = [DocumentStream(name=\"my_doc.pdf\", stream=buf)]\n # input = DocumentConversionInput.from_streams(docs)\n\n # # Turn on inline debug visualizations:\n # settings.debug.visualize_layout = True\n # settings.debug.visualize_ocr = True\n # settings.debug.visualize_tables = True\n # settings.debug.visualize_cells = True\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.generate_page_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options, backend=DoclingParseV4DocumentBackend\n )\n }\n )\n\n start_time = time.time()\n\n conv_results = doc_converter.convert_all(\n input_doc_paths,\n raises_on_error=False, # to let conversion run through all and examine results at the end\n )\n success_count, partial_success_count, failure_count = export_documents(\n conv_results, output_dir=Path(\"scratch\")\n )\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document conversion complete in {end_time:.2f} seconds.\")\n\n if failure_count > 0:\n raise RuntimeError(\n f\"The example failed converting {failure_count} on {len(input_doc_paths)}.\"\n )\ndef main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_paths = [ data_folder / \"pdf/2206.01062.pdf\", data_folder / \"pdf/2203.01017v2.pdf\", data_folder / \"pdf/2305.03393v1.pdf\", data_folder / \"pdf/redp5110_sampled.pdf\", ] # buf = BytesIO((data_folder / \"pdf/2206.01062.pdf\").open(\"rb\").read()) # docs = [DocumentStream(name=\"my_doc.pdf\", stream=buf)] # input = DocumentConversionInput.from_streams(docs) # # Turn on inline debug visualizations: # settings.debug.visualize_layout = True # settings.debug.visualize_ocr = True # settings.debug.visualize_tables = True # settings.debug.visualize_cells = True pipeline_options = PdfPipelineOptions() pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, backend=DoclingParseV4DocumentBackend ) } ) start_time = time.time() conv_results = doc_converter.convert_all( input_doc_paths, raises_on_error=False, # to let conversion run through all and examine results at the end ) success_count, partial_success_count, failure_count = export_documents( conv_results, output_dir=Path(\"scratch\") ) end_time = time.time() - start_time _log.info(f\"Document conversion complete in {end_time:.2f} seconds.\") if failure_count > 0: raise RuntimeError( f\"The example failed converting {failure_count} on {len(input_doc_paths)}.\" ) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/compare_vlm_models/","title":"Compare VLM models","text":"In\u00a0[\u00a0]: Copied!
import json\nimport sys\nimport time\nfrom pathlib import Path\nimport json import sys import time from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling_core.types.doc import DocItemLabel, ImageRefMode\nfrom docling_core.types.doc.document import DEFAULT_EXPORT_LABELS\nfrom tabulate import tabulate\nfrom docling_core.types.doc import DocItemLabel, ImageRefMode from docling_core.types.doc.document import DEFAULT_EXPORT_LABELS from tabulate import tabulate In\u00a0[\u00a0]: Copied!
from docling.datamodel import vlm_model_specs\nfrom docling.datamodel.accelerator_options import AcceleratorDevice\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import (\n InferenceFramework,\n InlineVlmOptions,\n ResponseFormat,\n TransformersModelType,\n TransformersPromptStyle,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nfrom docling.datamodel import vlm_model_specs from docling.datamodel.accelerator_options import AcceleratorDevice from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ( InferenceFramework, InlineVlmOptions, ResponseFormat, TransformersModelType, TransformersPromptStyle, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline In\u00a0[\u00a0]: Copied!
def convert(sources: list[Path], converter: DocumentConverter):\n model_id = pipeline_options.vlm_options.repo_id.replace(\"/\", \"_\")\n framework = pipeline_options.vlm_options.inference_framework\n for source in sources:\n print(\"================================================\")\n print(\"Processing...\")\n print(f\"Source: {source}\")\n print(\"---\")\n print(f\"Model: {model_id}\")\n print(f\"Framework: {framework}\")\n print(\"================================================\")\n print(\"\")\n\n res = converter.convert(source)\n\n print(\"\")\n\n fname = f\"{res.input.file.stem}-{model_id}-{framework}\"\n\n inference_time = 0.0\n for i, page in enumerate(res.pages):\n inference_time += page.predictions.vlm_response.generation_time\n print(\"\")\n print(\n f\" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:\"\n )\n print(page.predictions.vlm_response.text)\n print(\" ---------- \")\n\n print(\"===== Final output of the converted document =======\")\n\n with (out_path / f\"{fname}.json\").open(\"w\") as fp:\n fp.write(json.dumps(res.document.export_to_dict()))\n\n res.document.save_as_json(\n out_path / f\"{fname}.json\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n print(f\" => produced {out_path / fname}.json\")\n\n res.document.save_as_markdown(\n out_path / f\"{fname}.md\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n print(f\" => produced {out_path / fname}.md\")\n\n res.document.save_as_html(\n out_path / f\"{fname}.html\",\n image_mode=ImageRefMode.EMBEDDED,\n labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE],\n split_page_view=True,\n )\n print(f\" => produced {out_path / fname}.html\")\n\n pg_num = res.document.num_pages()\n print(\"\")\n print(\n f\"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}\"\n )\n print(\"====================================================\")\n\n return [\n source,\n model_id,\n str(framework),\n pg_num,\n inference_time,\n ]\ndef convert(sources: list[Path], converter: DocumentConverter): model_id = pipeline_options.vlm_options.repo_id.replace(\"/\", \"_\") framework = pipeline_options.vlm_options.inference_framework for source in sources: print(\"================================================\") print(\"Processing...\") print(f\"Source: {source}\") print(\"---\") print(f\"Model: {model_id}\") print(f\"Framework: {framework}\") print(\"================================================\") print(\"\") res = converter.convert(source) print(\"\") fname = f\"{res.input.file.stem}-{model_id}-{framework}\" inference_time = 0.0 for i, page in enumerate(res.pages): inference_time += page.predictions.vlm_response.generation_time print(\"\") print( f\" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:\" ) print(page.predictions.vlm_response.text) print(\" ---------- \") print(\"===== Final output of the converted document =======\") with (out_path / f\"{fname}.json\").open(\"w\") as fp: fp.write(json.dumps(res.document.export_to_dict())) res.document.save_as_json( out_path / f\"{fname}.json\", image_mode=ImageRefMode.PLACEHOLDER, ) print(f\" => produced {out_path / fname}.json\") res.document.save_as_markdown( out_path / f\"{fname}.md\", image_mode=ImageRefMode.PLACEHOLDER, ) print(f\" => produced {out_path / fname}.md\") res.document.save_as_html( out_path / f\"{fname}.html\", image_mode=ImageRefMode.EMBEDDED, labels=[*DEFAULT_EXPORT_LABELS, 
DocItemLabel.FOOTNOTE], split_page_view=True, ) print(f\" => produced {out_path / fname}.html\") pg_num = res.document.num_pages() print(\"\") print( f\"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}\" ) print(\"====================================================\") return [ source, model_id, str(framework), pg_num, inference_time, ] In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n sources = [\n \"tests/data/pdf/2305.03393v1-pg9.pdf\",\n ]\n\n out_path = Path(\"scratch\")\n out_path.mkdir(parents=True, exist_ok=True)\n\n ## Definiton of more inline models\n llava_qwen = InlineVlmOptions(\n repo_id=\"llava-hf/llava-interleave-qwen-0.5b-hf\",\n # prompt=\"Read text in the image.\",\n prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\",\n # prompt=\"Parse the reading order of this document.\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,\n supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU],\n scale=2.0,\n temperature=0.0,\n )\n\n # Note that this is not the expected way of using the Dolphin model, but it shows the usage of a raw prompt.\n dolphin_oneshot = InlineVlmOptions(\n repo_id=\"ByteDance/Dolphin\",\n prompt=\"<s>Read text in the image. <Answer/>\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,\n transformers_prompt_style=TransformersPromptStyle.RAW,\n supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU],\n scale=2.0,\n temperature=0.0,\n )\n\n ## Use VlmPipeline\n pipeline_options = VlmPipelineOptions()\n pipeline_options.generate_page_images = True\n\n ## On GPU systems, enable flash_attention_2 with CUDA:\n # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA\n # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True\n\n vlm_models = [\n ## DocTags / SmolDocling models\n vlm_model_specs.SMOLDOCLING_MLX,\n vlm_model_specs.SMOLDOCLING_TRANSFORMERS,\n ## Markdown models (using MLX framework)\n vlm_model_specs.QWEN25_VL_3B_MLX,\n vlm_model_specs.PIXTRAL_12B_MLX,\n vlm_model_specs.GEMMA3_12B_MLX,\n ## Markdown models (using Transformers framework)\n vlm_model_specs.GRANITE_VISION_TRANSFORMERS,\n vlm_model_specs.PHI4_TRANSFORMERS,\n vlm_model_specs.PIXTRAL_12B_TRANSFORMERS,\n ## More inline models\n dolphin_oneshot,\n llava_qwen,\n ]\n\n # Remove MLX models if not on Mac\n if sys.platform != \"darwin\":\n vlm_models = [\n m for m in vlm_models if m.inference_framework != InferenceFramework.MLX\n ]\n\n rows = []\n for vlm_options in vlm_models:\n pipeline_options.vlm_options = vlm_options\n\n ## Set up pipeline for PDF or image inputs\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n InputFormat.IMAGE: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n },\n )\n\n row = convert(sources=sources, converter=converter)\n rows.append(row)\n\n print(\n tabulate(\n rows, headers=[\"source\", \"model_id\", \"framework\", \"num_pages\", \"time\"]\n )\n )\n\n print(\"see if memory gets released ...\")\n time.sleep(10)\nif __name__ == \"__main__\": sources = [ \"tests/data/pdf/2305.03393v1-pg9.pdf\", ] out_path = Path(\"scratch\") out_path.mkdir(parents=True, exist_ok=True) ## Definiton of more inline models llava_qwen = InlineVlmOptions( repo_id=\"llava-hf/llava-interleave-qwen-0.5b-hf\", # prompt=\"Read text in the image.\", prompt=\"Convert this page to markdown. 
Do not miss any text and only output the bare markdown!\", # prompt=\"Parse the reading order of this document.\", response_format=ResponseFormat.MARKDOWN, inference_framework=InferenceFramework.TRANSFORMERS, transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT, supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU], scale=2.0, temperature=0.0, ) # Note that this is not the expected way of using the Dolphin model, but it shows the usage of a raw prompt. dolphin_oneshot = InlineVlmOptions( repo_id=\"ByteDance/Dolphin\", prompt=\"Read text in the image. \", response_format=ResponseFormat.MARKDOWN, inference_framework=InferenceFramework.TRANSFORMERS, transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT, transformers_prompt_style=TransformersPromptStyle.RAW, supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU], scale=2.0, temperature=0.0, ) ## Use VlmPipeline pipeline_options = VlmPipelineOptions() pipeline_options.generate_page_images = True ## On GPU systems, enable flash_attention_2 with CUDA: # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True vlm_models = [ ## DocTags / SmolDocling models vlm_model_specs.SMOLDOCLING_MLX, vlm_model_specs.SMOLDOCLING_TRANSFORMERS, ## Markdown models (using MLX framework) vlm_model_specs.QWEN25_VL_3B_MLX, vlm_model_specs.PIXTRAL_12B_MLX, vlm_model_specs.GEMMA3_12B_MLX, ## Markdown models (using Transformers framework) vlm_model_specs.GRANITE_VISION_TRANSFORMERS, vlm_model_specs.PHI4_TRANSFORMERS, vlm_model_specs.PIXTRAL_12B_TRANSFORMERS, ## More inline models dolphin_oneshot, llava_qwen, ] # Remove MLX models if not on Mac if sys.platform != \"darwin\": vlm_models = [ m for m in vlm_models if m.inference_framework != InferenceFramework.MLX ] rows = [] for vlm_options in vlm_models: pipeline_options.vlm_options = vlm_options ## Set up pipeline for PDF or image inputs converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), InputFormat.IMAGE: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), }, ) row = convert(sources=sources, converter=converter) rows.append(row) print( tabulate( rows, headers=[\"source\", \"model_id\", \"framework\", \"num_pages\", \"time\"] ) ) print(\"see if memory gets released ...\") time.sleep(10)"},{"location":"examples/compare_vlm_models/#compare-vlm-models","title":"Compare VLM models\u00b6","text":"
This example runs the VLM pipeline with different vision-language models. Their runtime as well as their output quality are compared.
"},{"location":"examples/custom_convert/","title":"Custom conversion","text":"In\u00a0[\u00a0]: Copied!import json\nimport logging\nimport time\nfrom pathlib import Path\nimport json import logging import time from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n ###########################################################################\n\n # The following sections contain a combination of PipelineOptions\n # and PDF Backends for various configurations.\n # Uncomment one section at the time to see the differences in the output.\n\n # PyPdfium without EasyOCR\n # --------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = False\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = False\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(\n # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend\n # )\n # }\n # )\n\n # PyPdfium with EasyOCR\n # -----------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(\n # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend\n # )\n # }\n # )\n\n # Docling Parse without EasyOCR\n # -------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = False\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with EasyOCR\n # ----------------------\n pipeline_options = PdfPipelineOptions()\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n pipeline_options.ocr_options.lang = [\"es\"]\n pipeline_options.accelerator_options = AcceleratorOptions(\n num_threads=4, device=AcceleratorDevice.AUTO\n )\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n # Docling Parse with EasyOCR (CPU only)\n # ----------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.ocr_options.use_gpu = False # <-- set this.\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with Tesseract\n # ----------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = TesseractOcrOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with Tesseract CLI\n # ----------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = TesseractCliOcrOptions()\n\n # doc_converter = 
DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with ocrmac(Mac only)\n # ----------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = OcrMacOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n ###########################################################################\n\n start_time = time.time()\n conv_result = doc_converter.convert(input_doc_path)\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted in {end_time:.2f} seconds.\")\n\n ## Export results\n output_dir = Path(\"scratch\")\n output_dir.mkdir(parents=True, exist_ok=True)\n doc_filename = conv_result.input.file.stem\n\n # Export Deep Search document JSON format:\n with (output_dir / f\"{doc_filename}.json\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(json.dumps(conv_result.document.export_to_dict()))\n\n # Export Text format:\n with (output_dir / f\"{doc_filename}.txt\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_text())\n\n # Export Markdown format:\n with (output_dir / f\"{doc_filename}.md\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_markdown())\n\n # Export Document Tags format:\n with (output_dir / f\"{doc_filename}.doctags\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_document_tokens())\ndef main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" ########################################################################### # The following sections contain a combination of PipelineOptions # and PDF Backends for various configurations. # Uncomment one section at the time to see the differences in the output. 
# PyPdfium without EasyOCR # -------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = False # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = False # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption( # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend # ) # } # ) # PyPdfium with EasyOCR # ----------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption( # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend # ) # } # ) # Docling Parse without EasyOCR # ------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = False # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with EasyOCR # ---------------------- pipeline_options = PdfPipelineOptions() pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True pipeline_options.ocr_options.lang = [\"es\"] pipeline_options.accelerator_options = AcceleratorOptions( num_threads=4, device=AcceleratorDevice.AUTO ) doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) # Docling Parse with EasyOCR (CPU only) # ---------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.ocr_options.use_gpu = False # <-- set this. 
# pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with Tesseract # ---------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = TesseractOcrOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with Tesseract CLI # ---------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = TesseractCliOcrOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with ocrmac(Mac only) # ---------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = OcrMacOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) ########################################################################### start_time = time.time() conv_result = doc_converter.convert(input_doc_path) end_time = time.time() - start_time _log.info(f\"Document converted in {end_time:.2f} seconds.\") ## Export results output_dir = Path(\"scratch\") output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_result.input.file.stem # Export Deep Search document JSON format: with (output_dir / f\"{doc_filename}.json\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(json.dumps(conv_result.document.export_to_dict())) # Export Text format: with (output_dir / f\"{doc_filename}.txt\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_text()) # Export Markdown format: with (output_dir / f\"{doc_filename}.md\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_markdown()) # Export Document Tags format: with (output_dir / f\"{doc_filename}.doctags\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_document_tokens()) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/develop_formula_understanding/","title":"Formula enrichment","text":"
WARNING This example demonstrates only how to develop a new enrichment model. It does not run the actual formula understanding model.
In\u00a0[\u00a0]: Copied!import logging\nfrom collections.abc import Iterable\nfrom pathlib import Path\nimport logging from collections.abc import Iterable from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem\nfrom docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.base_model import BaseItemAndImageEnrichmentModel\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.base_model import BaseItemAndImageEnrichmentModel from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline In\u00a0[\u00a0]: Copied!
class ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions):\n do_formula_understanding: bool = True\nclass ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions): do_formula_understanding: bool = True In\u00a0[\u00a0]: Copied!
# A new enrichment model using both the document element and its image as input\nclass ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel):\n images_scale = 2.6\n\n def __init__(self, enabled: bool):\n self.enabled = enabled\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return (\n self.enabled\n and isinstance(element, TextItem)\n and element.label == DocItemLabel.FORMULA\n )\n\n def __call__(\n self,\n doc: DoclingDocument,\n element_batch: Iterable[ItemAndImageEnrichmentElement],\n ) -> Iterable[NodeItem]:\n if not self.enabled:\n return\n\n for enrich_element in element_batch:\n enrich_element.image.show()\n\n yield enrich_element.item\n# A new enrichment model using both the document element and its image as input class ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel): images_scale = 2.6 def __init__(self, enabled: bool): self.enabled = enabled def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return ( self.enabled and isinstance(element, TextItem) and element.label == DocItemLabel.FORMULA ) def __call__( self, doc: DoclingDocument, element_batch: Iterable[ItemAndImageEnrichmentElement], ) -> Iterable[NodeItem]: if not self.enabled: return for enrich_element in element_batch: enrich_element.image.show() yield enrich_element.item In\u00a0[\u00a0]: Copied!
# How the pipeline can be extended.\nclass ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):\n def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions\n\n self.enrichment_pipe = [\n ExampleFormulaUnderstandingEnrichmentModel(\n enabled=self.pipeline_options.do_formula_understanding\n )\n ]\n\n if self.pipeline_options.do_formula_understanding:\n self.keep_backend = True\n\n @classmethod\n def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions:\n return ExampleFormulaUnderstandingPipelineOptions()\n# How the pipeline can be extended. class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline): def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions self.enrichment_pipe = [ ExampleFormulaUnderstandingEnrichmentModel( enabled=self.pipeline_options.do_formula_understanding ) ] if self.pipeline_options.do_formula_understanding: self.keep_backend = True @classmethod def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions: return ExampleFormulaUnderstandingPipelineOptions() In\u00a0[\u00a0]: Copied!
# Example main. In the final version, we simply have to set do_formula_understanding to true.\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2203.01017v2.pdf\"\n\n pipeline_options = ExampleFormulaUnderstandingPipelineOptions()\n pipeline_options.do_formula_understanding = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ExampleFormulaUnderstandingPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n doc_converter.convert(input_doc_path)\n# Example main. In the final version, we simply have to set do_formula_understanding to true. def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2203.01017v2.pdf\" pipeline_options = ExampleFormulaUnderstandingPipelineOptions() pipeline_options.do_formula_understanding = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ExampleFormulaUnderstandingPipeline, pipeline_options=pipeline_options, ) } ) doc_converter.convert(input_doc_path) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/develop_picture_enrichment/","title":"Figure enrichment","text":"
WARNING This example demonstrates only how to develop a new enrichment model. It does not run the actual picture classifier model.
In\u00a0[\u00a0]: Copied!import logging\nfrom collections.abc import Iterable\nfrom pathlib import Path\nfrom typing import Any\nimport logging from collections.abc import Iterable from pathlib import Path from typing import Any In\u00a0[\u00a0]: Copied!
from docling_core.types.doc import (\n DoclingDocument,\n NodeItem,\n PictureClassificationClass,\n PictureClassificationData,\n PictureItem,\n)\nfrom docling_core.types.doc import ( DoclingDocument, NodeItem, PictureClassificationClass, PictureClassificationData, PictureItem, ) In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.base_model import BaseEnrichmentModel\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.base_model import BaseEnrichmentModel from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline In\u00a0[\u00a0]: Copied!
class ExamplePictureClassifierPipelineOptions(PdfPipelineOptions):\n    do_picture_classifier: bool = True\nclass ExamplePictureClassifierPipelineOptions(PdfPipelineOptions): do_picture_classifier: bool = True In\u00a0[\u00a0]: Copied!
class ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel):\n def __init__(self, enabled: bool):\n self.enabled = enabled\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return self.enabled and isinstance(element, PictureItem)\n\n def __call__(\n self, doc: DoclingDocument, element_batch: Iterable[NodeItem]\n ) -> Iterable[Any]:\n if not self.enabled:\n return\n\n for element in element_batch:\n assert isinstance(element, PictureItem)\n\n # uncomment this to interactively visualize the image\n # element.get_image(doc).show()\n\n element.annotations.append(\n PictureClassificationData(\n provenance=\"example_classifier-0.0.1\",\n predicted_classes=[\n PictureClassificationClass(class_name=\"dummy\", confidence=0.42)\n ],\n )\n )\n\n yield element\nclass ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel): def __init__(self, enabled: bool): self.enabled = enabled def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return self.enabled and isinstance(element, PictureItem) def __call__( self, doc: DoclingDocument, element_batch: Iterable[NodeItem] ) -> Iterable[Any]: if not self.enabled: return for element in element_batch: assert isinstance(element, PictureItem) # uncomment this to interactively visualize the image # element.get_image(doc).show() element.annotations.append( PictureClassificationData( provenance=\"example_classifier-0.0.1\", predicted_classes=[ PictureClassificationClass(class_name=\"dummy\", confidence=0.42) ], ) ) yield element In\u00a0[\u00a0]: Copied!
class ExamplePictureClassifierPipeline(StandardPdfPipeline):\n    def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions):\n        super().__init__(pipeline_options)\n        self.pipeline_options: ExamplePictureClassifierPipelineOptions\n\n        self.enrichment_pipe = [\n            ExamplePictureClassifierEnrichmentModel(\n                enabled=pipeline_options.do_picture_classifier\n            )\n        ]\n\n    @classmethod\n    def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions:\n        return ExamplePictureClassifierPipelineOptions()\nclass ExamplePictureClassifierPipeline(StandardPdfPipeline): def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: ExamplePictureClassifierPipelineOptions self.enrichment_pipe = [ ExamplePictureClassifierEnrichmentModel( enabled=pipeline_options.do_picture_classifier ) ] @classmethod def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions: return ExamplePictureClassifierPipelineOptions() In\u00a0[\u00a0]: Copied!
def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = ExamplePictureClassifierPipelineOptions()\n pipeline_options.images_scale = 2.0\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ExamplePictureClassifierPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n\n for element, _level in result.document.iterate_items():\n if isinstance(element, PictureItem):\n print(\n f\"The model populated the `data` portion of picture {element.self_ref}:\\n{element.annotations}\"\n )\ndef main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = ExamplePictureClassifierPipelineOptions() pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ExamplePictureClassifierPipeline, pipeline_options=pipeline_options, ) } ) result = doc_converter.convert(input_doc_path) for element, _level in result.document.iterate_items(): if isinstance(element, PictureItem): print( f\"The model populated the `data` portion of picture {element.self_ref}:\\n{element.annotations}\" ) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/enrich_doclingdocument/","title":"Enrich DoclingDocument","text":"In\u00a0[\u00a0]: Copied!
from pathlib import Path\nfrom typing import Iterable, Optional\nfrom pathlib import Path from typing import Iterable, Optional In\u00a0[\u00a0]: Copied!
from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem\nfrom rich.pretty import pprint\nfrom docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem from rich.pretty import pprint In\u00a0[\u00a0]: Copied!
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\nfrom docling.datamodel.accelerator_options import AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.document import InputDocument\nfrom docling.models.base_model import BaseItemAndImageEnrichmentModel\nfrom docling.models.document_picture_classifier import (\n DocumentPictureClassifier,\n DocumentPictureClassifierOptions,\n)\nfrom docling.utils.utils import chunkify\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend from docling.datamodel.accelerator_options import AcceleratorOptions from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement from docling.datamodel.document import InputDocument from docling.models.base_model import BaseItemAndImageEnrichmentModel from docling.models.document_picture_classifier import ( DocumentPictureClassifier, DocumentPictureClassifierOptions, ) from docling.utils.utils import chunkify In\u00a0[\u00a0]: Copied!
BATCH_SIZE = 4\nBATCH_SIZE = 4 In\u00a0[\u00a0]: Copied!
def prepare_element(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n element: NodeItem,\n) -> Optional[ItemAndImageEnrichmentElement]:\n if not model.is_processable(doc=doc, element=element):\n return None\n\n assert isinstance(element, DocItem)\n element_prov = element.prov[0]\n\n bbox = element_prov.bbox\n width = bbox.r - bbox.l\n height = bbox.t - bbox.b\n\n expanded_bbox = BoundingBox(\n l=bbox.l - width * model.expansion_factor,\n t=bbox.t + height * model.expansion_factor,\n r=bbox.r + width * model.expansion_factor,\n b=bbox.b - height * model.expansion_factor,\n coord_origin=bbox.coord_origin,\n )\n\n page_ix = element_prov.page_no - 1\n page_backend = backend.load_page(page_no=page_ix)\n cropped_image = page_backend.get_page_image(\n scale=model.images_scale, cropbox=expanded_bbox\n )\n return ItemAndImageEnrichmentElement(item=element, image=cropped_image)\ndef prepare_element( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, element: NodeItem, ) -> Optional[ItemAndImageEnrichmentElement]: if not model.is_processable(doc=doc, element=element): return None assert isinstance(element, DocItem) element_prov = element.prov[0] bbox = element_prov.bbox width = bbox.r - bbox.l height = bbox.t - bbox.b expanded_bbox = BoundingBox( l=bbox.l - width * model.expansion_factor, t=bbox.t + height * model.expansion_factor, r=bbox.r + width * model.expansion_factor, b=bbox.b - height * model.expansion_factor, coord_origin=bbox.coord_origin, ) page_ix = element_prov.page_no - 1 page_backend = backend.load_page(page_no=page_ix) cropped_image = page_backend.get_page_image( scale=model.images_scale, cropbox=expanded_bbox ) return ItemAndImageEnrichmentElement(item=element, image=cropped_image) In\u00a0[\u00a0]: Copied!
def enrich_document(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n) -> DoclingDocument:\n def _prepare_elements(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n ) -> Iterable[NodeItem]:\n for doc_element, _level in doc.iterate_items():\n prepared_element = prepare_element(\n doc=doc, backend=backend, model=model, element=doc_element\n )\n if prepared_element is not None:\n yield prepared_element\n\n for element_batch in chunkify(\n _prepare_elements(doc, backend, model),\n BATCH_SIZE,\n ):\n for element in model(doc=doc, element_batch=element_batch): # Must exhaust!\n pass\n\n return doc\ndef enrich_document( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, ) -> DoclingDocument: def _prepare_elements( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, ) -> Iterable[NodeItem]: for doc_element, _level in doc.iterate_items(): prepared_element = prepare_element( doc=doc, backend=backend, model=model, element=doc_element ) if prepared_element is not None: yield prepared_element for element_batch in chunkify( _prepare_elements(doc, backend, model), BATCH_SIZE, ): for element in model(doc=doc, element_batch=element_batch): # Must exhaust! pass return doc In\u00a0[\u00a0]: Copied!
def main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_pdf_path = data_folder / \"pdf/2206.01062.pdf\"\n\n input_doc_path = data_folder / \"groundtruth/docling_v2/2206.01062.json\"\n\n doc = DoclingDocument.load_from_json(input_doc_path)\n\n in_pdf_doc = InputDocument(\n input_pdf_path,\n format=InputFormat.PDF,\n backend=PyPdfiumDocumentBackend,\n filename=input_pdf_path.name,\n )\n backend = in_pdf_doc._backend\n\n model = DocumentPictureClassifier(\n enabled=True,\n artifacts_path=None,\n options=DocumentPictureClassifierOptions(),\n accelerator_options=AcceleratorOptions(),\n )\n\n doc = enrich_document(doc=doc, backend=backend, model=model)\n\n for pic in doc.pictures[:5]:\n print(pic.self_ref)\n pprint(pic.annotations)\ndef main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_pdf_path = data_folder / \"pdf/2206.01062.pdf\" input_doc_path = data_folder / \"groundtruth/docling_v2/2206.01062.json\" doc = DoclingDocument.load_from_json(input_doc_path) in_pdf_doc = InputDocument( input_pdf_path, format=InputFormat.PDF, backend=PyPdfiumDocumentBackend, filename=input_pdf_path.name, ) backend = in_pdf_doc._backend model = DocumentPictureClassifier( enabled=True, artifacts_path=None, options=DocumentPictureClassifierOptions(), accelerator_options=AcceleratorOptions(), ) doc = enrich_document(doc=doc, backend=backend, model=model) for pic in doc.pictures[:5]: print(pic.self_ref) pprint(pic.annotations) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/enrich_doclingdocument/#enrich-doclingdocument","title":"Enrich DoclingDocument\u00b6","text":"
This example shows how to run Docling enrichment models on documents that have already been converted and stored as serialized DoclingDocument JSON files.
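If you do not yet have such a JSON file, it can be produced with the standard converter and the export_to_dict() call already used in the export examples above. A minimal sketch, assuming any supported input document (the path below is only a placeholder):

import json
from pathlib import Path

from docling.document_converter import DocumentConverter

# Convert a supported document and serialize it as DoclingDocument JSON,
# which DoclingDocument.load_from_json() can read back later.
conv_res = DocumentConverter().convert("path/to/document.pdf")  # placeholder input path

out_path = Path("scratch") / f"{conv_res.input.file.stem}.json"
out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w", encoding="utf-8") as fp:
    json.dump(conv_res.document.export_to_dict(), fp)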
"},{"location":"examples/enrich_doclingdocument/#load-modules","title":"Load modules\u00b6","text":""},{"location":"examples/enrich_doclingdocument/#define-batch-size-used-for-processing","title":"Define batch size used for processing\u00b6","text":""},{"location":"examples/enrich_doclingdocument/#from-docitem-to-the-model-inputs","title":"From DocItem to the model inputs\u00b6","text":"The following function is responsible for taking an item and applying the required pre-processing for the model. In this case we generate a cropped image from the document backend.
"},{"location":"examples/enrich_doclingdocument/#iterate-through-the-document","title":"Iterate through the document\u00b6","text":"This block defines the enrich_document()
which is responsible for iterating through the document and batch the selected document items for running through the model.
The main()
function which initializes the document and model objects for calling enrich_document()
.
import logging\nimport time\nfrom pathlib import Path\nimport logging import time from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem\nfrom docling_core.types.doc import ImageRefMode, PictureItem, TableItem In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
IMAGE_RESOLUTION_SCALE = 2.0\nIMAGE_RESOLUTION_SCALE = 2.0 In\u00a0[\u00a0]: Copied!
def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n # Important: For operating with page images, we must keep them, otherwise the DocumentConverter\n # will destroy them for cleaning up memory.\n # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.\n # scale=1 correspond of a standard 72 DPI image\n # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched\n # with the image field\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n doc_filename = conv_res.input.file.stem\n\n # Save page images\n for page_no, page in conv_res.document.pages.items():\n page_no = page.page_no\n page_image_filename = output_dir / f\"{doc_filename}-{page_no}.png\"\n with page_image_filename.open(\"wb\") as fp:\n page.image.pil_image.save(fp, format=\"PNG\")\n\n # Save images of figures and tables\n table_counter = 0\n picture_counter = 0\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TableItem):\n table_counter += 1\n element_image_filename = (\n output_dir / f\"{doc_filename}-table-{table_counter}.png\"\n )\n with element_image_filename.open(\"wb\") as fp:\n element.get_image(conv_res.document).save(fp, \"PNG\")\n\n if isinstance(element, PictureItem):\n picture_counter += 1\n element_image_filename = (\n output_dir / f\"{doc_filename}-picture-{picture_counter}.png\"\n )\n with element_image_filename.open(\"wb\") as fp:\n element.get_image(conv_res.document).save(fp, \"PNG\")\n\n # Save markdown with embedded pictures\n md_filename = output_dir / f\"{doc_filename}-with-images.md\"\n conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n # Save markdown with externally referenced pictures\n md_filename = output_dir / f\"{doc_filename}-with-image-refs.md\"\n conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)\n\n # Save HTML with externally referenced pictures\n html_filename = output_dir / f\"{doc_filename}-with-image-refs.html\"\n conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED)\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted and figures exported in {end_time:.2f} seconds.\")\ndef main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # Important: For operating with page images, we must keep them, otherwise the DocumentConverter # will destroy them for cleaning up memory. # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images. 
# scale=1 correspond of a standard 72 DPI image # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched # with the image field pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_res.input.file.stem # Save page images for page_no, page in conv_res.document.pages.items(): page_no = page.page_no page_image_filename = output_dir / f\"{doc_filename}-{page_no}.png\" with page_image_filename.open(\"wb\") as fp: page.image.pil_image.save(fp, format=\"PNG\") # Save images of figures and tables table_counter = 0 picture_counter = 0 for element, _level in conv_res.document.iterate_items(): if isinstance(element, TableItem): table_counter += 1 element_image_filename = ( output_dir / f\"{doc_filename}-table-{table_counter}.png\" ) with element_image_filename.open(\"wb\") as fp: element.get_image(conv_res.document).save(fp, \"PNG\") if isinstance(element, PictureItem): picture_counter += 1 element_image_filename = ( output_dir / f\"{doc_filename}-picture-{picture_counter}.png\" ) with element_image_filename.open(\"wb\") as fp: element.get_image(conv_res.document).save(fp, \"PNG\") # Save markdown with embedded pictures md_filename = output_dir / f\"{doc_filename}-with-images.md\" conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) # Save markdown with externally referenced pictures md_filename = output_dir / f\"{doc_filename}-with-image-refs.md\" conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED) # Save HTML with externally referenced pictures html_filename = output_dir / f\"{doc_filename}-with-image-refs.html\" conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED) end_time = time.time() - start_time _log.info(f\"Document converted and figures exported in {end_time:.2f} seconds.\") In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/export_multimodal/","title":"Multimodal export","text":"In\u00a0[\u00a0]: Copied!
import datetime\nimport logging\nimport time\nfrom pathlib import Path\nimport datetime import logging import time from pathlib import Path In\u00a0[\u00a0]: Copied!
import pandas as pd\nimport pandas as pd In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.utils.export import generate_multimodal_pages\nfrom docling.utils.utils import create_hash\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.utils.export import generate_multimodal_pages from docling.utils.utils import create_hash In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
IMAGE_RESOLUTION_SCALE = 2.0\nIMAGE_RESOLUTION_SCALE = 2.0 In\u00a0[\u00a0]: Copied!
def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n # Important: For operating with page images, we must keep them, otherwise the DocumentConverter\n # will destroy them for cleaning up memory.\n # This is done by setting AssembleOptions.images_scale, which also defines the scale of images.\n # scale=1 correspond of a standard 72 DPI image\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n\n rows = []\n for (\n content_text,\n content_md,\n content_dt,\n page_cells,\n page_segments,\n page,\n ) in generate_multimodal_pages(conv_res):\n dpi = page._default_image_scale * 72\n\n rows.append(\n {\n \"document\": conv_res.input.file.name,\n \"hash\": conv_res.input.document_hash,\n \"page_hash\": create_hash(\n conv_res.input.document_hash + \":\" + str(page.page_no - 1)\n ),\n \"image\": {\n \"width\": page.image.width,\n \"height\": page.image.height,\n \"bytes\": page.image.tobytes(),\n },\n \"cells\": page_cells,\n \"contents\": content_text,\n \"contents_md\": content_md,\n \"contents_dt\": content_dt,\n \"segments\": page_segments,\n \"extra\": {\n \"page_num\": page.page_no + 1,\n \"width_in_points\": page.size.width,\n \"height_in_points\": page.size.height,\n \"dpi\": dpi,\n },\n }\n )\n\n # Generate one parquet from all documents\n df_result = pd.json_normalize(rows)\n now = datetime.datetime.now()\n output_filename = output_dir / f\"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet\"\n df_result.to_parquet(output_filename)\n\n end_time = time.time() - start_time\n\n _log.info(\n f\"Document converted and multimodal pages generated in {end_time:.2f} seconds.\"\n )\n\n # This block demonstrates how the file can be opened with the HF datasets library\n # from datasets import Dataset\n # from PIL import Image\n # multimodal_df = pd.read_parquet(output_filename)\n\n # # Convert pandas DataFrame to Hugging Face Dataset and load bytes into image\n # dataset = Dataset.from_pandas(multimodal_df)\n # def transforms(examples):\n # examples[\"image\"] = Image.frombytes('RGB', (examples[\"image.width\"], examples[\"image.height\"]), examples[\"image.bytes\"], 'raw')\n # return examples\n # dataset = dataset.map(transforms)\ndef main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # Important: For operating with page images, we must keep them, otherwise the DocumentConverter # will destroy them for cleaning up memory. # This is done by setting AssembleOptions.images_scale, which also defines the scale of images. 
# scale=1 correspond of a standard 72 DPI image pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) rows = [] for ( content_text, content_md, content_dt, page_cells, page_segments, page, ) in generate_multimodal_pages(conv_res): dpi = page._default_image_scale * 72 rows.append( { \"document\": conv_res.input.file.name, \"hash\": conv_res.input.document_hash, \"page_hash\": create_hash( conv_res.input.document_hash + \":\" + str(page.page_no - 1) ), \"image\": { \"width\": page.image.width, \"height\": page.image.height, \"bytes\": page.image.tobytes(), }, \"cells\": page_cells, \"contents\": content_text, \"contents_md\": content_md, \"contents_dt\": content_dt, \"segments\": page_segments, \"extra\": { \"page_num\": page.page_no + 1, \"width_in_points\": page.size.width, \"height_in_points\": page.size.height, \"dpi\": dpi, }, } ) # Generate one parquet from all documents df_result = pd.json_normalize(rows) now = datetime.datetime.now() output_filename = output_dir / f\"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet\" df_result.to_parquet(output_filename) end_time = time.time() - start_time _log.info( f\"Document converted and multimodal pages generated in {end_time:.2f} seconds.\" ) # This block demonstrates how the file can be opened with the HF datasets library # from datasets import Dataset # from PIL import Image # multimodal_df = pd.read_parquet(output_filename) # # Convert pandas DataFrame to Hugging Face Dataset and load bytes into image # dataset = Dataset.from_pandas(multimodal_df) # def transforms(examples): # examples[\"image\"] = Image.frombytes('RGB', (examples[\"image.width\"], examples[\"image.height\"]), examples[\"image.bytes\"], 'raw') # return examples # dataset = dataset.map(transforms) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/export_tables/","title":"Table export","text":"In\u00a0[\u00a0]: Copied!
import logging\nimport time\nfrom pathlib import Path\nimport logging import time from pathlib import Path In\u00a0[\u00a0]: Copied!
import pandas as pd\nimport pandas as pd In\u00a0[\u00a0]: Copied!
from docling.document_converter import DocumentConverter\nfrom docling.document_converter import DocumentConverter In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n doc_converter = DocumentConverter()\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n\n doc_filename = conv_res.input.file.stem\n\n # Export tables\n for table_ix, table in enumerate(conv_res.document.tables):\n table_df: pd.DataFrame = table.export_to_dataframe()\n print(f\"## Table {table_ix}\")\n print(table_df.to_markdown())\n\n # Save the table as csv\n element_csv_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.csv\"\n _log.info(f\"Saving CSV table to {element_csv_filename}\")\n table_df.to_csv(element_csv_filename)\n\n # Save the table as html\n element_html_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.html\"\n _log.info(f\"Saving HTML table to {element_html_filename}\")\n with element_html_filename.open(\"w\") as fp:\n fp.write(table.export_to_html(doc=conv_res.document))\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted and tables exported in {end_time:.2f} seconds.\")\ndef main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") doc_converter = DocumentConverter() start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_res.input.file.stem # Export tables for table_ix, table in enumerate(conv_res.document.tables): table_df: pd.DataFrame = table.export_to_dataframe() print(f\"## Table {table_ix}\") print(table_df.to_markdown()) # Save the table as csv element_csv_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.csv\" _log.info(f\"Saving CSV table to {element_csv_filename}\") table_df.to_csv(element_csv_filename) # Save the table as html element_html_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.html\" _log.info(f\"Saving HTML table to {element_html_filename}\") with element_html_filename.open(\"w\") as fp: fp.write(table.export_to_html(doc=conv_res.document)) end_time = time.time() - start_time _log.info(f\"Document converted and tables exported in {end_time:.2f} seconds.\") In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/full_page_ocr/","title":"Force full page OCR","text":"In\u00a0[\u00a0]: Copied!
from pathlib import Path\nfrom pathlib import Path In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n TesseractCliOcrOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, TesseractCliOcrOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
def main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n\n # Any of the OCR options can be used:EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions(Mac only), RapidOcrOptions\n # ocr_options = EasyOcrOptions(force_full_page_ocr=True)\n # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n # ocr_options = OcrMacOptions(force_full_page_ocr=True)\n # ocr_options = RapidOcrOptions(force_full_page_ocr=True)\n ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)\n pipeline_options.ocr_options = ocr_options\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n doc = converter.convert(input_doc_path).document\n md = doc.export_to_markdown()\n print(md)\ndef main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = PdfPipelineOptions() pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True # Any of the OCR options can be used:EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions(Mac only), RapidOcrOptions # ocr_options = EasyOcrOptions(force_full_page_ocr=True) # ocr_options = TesseractOcrOptions(force_full_page_ocr=True) # ocr_options = OcrMacOptions(force_full_page_ocr=True) # ocr_options = RapidOcrOptions(force_full_page_ocr=True) ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True) pipeline_options.ocr_options = ocr_options converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(input_doc_path).document md = doc.export_to_markdown() print(md) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/hybrid_chunking/","title":"Hybrid chunking","text":"
Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.
For more details, see here.
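To make the distinction concrete, here is a minimal sketch of the underlying document-based hierarchical chunking on its own, assuming HierarchicalChunker is exposed by docling.chunking (as in recent releases); the hybrid chunker then splits oversized chunks and merges undersized peers according to the tokenizer:

from docling.chunking import HierarchicalChunker
from docling.document_converter import DocumentConverter

# Plain hierarchical chunking: chunks follow the document structure, with no token awareness.
doc = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869").document
for chunk in HierarchicalChunker().chunk(dl_doc=doc):
    print(repr(chunk.text[:80]))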
In\u00a0[1]: Copied!%pip install -qU pip docling transformers\n%pip install -qU pip docling transformers
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
DOC_SOURCE = \"../../tests/data/md/wiki.md\"\nDOC_SOURCE = \"../../tests/data/md/wiki.md\"
We first convert the document:
In\u00a0[3]: Copied!from docling.document_converter import DocumentConverter\n\ndoc = DocumentConverter().convert(source=DOC_SOURCE).document\nfrom docling.document_converter import DocumentConverter doc = DocumentConverter().convert(source=DOC_SOURCE).document
For a basic chunking scenario, we can just instantiate a HybridChunker
, which will use the default parameters.
from docling.chunking import HybridChunker\n\nchunker = HybridChunker()\nchunk_iter = chunker.chunk(dl_doc=doc)\nfrom docling.chunking import HybridChunker chunker = HybridChunker() chunk_iter = chunker.chunk(dl_doc=doc)
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors\n
\ud83d\udc49 NOTE: As you see above, using the HybridChunker
can sometimes lead to a warning from the transformers library; however, this is a \"false alarm\" \u2014 for details, check here.
Note that the text you would typically want to embed is the context-enriched one as returned by the contextualize()
method:
for i, chunk in enumerate(chunk_iter):\n print(f\"=== {i} ===\")\n print(f\"chunk.text:\\n{f'{chunk.text[:300]}\u2026'!r}\")\n\n enriched_text = chunker.contextualize(chunk=chunk)\n print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}\u2026'!r}\")\n\n print()\nfor i, chunk in enumerate(chunk_iter): print(f\"=== {i} ===\") print(f\"chunk.text:\\n{f'{chunk.text[:300]}\u2026'!r}\") enriched_text = chunker.contextualize(chunk=chunk) print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}\u2026'!r}\") print()
=== 0 ===\nchunk.text:\n'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver\u2026'\nchunker.contextualize(chunk):\n'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial \u2026'\n\n=== 1 ===\nchunk.text:\n'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1910s\u20131950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889\u2026'\n\n=== 2 ===\nchunk.text:\n'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1910s\u20131950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John \u2026'\n\n=== 3 ===\nchunk.text:\n'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1960s\u20131980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.\u2026'\n\nIn\u00a0[6]: Copied!
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom transformers import AutoTokenizer\n\nfrom docling.chunking import HybridChunker\n\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nMAX_TOKENS = 64 # set to a small number for illustrative purposes\n\ntokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n)\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer from docling.chunking import HybridChunker EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" MAX_TOKENS = 64 # set to a small number for illustrative purposes tokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case )
\ud83d\udc49 Alternatively, OpenAI tokenizers can be used as shown in the example below (uncomment to use \u2014 requires installing docling-core[chunking-openai]
):
# import tiktoken\n\n# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer\n\n# tokenizer = OpenAITokenizer(\n# tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"),\n# max_tokens=128 * 1024, # context window length required for OpenAI tokenizers\n# )\n# import tiktoken # from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer # tokenizer = OpenAITokenizer( # tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"), # max_tokens=128 * 1024, # context window length required for OpenAI tokenizers # )
We can now instantiate our chunker:
In\u00a0[8]: Copied!chunker = HybridChunker(\n tokenizer=tokenizer,\n merge_peers=True, # optional, defaults to True\n)\nchunk_iter = chunker.chunk(dl_doc=doc)\nchunks = list(chunk_iter)\nchunker = HybridChunker( tokenizer=tokenizer, merge_peers=True, # optional, defaults to True ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter)
Points to notice when looking at the output chunks below:
for i, chunk in enumerate(chunks):\n print(f\"=== {i} ===\")\n txt_tokens = tokenizer.count_tokens(chunk.text)\n print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\")\n\n ser_txt = chunker.contextualize(chunk=chunk)\n ser_tokens = tokenizer.count_tokens(ser_txt)\n print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\")\n\n print()\nfor i, chunk in enumerate(chunks): print(f\"=== {i} ===\") txt_tokens = tokenizer.count_tokens(chunk.text) print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\") ser_txt = chunker.contextualize(chunk=chunk) ser_tokens = tokenizer.count_tokens(ser_txt) print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\") print()
=== 0 ===\nchunk.text (55 tokens):\n'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\nchunker.contextualize(chunk) (56 tokens):\n'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\n\n=== 1 ===\nchunk.text (45 tokens):\n'IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\nchunker.contextualize(chunk) (46 tokens):\n'IBM\\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n\n=== 2 ===\nchunk.text (63 tokens):\n'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n\n=== 3 ===\nchunk.text (44 tokens):\n\"IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\nchunker.contextualize(chunk) (45 tokens):\n\"IBM\\nIBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n\n=== 4 ===\nchunk.text (63 tokens):\n'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, \u2014 its DOS software provided by Microsoft, \u2014 which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, \u2014 its DOS software provided by Microsoft, \u2014 which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n\n=== 5 ===\nchunk.text (61 tokens):\n'IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. 
IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\nchunker.contextualize(chunk) (62 tokens):\n'IBM\\nIBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n\n=== 6 ===\nchunk.text (62 tokens):\n\"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\nchunker.contextualize(chunk) (63 tokens):\n\"IBM\\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n\n=== 7 ===\nchunk.text (63 tokens):\n'language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nlanguage, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n\n=== 8 ===\nchunk.text (5 tokens):\n'Awards.[16]'\nchunker.contextualize(chunk) (6 tokens):\n'IBM\\nAwards.[16]'\n\n=== 9 ===\nchunk.text (56 tokens):\n'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\nchunker.contextualize(chunk) (60 tokens):\n'IBM\\n1910s\u20131950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. 
Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n\n=== 10 ===\nchunk.text (60 tokens):\n\"(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\nchunker.contextualize(chunk) (64 tokens):\n\"IBM\\n1910s\u20131950s\\n(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n\n=== 11 ===\nchunk.text (59 tokens):\n'Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\nchunker.contextualize(chunk) (63 tokens):\n'IBM\\n1910s\u20131950s\\nComputing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n\n=== 12 ===\nchunk.text (13 tokens):\n'D.C.; and Toronto, Canada.[22]'\nchunker.contextualize(chunk) (17 tokens):\n'IBM\\n1910s\u20131950s\\nD.C.; and Toronto, Canada.[22]'\n\n=== 13 ===\nchunk.text (60 tokens):\n'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\n1910s\u20131950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. 
Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n\n=== 14 ===\nchunk.text (59 tokens):\n\"on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\nchunker.contextualize(chunk) (63 tokens):\n\"IBM\\n1910s\u20131950s\\non Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n\n=== 15 ===\nchunk.text (23 tokens):\n\"practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\nchunker.contextualize(chunk) (27 tokens):\n\"IBM\\n1910s\u20131950s\\npractices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n\n=== 16 ===\nchunk.text (59 tokens):\n'He implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\nchunker.contextualize(chunk) (63 tokens):\n'IBM\\n1910s\u20131950s\\nHe implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n\n=== 17 ===\nchunk.text (60 tokens):\n'\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\n1910s\u20131950s\\n\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n\n=== 18 ===\nchunk.text (57 tokens):\n'clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\nchunker.contextualize(chunk) (61 tokens):\n'IBM\\n1910s\u20131950s\\nclumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n\n=== 19 ===\nchunk.text (21 tokens):\n'1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\nchunker.contextualize(chunk) (25 tokens):\n'IBM\\n1910s\u20131950s\\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n\n=== 20 ===\nchunk.text (22 tokens):\n'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\nchunker.contextualize(chunk) (26 tokens):\n'IBM\\n1960s\u20131980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the 
highly successful Selectric typewriter.'\n\n"},{"location":"examples/hybrid_chunking/#hybrid-chunking","title":"Hybrid chunking\u00b6","text":""},{"location":"examples/hybrid_chunking/#overview","title":"Overview\u00b6","text":""},{"location":"examples/hybrid_chunking/#setup","title":"Setup\u00b6","text":""},{"location":"examples/hybrid_chunking/#basic-usage","title":"Basic usage\u00b6","text":""},{"location":"examples/hybrid_chunking/#configuring-tokenization","title":"Configuring tokenization\u00b6","text":"
For more control over the chunking, we can parametrize tokenization as shown below.
In a RAG / retrieval context, it is important to make sure that the chunker and embedding model are using the same tokenizer.
\ud83d\udc49 HuggingFace transformers tokenizers can be used as shown in the following example:
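Once the chunks are computed, the contextualized text can be embedded with the same model, for instance via sentence-transformers. This is a minimal sketch: the sentence-transformers package is an extra dependency not covered by the setup cell above, and it reuses the chunker, chunks, and EMBED_MODEL_ID defined in the previous cells:

from sentence_transformers import SentenceTransformer

# Embed the context-enriched chunk text with the same model whose tokenizer drives the chunker.
embed_model = SentenceTransformer(EMBED_MODEL_ID)
texts = [chunker.contextualize(chunk=chunk) for chunk in chunks]
embeddings = embed_model.encode(texts)
print(embeddings.shape)  # (number of chunks, embedding dimension)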
"},{"location":"examples/inspect_picture_content/","title":"Inspect picture content","text":"In\u00a0[\u00a0]: Copied!from docling_core.types.doc import TextItem\nfrom docling_core.types.doc import TextItem In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
source = \"tests/data/pdf/amt_handbook_sample.pdf\"\nsource = \"tests/data/pdf/amt_handbook_sample.pdf\" In\u00a0[\u00a0]: Copied!
pipeline_options = PdfPipelineOptions()\npipeline_options.images_scale = 2\npipeline_options.generate_page_images = True\npipeline_options = PdfPipelineOptions() pipeline_options.images_scale = 2 pipeline_options.generate_page_images = True In\u00a0[\u00a0]: Copied!
doc_converter = DocumentConverter(\n format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\ndoc_converter = DocumentConverter( format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)} ) In\u00a0[\u00a0]: Copied!
result = doc_converter.convert(source)\nresult = doc_converter.convert(source) In\u00a0[\u00a0]: Copied!
doc = result.document\ndoc = result.document In\u00a0[\u00a0]: Copied!
for picture in doc.pictures:\n # picture.get_image(doc).show() # display the picture\n print(picture.caption_text(doc), \" contains these elements:\")\n\n for item, level in doc.iterate_items(root=picture, traverse_pictures=True):\n if isinstance(item, TextItem):\n print(item.text)\n\n print(\"\\n\")\nfor picture in doc.pictures: # picture.get_image(doc).show() # display the picture print(picture.caption_text(doc), \" contains these elements:\") for item, level in doc.iterate_items(root=picture, traverse_pictures=True): if isinstance(item, TextItem): print(item.text) print(\"\\n\")"},{"location":"examples/minimal/","title":"Simple conversion","text":"In\u00a0[\u00a0]: Copied!
from docling.document_converter import DocumentConverter\nfrom docling.document_converter import DocumentConverter In\u00a0[\u00a0]: Copied!
source = \"https://arxiv.org/pdf/2408.09869\" # document per local path or URL\nsource = \"https://arxiv.org/pdf/2408.09869\" # document per local path or URL In\u00a0[\u00a0]: Copied!
converter = DocumentConverter()\ndoc = converter.convert(source).document\nconverter = DocumentConverter() doc = converter.convert(source).document In\u00a0[\u00a0]: Copied!
print(doc.export_to_markdown())\n# output: ## Docling Technical Report [...]\"\nprint(doc.export_to_markdown()) # output: ## Docling Technical Report [...]\""},{"location":"examples/minimal_asr_pipeline/","title":"ASR pipeline with Whisper","text":"In\u00a0[\u00a0]: Copied!
from pathlib import Path\nfrom pathlib import Path In\u00a0[\u00a0]: Copied!
from docling_core.types.doc import DoclingDocument\nfrom docling_core.types.doc import DoclingDocument In\u00a0[\u00a0]: Copied!
from docling.datamodel import asr_model_specs\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import AsrPipelineOptions\nfrom docling.document_converter import AudioFormatOption, DocumentConverter\nfrom docling.pipeline.asr_pipeline import AsrPipeline\nfrom docling.datamodel import asr_model_specs from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import AsrPipelineOptions from docling.document_converter import AudioFormatOption, DocumentConverter from docling.pipeline.asr_pipeline import AsrPipeline In\u00a0[\u00a0]: Copied!
def get_asr_converter():\n \"\"\"Create a DocumentConverter configured for ASR with whisper_turbo model.\"\"\"\n pipeline_options = AsrPipelineOptions()\n pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO\n\n converter = DocumentConverter(\n format_options={\n InputFormat.AUDIO: AudioFormatOption(\n pipeline_cls=AsrPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n return converter\ndef get_asr_converter(): \"\"\"Create a DocumentConverter configured for ASR with whisper_turbo model.\"\"\" pipeline_options = AsrPipelineOptions() pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO converter = DocumentConverter( format_options={ InputFormat.AUDIO: AudioFormatOption( pipeline_cls=AsrPipeline, pipeline_options=pipeline_options, ) } ) return converter In\u00a0[\u00a0]: Copied!
def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:\n \"\"\"ASR pipeline conversion using whisper_turbo\"\"\"\n # Check if the test audio file exists\n assert audio_path.exists(), f\"Test audio file not found: {audio_path}\"\n\n converter = get_asr_converter()\n\n # Convert the audio file\n result: ConversionResult = converter.convert(audio_path)\n\n # Verify conversion was successful\n assert result.status == ConversionStatus.SUCCESS, (\n f\"Conversion failed with status: {result.status}\"\n )\n return result.document\ndef asr_pipeline_conversion(audio_path: Path) -> DoclingDocument: \"\"\"ASR pipeline conversion using whisper_turbo\"\"\" # Check if the test audio file exists assert audio_path.exists(), f\"Test audio file not found: {audio_path}\" converter = get_asr_converter() # Convert the audio file result: ConversionResult = converter.convert(audio_path) # Verify conversion was successful assert result.status == ConversionStatus.SUCCESS, ( f\"Conversion failed with status: {result.status}\" ) return result.document In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n audio_path = Path(\"tests/data/audio/sample_10s.mp3\")\n\n doc = asr_pipeline_conversion(audio_path=audio_path)\n print(doc.export_to_markdown())\n\n # Expected output:\n #\n # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde\n #\n # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.\nif __name__ == \"__main__\": audio_path = Path(\"tests/data/audio/sample_10s.mp3\") doc = asr_pipeline_conversion(audio_path=audio_path) print(doc.export_to_markdown()) # Expected output: # # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde # # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain."},{"location":"examples/minimal_vlm_pipeline/","title":"VLM pipeline with SmolDocling","text":"In\u00a0[\u00a0]: Copied!
from docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nfrom docling.datamodel import vlm_model_specs from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline In\u00a0[\u00a0]: Copied!
source = \"https://arxiv.org/pdf/2501.17887\"\nsource = \"https://arxiv.org/pdf/2501.17887\" In\u00a0[\u00a0]: Copied!
converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n ),\n }\n)\nconverter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, ), } ) In\u00a0[\u00a0]: Copied!
doc = converter.convert(source=source).document\ndoc = converter.convert(source=source).document In\u00a0[\u00a0]: Copied!
print(doc.export_to_markdown())\nprint(doc.export_to_markdown()) In\u00a0[\u00a0]: Copied!
pipeline_options = VlmPipelineOptions(\n vlm_options=vlm_model_specs.SMOLDOCLING_MLX,\n)\npipeline_options = VlmPipelineOptions( vlm_options=vlm_model_specs.SMOLDOCLING_MLX, ) In\u00a0[\u00a0]: Copied!
converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\nconverter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), } ) In\u00a0[\u00a0]: Copied!
doc = converter.convert(source=source).document\ndoc = converter.convert(source=source).document In\u00a0[\u00a0]: Copied!
print(doc.export_to_markdown())\nprint(doc.export_to_markdown())"},{"location":"examples/minimal_vlm_pipeline/#using-simple-default-values","title":"USING SIMPLE DEFAULT VALUES\u00b6","text":"
For more options see the compare_vlm_models.py example.
"},{"location":"examples/pictures_description/","title":"Annotate picture with local VLM","text":"In\u00a0[\u00a0]: Copied!%pip install -q docling[vlm] ipython\n%pip install -q docling[vlm] ipython
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[1]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[2]: Copied!
# The source document\nDOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\"\n# The source document DOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\" In\u00a0[3]: Copied!
from docling.datamodel.pipeline_options import granite_picture_description\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = (\n granite_picture_description # <-- the model choice\n)\npipeline_options.picture_description_options.prompt = (\n \"Describe the image in three sentences. Be consise and accurate.\"\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\ndoc = converter.convert(DOC_SOURCE).document\nfrom docling.datamodel.pipeline_options import granite_picture_description pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = ( granite_picture_description # <-- the model choice ) pipeline_options.picture_description_options.prompt = ( \"Describe the image in three sentences. Be consise and accurate.\" ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(DOC_SOURCE).document
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.\n
In\u00a0[4]: Copied!
from docling_core.types.doc.document import PictureDescriptionData\nfrom IPython import display\n\nhtml_buffer = []\n# display the first 5 pictures and their captions and annotations:\nfor pic in doc.pictures[:5]:\n html_item = (\n f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n f'<img src=\"{pic.image.uri!s}\" /><br />'\n f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n )\n for annotation in pic.annotations:\n if not isinstance(annotation, PictureDescriptionData):\n continue\n html_item += (\n f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n )\n html_buffer.append(html_item)\ndisplay.HTML(\"<hr />\".join(html_buffer)) Out[4]: Picture #/pictures/0
CaptionFigure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a poster with some text and images. Picture #/pictures/1
CaptionFigure 2: Dataset categories and sample counts for documents and pages.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a pie chart. In the pie chart we can see the categories and the number of documents in each category. Picture #/pictures/2
CaptionFigure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a graph. On the x-axis we can see the number of pages. On the y-axis we can see the seconds. Picture #/pictures/3
CaptionFigure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a bar chart and a line chart. In the bar chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. In the line chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. Picture #/pictures/4
CaptionFigure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a bar chart. In the chart we can see the CPU, Max, GPU, and sec/page. In\u00a0[7]: Copied! from docling.datamodel.pipeline_options import smolvlm_picture_description\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = (\n smolvlm_picture_description # <-- the model choice\n)\npipeline_options.picture_description_options.prompt = (\n \"Describe the image in three sentences. Be consise and accurate.\"\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\ndoc = converter.convert(DOC_SOURCE).document\nfrom docling.datamodel.pipeline_options import smolvlm_picture_description pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = ( smolvlm_picture_description # <-- the model choice ) pipeline_options.picture_description_options.prompt = ( \"Describe the image in three sentences. Be consise and accurate.\" ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(DOC_SOURCE).document In\u00a0[6]: Copied!
from docling_core.types.doc.document import PictureDescriptionData\nfrom IPython import display\n\nhtml_buffer = []\n# display the first 5 pictures and their captions and annotations:\nfor pic in doc.pictures[:5]:\n html_item = (\n f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n f'<img src=\"{pic.image.uri!s}\" /><br />'\n f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n )\n for annotation in pic.annotations:\n if not isinstance(annotation, PictureDescriptionData):\n continue\n html_item += (\n f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n )\n html_buffer.append(html_item)\ndisplay.HTML(\"<hr />\".join(html_buffer)) Out[6]: Picture #/pictures/0
CaptionFigure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)This is a page that has different types of documents on it. Picture #/pictures/1
CaptionFigure 2: Dataset categories and sample counts for documents and pages.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)Here is a page-by-page list of documents per category: - Science - Articles - Law and Regulations - Articles - Misc. Picture #/pictures/2
CaptionFigure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)The image is a bar chart that shows the number of pages of a website as a function of the number of pages of the website. The x-axis represents the number of pages, ranging from 100 to 10,000. The y-axis represents the number of pages, ranging from 100 to 10,000. The chart is labeled \"Number of pages\" and has a legend at the top of the chart that indicates the number of pages. The chart shows a clear trend: as the number of pages increases, the number of pages decreases. This is evident from the following points: - The number of pages increases from 100 to 1000. - The number of pages decreases from 1000 to 10,000. - The number of pages increases from 10,000 to 10,000. Picture #/pictures/3
CaptionFigure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)bar chart with different colored bars representing different data points. Picture #/pictures/4
CaptionFigure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)A bar chart with the following information: - The x-axis represents the number of pages, ranging from 0 to 14. - The y-axis represents the page count, ranging from 0 to 14. - The chart has three categories: Marker, Unstructured, and Detailed. - The x-axis is labeled \"see/page.\" - The y-axis is labeled \"Page Count.\" - The chart shows that the Marker category has the highest number of pages, followed by the Unstructured category, and then the Detailed category. In\u00a0[8]: Copied! from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\n\n# Uncomment to run:\n# doc = converter.convert(DOC_SOURCE).document\nfrom docling.datamodel.pipeline_options import PictureDescriptionVlmOptions pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = PictureDescriptionVlmOptions( repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM prompt=\"Describe the image in three sentences. Be consise and accurate.\", ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) # Uncomment to run: # doc = converter.convert(DOC_SOURCE).document In\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/pictures_description/#describe-pictures-with-granite-vision","title":"Describe pictures with Granite Vision\u00b6","text":"
This section runs the ibm-granite/granite-vision-3.1-2b-preview model locally to describe the pictures in the document.
"},{"location":"examples/pictures_description/#describe-pictures-with-smolvlm","title":"Describe pictures with SmolVLM\u00b6","text":"This section will run locally the HuggingFaceTB/SmolVLM-256M-Instruct model to describe the pictures of the document.
"},{"location":"examples/pictures_description/#use-other-vision-models","title":"Use other vision models\u00b6","text":"The examples above can also be reproduced using other vision model. The Docling options PictureDescriptionVlmOptions
allows to specify your favorite vision model from the Hugging Face Hub.
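For illustration, a hypothetical configuration could reuse the SmolVLM checkpoint already shown on this page as the repo_id; any other vision-language model id from the Hugging Face Hub can be substituted.
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# Illustrative: point repo_id at the SmolVLM checkpoint used earlier on this page
pipeline_options.picture_description_options = PictureDescriptionVlmOptions(
    repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    prompt="Describe the image in three sentences. Be concise and accurate.",
)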
import logging\nimport os\nfrom pathlib import Path\nimport logging import os from pathlib import Path In\u00a0[\u00a0]: Copied!
import requests\nfrom docling_core.types.doc import PictureItem\nfrom dotenv import load_dotenv\nimport requests from docling_core.types.doc import PictureItem from dotenv import load_dotenv In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionApiOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PictureDescriptionApiOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
def vllm_local_options(model: str):\n options = PictureDescriptionApiOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=model,\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=90,\n )\n return options\ndef vllm_local_options(model: str): options = PictureDescriptionApiOptions( url=\"http://localhost:8000/v1/chat/completions\", params=dict( model=model, seed=42, max_completion_tokens=200, ), prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=90, ) return options In\u00a0[\u00a0]: Copied!
def lms_local_options(model: str):\n options = PictureDescriptionApiOptions(\n url=\"http://localhost:1234/v1/chat/completions\",\n params=dict(\n model=model,\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=90,\n )\n return options\ndef lms_local_options(model: str): options = PictureDescriptionApiOptions( url=\"http://localhost:1234/v1/chat/completions\", params=dict( model=model, seed=42, max_completion_tokens=200, ), prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=90, ) return options In\u00a0[\u00a0]: Copied!
def watsonx_vlm_options():\n load_dotenv()\n api_key = os.environ.get(\"WX_API_KEY\")\n project_id = os.environ.get(\"WX_PROJECT_ID\")\n\n def _get_iam_access_token(api_key: str) -> str:\n res = requests.post(\n url=\"https://iam.cloud.ibm.com/identity/token\",\n headers={\n \"Content-Type\": \"application/x-www-form-urlencoded\",\n },\n data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\",\n )\n res.raise_for_status()\n api_out = res.json()\n print(f\"{api_out=}\")\n return api_out[\"access_token\"]\n\n options = PictureDescriptionApiOptions(\n url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\",\n params=dict(\n model_id=\"ibm/granite-vision-3-2-2b\",\n project_id=project_id,\n parameters=dict(\n max_new_tokens=400,\n ),\n ),\n headers={\n \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key),\n },\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=60,\n )\n return options\ndef watsonx_vlm_options(): load_dotenv() api_key = os.environ.get(\"WX_API_KEY\") project_id = os.environ.get(\"WX_PROJECT_ID\") def _get_iam_access_token(api_key: str) -> str: res = requests.post( url=\"https://iam.cloud.ibm.com/identity/token\", headers={ \"Content-Type\": \"application/x-www-form-urlencoded\", }, data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\", ) res.raise_for_status() api_out = res.json() print(f\"{api_out=}\") return api_out[\"access_token\"] options = PictureDescriptionApiOptions( url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\", params=dict( model_id=\"ibm/granite-vision-3-2-2b\", project_id=project_id, parameters=dict( max_new_tokens=400, ), ), headers={ \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key), }, prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=60, ) return options In\u00a0[\u00a0]: Copied!
def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = PdfPipelineOptions(\n enable_remote_services=True # <-- this is required!\n )\n pipeline_options.do_picture_description = True\n\n # The PictureDescriptionApiOptions() allows to interface with APIs supporting\n # the multi-modal chat interface. Here follow a few example on how to configure those.\n #\n # One possibility is self-hosting model, e.g. via VLLM.\n # $ vllm serve MODEL_NAME\n # Then PictureDescriptionApiOptions can point to the localhost endpoint.\n\n # Example for the Granite Vision model:\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = vllm_local_options(\n # model=\"ibm-granite/granite-vision-3.3-2b\"\n # )\n\n # Example for the SmolVLM model:\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = vllm_local_options(\n # model=\"HuggingFaceTB/SmolVLM-256M-Instruct\"\n # )\n\n # For using models on LM Studio using the built-in GGUF or MLX runtimes, e.g. the SmolVLM model:\n # (uncomment the following lines)\n pipeline_options.picture_description_options = lms_local_options(\n model=\"smolvlm-256m-instruct\"\n )\n\n # Another possibility is using online services, e.g. watsonx.ai.\n # Using requires setting the env variables WX_API_KEY and WX_PROJECT_ID.\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = watsonx_vlm_options()\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n\n for element, _level in result.document.iterate_items():\n if isinstance(element, PictureItem):\n print(\n f\"Picture {element.self_ref}\\n\"\n f\"Caption: {element.caption_text(doc=result.document)}\\n\"\n f\"Annotations: {element.annotations}\"\n )\ndef main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = PdfPipelineOptions( enable_remote_services=True # <-- this is required! ) pipeline_options.do_picture_description = True # The PictureDescriptionApiOptions() allows to interface with APIs supporting # the multi-modal chat interface. Here follow a few example on how to configure those. # # One possibility is self-hosting model, e.g. via VLLM. # $ vllm serve MODEL_NAME # Then PictureDescriptionApiOptions can point to the localhost endpoint. # Example for the Granite Vision model: # (uncomment the following lines) # pipeline_options.picture_description_options = vllm_local_options( # model=\"ibm-granite/granite-vision-3.3-2b\" # ) # Example for the SmolVLM model: # (uncomment the following lines) # pipeline_options.picture_description_options = vllm_local_options( # model=\"HuggingFaceTB/SmolVLM-256M-Instruct\" # ) # For using models on LM Studio using the built-in GGUF or MLX runtimes, e.g. the SmolVLM model: # (uncomment the following lines) pipeline_options.picture_description_options = lms_local_options( model=\"smolvlm-256m-instruct\" ) # Another possibility is using online services, e.g. watsonx.ai. # Using requires setting the env variables WX_API_KEY and WX_PROJECT_ID. 
# (uncomment the following lines) # pipeline_options.picture_description_options = watsonx_vlm_options() doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) result = doc_converter.convert(input_doc_path) for element, _level in result.document.iterate_items(): if isinstance(element, PictureItem): print( f\"Picture {element.self_ref}\\n\" f\"Caption: {element.caption_text(doc=result.document)}\\n\" f\"Annotations: {element.annotations}\" ) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/pictures_description_api/#example-of-picturedescriptionapioptions-definitions","title":"Example of PictureDescriptionApiOptions definitions\u00b6","text":""},{"location":"examples/pictures_description_api/#using-vllm","title":"Using vLLM\u00b6","text":"
Models can be launched via: $ vllm serve MODEL_NAME
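For illustration, a compact sketch of wiring this together could look as follows, assuming a vLLM server has already been started (for example with the Granite Vision model referenced in the code above) and reusing the vllm_local_options helper defined earlier on this page; enable_remote_services must be set for API-based picture description.
# Assumes a server was started first, e.g.: vllm serve ibm-granite/granite-vision-3.3-2b
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(enable_remote_services=True)  # required for API-based options
pipeline_options.do_picture_description = True
pipeline_options.picture_description_options = vllm_local_options(  # helper defined above
    model="ibm-granite/granite-vision-3.3-2b"
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
# result = converter.convert("tests/data/pdf/2206.01062.pdf")  # uncomment to run on the sample PDF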
"},{"location":"examples/pictures_description_api/#using-lm-studio","title":"Using LM Studio\u00b6","text":""},{"location":"examples/pictures_description_api/#using-a-cloud-service-like-ibm-watsonxai","title":"Using a cloud service like IBM watsonx.ai\u00b6","text":""},{"location":"examples/pictures_description_api/#usage-and-conversion","title":"Usage and conversion\u00b6","text":""},{"location":"examples/rag_azuresearch/","title":"RAG with Azure AI Search","text":"Step Tech Execution Embedding Azure OpenAI \ud83c\udf10 Remote Vector Store Azure AI Search \ud83c\udf10 Remote Gen AI Azure OpenAI \ud83c\udf10 Remote In\u00a0[\u00a0]: Copied!# If running in a fresh environment (like Google Colab), uncomment and run this single command:\n%pip install \"docling~=2.12\" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv\n# If running in a fresh environment (like Google Colab), uncomment and run this single command: %pip install \"docling~=2.12\" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv In\u00a0[1]: Copied!
import os\n\nfrom dotenv import load_dotenv\n\nload_dotenv()\n\n\ndef _get_env(key, default=None):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key, default)\n\n\nAZURE_SEARCH_ENDPOINT = _get_env(\"AZURE_SEARCH_ENDPOINT\")\nAZURE_SEARCH_KEY = _get_env(\"AZURE_SEARCH_KEY\") # Ensure this is your Admin Key\nAZURE_SEARCH_INDEX_NAME = _get_env(\"AZURE_SEARCH_INDEX_NAME\", \"docling-rag-sample\")\nAZURE_OPENAI_ENDPOINT = _get_env(\"AZURE_OPENAI_ENDPOINT\")\nAZURE_OPENAI_API_KEY = _get_env(\"AZURE_OPENAI_API_KEY\")\nAZURE_OPENAI_API_VERSION = _get_env(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\")\nAZURE_OPENAI_CHAT_MODEL = _get_env(\n \"AZURE_OPENAI_CHAT_MODEL\"\n) # Using a deployed model named \"gpt-4o\"\nAZURE_OPENAI_EMBEDDINGS = _get_env(\n \"AZURE_OPENAI_EMBEDDINGS\", \"text-embedding-3-small\"\n) # Using a deployed model named \"text-embeddings-3-small\"\nimport os from dotenv import load_dotenv load_dotenv() def _get_env(key, default=None): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key, default) AZURE_SEARCH_ENDPOINT = _get_env(\"AZURE_SEARCH_ENDPOINT\") AZURE_SEARCH_KEY = _get_env(\"AZURE_SEARCH_KEY\") # Ensure this is your Admin Key AZURE_SEARCH_INDEX_NAME = _get_env(\"AZURE_SEARCH_INDEX_NAME\", \"docling-rag-sample\") AZURE_OPENAI_ENDPOINT = _get_env(\"AZURE_OPENAI_ENDPOINT\") AZURE_OPENAI_API_KEY = _get_env(\"AZURE_OPENAI_API_KEY\") AZURE_OPENAI_API_VERSION = _get_env(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\") AZURE_OPENAI_CHAT_MODEL = _get_env( \"AZURE_OPENAI_CHAT_MODEL\" ) # Using a deployed model named \"gpt-4o\" AZURE_OPENAI_EMBEDDINGS = _get_env( \"AZURE_OPENAI_EMBEDDINGS\", \"text-embedding-3-small\" ) # Using a deployed model named \"text-embeddings-3-small\" In\u00a0[11]: Copied!
from rich.console import Console\nfrom rich.panel import Panel\n\nfrom docling.document_converter import DocumentConverter\n\nconsole = Console()\n\n# This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages\nsource_url = \"https://arxiv.org/pdf/2404.16130\"\n\nconsole.print(\n \"[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]\"\n)\nconverter = DocumentConverter()\nresult = converter.convert(source_url)\n\n# Optional: preview the parsed Markdown\nmd_preview = result.document.export_to_markdown()\nconsole.print(Panel(md_preview[:500] + \"...\", title=\"Docling Markdown Preview\"))\nfrom rich.console import Console from rich.panel import Panel from docling.document_converter import DocumentConverter console = Console() # This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages source_url = \"https://arxiv.org/pdf/2404.16130\" console.print( \"[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]\" ) converter = DocumentConverter() result = converter.convert(source_url) # Optional: preview the parsed Markdown md_preview = result.document.export_to_markdown() console.print(Panel(md_preview[:500] + \"...\", title=\"Docling Markdown Preview\"))
Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Docling Markdown Preview \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 ## From Local to Global: A Graph RAG Approach to Query-Focused Summarization \u2502\n\u2502 \u2502\n\u2502 Darren Edge 1\u2020 \u2502\n\u2502 \u2502\n\u2502 Ha Trinh 1\u2020 \u2502\n\u2502 \u2502\n\u2502 Newman Cheng 2 \u2502\n\u2502 \u2502\n\u2502 Joshua Bradley 2 \u2502\n\u2502 \u2502\n\u2502 Alex Chao 3 \u2502\n\u2502 \u2502\n\u2502 Apurva Mody 3 \u2502\n\u2502 \u2502\n\u2502 Steven Truitt 2 \u2502\n\u2502 \u2502\n\u2502 ## Jonathan Larson 1 \u2502\n\u2502 \u2502\n\u2502 1 Microsoft Research 2 Microsoft Strategic Missions and Technologies 3 Microsoft Office of the CTO \u2502\n\u2502 \u2502\n\u2502 { daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso } @microsoft.com \u2502\n\u2502 \u2502\n\u2502 \u2020 These authors contributed equally to this work \u2502\n\u2502 \u2502\n\u2502 ## Abstract \u2502\n\u2502 \u2502\n\u2502 The use of retrieval-augmented gen... \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nIn\u00a0[22]: Copied!
from docling.chunking import HierarchicalChunker\n\nchunker = HierarchicalChunker()\ndoc_chunks = list(chunker.chunk(result.document))\n\nall_chunks = []\nfor idx, c in enumerate(doc_chunks):\n chunk_text = c.text\n all_chunks.append((f\"chunk_{idx}\", chunk_text))\n\nconsole.print(f\"Total chunks from PDF: {len(all_chunks)}\")\nfrom docling.chunking import HierarchicalChunker chunker = HierarchicalChunker() doc_chunks = list(chunker.chunk(result.document)) all_chunks = [] for idx, c in enumerate(doc_chunks): chunk_text = c.text all_chunks.append((f\"chunk_{idx}\", chunk_text)) console.print(f\"Total chunks from PDF: {len(all_chunks)}\")
Total chunks from PDF: 106\nIn\u00a0[\u00a0]: Copied!
from azure.core.credentials import AzureKeyCredential\nfrom azure.search.documents.indexes import SearchIndexClient\nfrom azure.search.documents.indexes.models import (\n AzureOpenAIVectorizer,\n AzureOpenAIVectorizerParameters,\n HnswAlgorithmConfiguration,\n SearchableField,\n SearchField,\n SearchFieldDataType,\n SearchIndex,\n SimpleField,\n VectorSearch,\n VectorSearchProfile,\n)\nfrom rich.console import Console\n\nconsole = Console()\n\nVECTOR_DIM = 1536 # Adjust based on your chosen embeddings model\n\nindex_client = SearchIndexClient(\n AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY)\n)\n\n\ndef create_search_index(index_name: str):\n # Define fields\n fields = [\n SimpleField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True),\n SearchableField(name=\"content\", type=SearchFieldDataType.String),\n SearchField(\n name=\"content_vector\",\n type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n searchable=True,\n filterable=False,\n sortable=False,\n facetable=False,\n vector_search_dimensions=VECTOR_DIM,\n vector_search_profile_name=\"default\",\n ),\n ]\n # Vector search config with an AzureOpenAIVectorizer\n vector_search = VectorSearch(\n algorithms=[HnswAlgorithmConfiguration(name=\"default\")],\n profiles=[\n VectorSearchProfile(\n name=\"default\",\n algorithm_configuration_name=\"default\",\n vectorizer_name=\"default\",\n )\n ],\n vectorizers=[\n AzureOpenAIVectorizer(\n vectorizer_name=\"default\",\n parameters=AzureOpenAIVectorizerParameters(\n resource_url=AZURE_OPENAI_ENDPOINT,\n deployment_name=AZURE_OPENAI_EMBEDDINGS,\n model_name=\"text-embedding-3-small\",\n api_key=AZURE_OPENAI_API_KEY,\n ),\n )\n ],\n )\n\n # Create or update the index\n new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)\n try:\n index_client.delete_index(index_name)\n except Exception:\n pass\n\n index_client.create_or_update_index(new_index)\n console.print(f\"Index '{index_name}' created.\")\n\n\ncreate_search_index(AZURE_SEARCH_INDEX_NAME)\nfrom azure.core.credentials import AzureKeyCredential from azure.search.documents.indexes import SearchIndexClient from azure.search.documents.indexes.models import ( AzureOpenAIVectorizer, AzureOpenAIVectorizerParameters, HnswAlgorithmConfiguration, SearchableField, SearchField, SearchFieldDataType, SearchIndex, SimpleField, VectorSearch, VectorSearchProfile, ) from rich.console import Console console = Console() VECTOR_DIM = 1536 # Adjust based on your chosen embeddings model index_client = SearchIndexClient( AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY) ) def create_search_index(index_name: str): # Define fields fields = [ SimpleField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True), SearchableField(name=\"content\", type=SearchFieldDataType.String), SearchField( name=\"content_vector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), searchable=True, filterable=False, sortable=False, facetable=False, vector_search_dimensions=VECTOR_DIM, vector_search_profile_name=\"default\", ), ] # Vector search config with an AzureOpenAIVectorizer vector_search = VectorSearch( algorithms=[HnswAlgorithmConfiguration(name=\"default\")], profiles=[ VectorSearchProfile( name=\"default\", algorithm_configuration_name=\"default\", vectorizer_name=\"default\", ) ], vectorizers=[ AzureOpenAIVectorizer( vectorizer_name=\"default\", parameters=AzureOpenAIVectorizerParameters( resource_url=AZURE_OPENAI_ENDPOINT, deployment_name=AZURE_OPENAI_EMBEDDINGS, 
model_name=\"text-embedding-3-small\", api_key=AZURE_OPENAI_API_KEY, ), ) ], ) # Create or update the index new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search) try: index_client.delete_index(index_name) except Exception: pass index_client.create_or_update_index(new_index) console.print(f\"Index '{index_name}' created.\") create_search_index(AZURE_SEARCH_INDEX_NAME)
Index 'docling-rag-sample-2' created.\nIn\u00a0[28]: Copied!
from azure.search.documents import SearchClient\nfrom openai import AzureOpenAI\n\nsearch_client = SearchClient(\n AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY)\n)\nopenai_client = AzureOpenAI(\n api_key=AZURE_OPENAI_API_KEY,\n api_version=AZURE_OPENAI_API_VERSION,\n azure_endpoint=AZURE_OPENAI_ENDPOINT,\n)\n\n\ndef embed_text(text: str):\n \"\"\"\n Helper to generate embeddings with Azure OpenAI.\n \"\"\"\n response = openai_client.embeddings.create(\n input=text, model=AZURE_OPENAI_EMBEDDINGS\n )\n return response.data[0].embedding\n\n\nupload_docs = []\nfor chunk_id, chunk_text in all_chunks:\n embedding_vector = embed_text(chunk_text)\n upload_docs.append(\n {\n \"chunk_id\": chunk_id,\n \"content\": chunk_text,\n \"content_vector\": embedding_vector,\n }\n )\n\n\nBATCH_SIZE = 50\nfor i in range(0, len(upload_docs), BATCH_SIZE):\n subset = upload_docs[i : i + BATCH_SIZE]\n resp = search_client.upload_documents(documents=subset)\n\n all_succeeded = all(r.succeeded for r in resp)\n console.print(\n f\"Uploaded batch {i} -> {i + len(subset)}; all_succeeded: {all_succeeded}, \"\n f\"first_doc_status_code: {resp[0].status_code}\"\n )\n\nconsole.print(\"All chunks uploaded to Azure Search.\")\nfrom azure.search.documents import SearchClient from openai import AzureOpenAI search_client = SearchClient( AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY) ) openai_client = AzureOpenAI( api_key=AZURE_OPENAI_API_KEY, api_version=AZURE_OPENAI_API_VERSION, azure_endpoint=AZURE_OPENAI_ENDPOINT, ) def embed_text(text: str): \"\"\" Helper to generate embeddings with Azure OpenAI. \"\"\" response = openai_client.embeddings.create( input=text, model=AZURE_OPENAI_EMBEDDINGS ) return response.data[0].embedding upload_docs = [] for chunk_id, chunk_text in all_chunks: embedding_vector = embed_text(chunk_text) upload_docs.append( { \"chunk_id\": chunk_id, \"content\": chunk_text, \"content_vector\": embedding_vector, } ) BATCH_SIZE = 50 for i in range(0, len(upload_docs), BATCH_SIZE): subset = upload_docs[i : i + BATCH_SIZE] resp = search_client.upload_documents(documents=subset) all_succeeded = all(r.succeeded for r in resp) console.print( f\"Uploaded batch {i} -> {i + len(subset)}; all_succeeded: {all_succeeded}, \" f\"first_doc_status_code: {resp[0].status_code}\" ) console.print(\"All chunks uploaded to Azure Search.\")
Uploaded batch 0 -> 50; all_succeeded: True, first_doc_status_code: 201\n
Uploaded batch 50 -> 100; all_succeeded: True, first_doc_status_code: 201\n
Uploaded batch 100 -> 106; all_succeeded: True, first_doc_status_code: 201\n
All chunks uploaded to Azure Search.\nIn\u00a0[29]: Copied!
from typing import Optional\n\nfrom azure.search.documents.models import VectorizableTextQuery\n\n\ndef generate_chat_response(prompt: str, system_message: Optional[str] = None):\n \"\"\"\n Generates a single-turn chat response using Azure OpenAI Chat.\n If you need multi-turn conversation or follow-up queries, you'll have to\n maintain the messages list externally.\n \"\"\"\n messages = []\n if system_message:\n messages.append({\"role\": \"system\", \"content\": system_message})\n messages.append({\"role\": \"user\", \"content\": prompt})\n\n completion = openai_client.chat.completions.create(\n model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7\n )\n return completion.choices[0].message.content\n\n\nuser_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\"\nuser_embed = embed_text(user_query)\n\nvector_query = VectorizableTextQuery(\n text=user_query, # passing in text for a hybrid search\n k_nearest_neighbors=5,\n fields=\"content_vector\",\n)\n\nsearch_results = search_client.search(\n search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10\n)\n\nretrieved_chunks = []\nfor result in search_results:\n snippet = result[\"content\"]\n retrieved_chunks.append(snippet)\n\ncontext_str = \"\\n---\\n\".join(retrieved_chunks)\nrag_prompt = f\"\"\"\nYou are an AI assistant helping answering questions about Microsoft GraphRAG.\nUse ONLY the text below to answer the user's question.\nIf the answer isn't in the text, say you don't know.\n\nContext:\n{context_str}\n\nQuestion: {user_query}\nAnswer:\n\"\"\"\n\nfinal_answer = generate_chat_response(rag_prompt)\n\nconsole.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\"))\nconsole.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\"))\nfrom typing import Optional from azure.search.documents.models import VectorizableTextQuery def generate_chat_response(prompt: str, system_message: Optional[str] = None): \"\"\" Generates a single-turn chat response using Azure OpenAI Chat. If you need multi-turn conversation or follow-up queries, you'll have to maintain the messages list externally. \"\"\" messages = [] if system_message: messages.append({\"role\": \"system\", \"content\": system_message}) messages.append({\"role\": \"user\", \"content\": prompt}) completion = openai_client.chat.completions.create( model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7 ) return completion.choices[0].message.content user_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\" user_embed = embed_text(user_query) vector_query = VectorizableTextQuery( text=user_query, # passing in text for a hybrid search k_nearest_neighbors=5, fields=\"content_vector\", ) search_results = search_client.search( search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10 ) retrieved_chunks = [] for result in search_results: snippet = result[\"content\"] retrieved_chunks.append(snippet) context_str = \"\\n---\\n\".join(retrieved_chunks) rag_prompt = f\"\"\" You are an AI assistant helping answering questions about Microsoft GraphRAG. Use ONLY the text below to answer the user's question. If the answer isn't in the text, say you don't know. 
Context: {context_str} Question: {user_query} Answer: \"\"\" final_answer = generate_chat_response(rag_prompt) console.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\")) console.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\"))
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 RAG Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 \u2502\n\u2502 You are an AI assistant helping answering questions about Microsoft GraphRAG. \u2502\n\u2502 Use ONLY the text below to answer the user's question. \u2502\n\u2502 If the answer isn't in the text, say you don't know. \u2502\n\u2502 \u2502\n\u2502 Context: \u2502\n\u2502 Community summaries vs. source texts. When comparing community summaries to source texts using Graph RAG, \u2502\n\u2502 community summaries generally provided a small but consistent improvement in answer comprehensiveness and \u2502\n\u2502 diversity, except for root-level summaries. Intermediate-level summaries in the Podcast dataset and low-level \u2502\n\u2502 community summaries in the News dataset achieved comprehensiveness win rates of 57% and 64%, respectively. \u2502\n\u2502 Diversity win rates were 57% for Podcast intermediate-level summaries and 60% for News low-level community \u2502\n\u2502 summaries. Table 3 also illustrates the scalability advantages of Graph RAG compared to source text \u2502\n\u2502 summarization: for low-level community summaries ( C3 ), Graph RAG required 26-33% fewer context tokens, while \u2502\n\u2502 for root-level community summaries ( C0 ), it required over 97% fewer tokens. For a modest drop in performance \u2502\n\u2502 compared with other global methods, root-level Graph RAG offers a highly efficient method for the iterative \u2502\n\u2502 question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness \u2502\n\u2502 (72% win rate) and diversity (62% win rate) over na\u00a8\u0131ve RAG. \u2502\n\u2502 --- \u2502\n\u2502 We have presented a global approach to Graph RAG, combining knowledge graph generation, retrieval-augmented \u2502\n\u2502 generation (RAG), and query-focused summarization (QFS) to support human sensemaking over entire text corpora. \u2502\n\u2502 Initial evaluations show substantial improvements over a na\u00a8\u0131ve RAG baseline for both the comprehensiveness and \u2502\n\u2502 diversity of answers, as well as favorable comparisons to a global but graph-free approach using map-reduce \u2502\n\u2502 source text summarization. For situations requiring many global queries over the same dataset, summaries of \u2502\n\u2502 root-level communities in the entity-based graph index provide a data index that is both superior to na\u00a8\u0131ve RAG \u2502\n\u2502 and achieves competitive performance to other global methods at a fraction of the token cost. \u2502\n\u2502 --- \u2502\n\u2502 Trade-offs of building a graph index . We consistently observed Graph RAG achieve the best headto-head results \u2502\n\u2502 against other methods, but in many cases the graph-free approach to global summarization of source texts \u2502\n\u2502 performed competitively. 
The real-world decision about whether to invest in building a graph index depends on \u2502\n\u2502 multiple factors, including the compute budget, expected number of lifetime queries per dataset, and value \u2502\n\u2502 obtained from other aspects of the graph index (including the generic community summaries and the use of other \u2502\n\u2502 graph-related RAG approaches). \u2502\n\u2502 --- \u2502\n\u2502 Future work . The graph index, rich text annotations, and hierarchical community structure supporting the \u2502\n\u2502 current Graph RAG approach offer many possibilities for refinement and adaptation. This includes RAG approaches \u2502\n\u2502 that operate in a more local manner, via embedding-based matching of user queries and graph annotations, as \u2502\n\u2502 well as the possibility of hybrid RAG schemes that combine embedding-based matching against community reports \u2502\n\u2502 before employing our map-reduce summarization mechanisms. This 'roll-up' operation could also be extended \u2502\n\u2502 across more levels of the community hierarchy, as well as implemented as a more exploratory 'drill down' \u2502\n\u2502 mechanism that follows the information scent contained in higher-level community summaries. \u2502\n\u2502 --- \u2502\n\u2502 Advanced RAG systems include pre-retrieval, retrieval, post-retrieval strategies designed to overcome the \u2502\n\u2502 drawbacks of Na\u00a8\u0131ve RAG, while Modular RAG systems include patterns for iterative and dynamic cycles of \u2502\n\u2502 interleaved retrieval and generation (Gao et al., 2023). Our implementation of Graph RAG incorporates multiple \u2502\n\u2502 concepts related to other systems. For example, our community summaries are a kind of self-memory (Selfmem, \u2502\n\u2502 Cheng et al., 2024) for generation-augmented retrieval (GAR, Mao et al., 2020) that facilitates future \u2502\n\u2502 generation cycles, while our parallel generation of community answers from these summaries is a kind of \u2502\n\u2502 iterative (Iter-RetGen, Shao et al., 2023) or federated (FeB4RAG, Wang et al., 2024) retrieval-generation \u2502\n\u2502 strategy. Other systems have also combined these concepts for multi-document summarization (CAiRE-COVID, Su et \u2502\n\u2502 al., 2020) and multi-hop question answering (ITRG, Feng et al., 2023; IR-CoT, Trivedi et al., 2022; DSP, \u2502\n\u2502 Khattab et al., 2022). Our use of a hierarchical index and summarization also bears resemblance to further \u2502\n\u2502 approaches, such as generating a hierarchical index of text chunks by clustering the vectors of text embeddings \u2502\n\u2502 (RAPTOR, Sarthi et al., 2024) or generating a 'tree of clarifications' to answer multiple interpretations of \u2502\n\u2502 ambiguous questions (Kim et al., 2023). However, none of these iterative or hierarchical approaches use the \u2502\n\u2502 kind of self-generated graph index that enables Graph RAG. \u2502\n\u2502 --- \u2502\n\u2502 The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge \u2502\n\u2502 source enables large language models (LLMs) to answer questions over private and/or previously unseen document \u2502\n\u2502 collections. However, RAG fails on global questions directed at an entire text corpus, such as 'What are the \u2502\n\u2502 main themes in the dataset?', since this is inherently a queryfocused summarization (QFS) task, rather than an \u2502\n\u2502 explicit retrieval task. 
Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by \u2502\n\u2502 typical RAGsystems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to \u2502\n\u2502 question answering over private text corpora that scales with both the generality of user questions and the \u2502\n\u2502 quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two \u2502\n\u2502 stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community \u2502\n\u2502 summaries for all groups of closely-related entities. Given a question, each community summary is used to \u2502\n\u2502 generate a partial response, before all partial responses are again summarized in a final response to the user. \u2502\n\u2502 For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG \u2502\n\u2502 leads to substantial improvements over a na\u00a8\u0131ve RAG baseline for both the comprehensiveness and diversity of \u2502\n\u2502 generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is \u2502\n\u2502 forthcoming at https://aka . ms/graphrag . \u2502\n\u2502 --- \u2502\n\u2502 Given the multi-stage nature of our Graph RAG mechanism, the multiple conditions we wanted to compare, and the \u2502\n\u2502 lack of gold standard answers to our activity-based sensemaking questions, we decided to adopt a head-to-head \u2502\n\u2502 comparison approach using an LLM evaluator. We selected three target metrics capturing qualities that are \u2502\n\u2502 desirable for sensemaking activities, as well as a control metric (directness) used as a indicator of validity. \u2502\n\u2502 Since directness is effectively in opposition to comprehensiveness and diversity, we would not expect any \u2502\n\u2502 method to win across all four metrics. \u2502\n\u2502 --- \u2502\n\u2502 Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This index spans nodes \u2502\n\u2502 (e.g., entities), edges (e.g., relationships), and covariates (e.g., claims) that have been detected, \u2502\n\u2502 extracted, and summarized by LLM prompts tailored to the domain of the dataset. Community detection (e.g., \u2502\n\u2502 Leiden, Traag et al., 2019) is used to partition the graph index into groups of elements (nodes, edges, \u2502\n\u2502 covariates) that the LLM can summarize in parallel at both indexing time and query time. The 'global answer' to \u2502\n\u2502 a given query is produced using a final round of query-focused summarization over all community summaries \u2502\n\u2502 reporting relevance to that query. \u2502\n\u2502 --- \u2502\n\u2502 Retrieval-augmented generation (RAG, Lewis et al., 2020) is an established approach to answering user questions \u2502\n\u2502 over entire datasets, but it is designed for situations where these answers are contained locally within \u2502\n\u2502 regions of text whose retrieval provides sufficient grounding for the generation task. Instead, a more \u2502\n\u2502 appropriate task framing is query-focused summarization (QFS, Dang, 2006), and in particular, query-focused \u2502\n\u2502 abstractive summarization that generates natural language summaries and not just concatenated excerpts (Baumel \u2502\n\u2502 et al., 2018; Laskar et al., 2020; Yao et al., 2017) . 
In recent years, however, such distinctions between \u2502\n\u2502 summarization tasks that are abstractive versus extractive, generic versus query-focused, and single-document \u2502\n\u2502 versus multi-document, have become less relevant. While early applications of the transformer architecture \u2502\n\u2502 showed substantial improvements on the state-of-the-art for all such summarization tasks (Goodwin et al., 2020; \u2502\n\u2502 Laskar et al., 2022; Liu and Lapata, 2019), these tasks are now trivialized by modern LLMs, including the GPT \u2502\n\u2502 (Achiam et al., 2023; Brown et al., 2020), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) series, \u2502\n\u2502 all of which can use in-context learning to summarize any content provided in their context window. \u2502\n\u2502 --- \u2502\n\u2502 community descriptions provide complete coverage of the underlying graph index and the input documents it \u2502\n\u2502 represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: \u2502\n\u2502 first using each community summary to answer the query independently and in parallel, then summarizing all \u2502\n\u2502 relevant partial answers into a final global answer. \u2502\n\u2502 \u2502\n\u2502 Question: What are the main advantages of using the Graph RAG approach for query-focused summarization compared \u2502\n\u2502 to traditional RAG methods? \u2502\n\u2502 Answer: \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 RAG Response \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 The main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG \u2502\n\u2502 methods include: \u2502\n\u2502 \u2502\n\u2502 1. **Improved Comprehensiveness and Diversity**: Graph RAG shows substantial improvements over a na\u00efve RAG \u2502\n\u2502 baseline in terms of the comprehensiveness and diversity of answers. This is particularly beneficial for global \u2502\n\u2502 sensemaking questions over large datasets. \u2502\n\u2502 \u2502\n\u2502 2. **Scalability**: Graph RAG provides scalability advantages, achieving efficient summarization with \u2502\n\u2502 significantly fewer context tokens required. For instance, it requires 26-33% fewer tokens for low-level \u2502\n\u2502 community summaries and over 97% fewer tokens for root-level summaries compared to source text summarization. \u2502\n\u2502 \u2502\n\u2502 3. **Efficiency in Iterative Question Answering**: Root-level Graph RAG offers a highly efficient method for \u2502\n\u2502 iterative question answering, which is crucial for sensemaking activities, with only a modest drop in \u2502\n\u2502 performance compared to other global methods. \u2502\n\u2502 \u2502\n\u2502 4. **Global Query Handling**: It supports handling global queries effectively, as it combines knowledge graph \u2502\n\u2502 generation, retrieval-augmented generation, and query-focused summarization, making it suitable for sensemaking \u2502\n\u2502 over entire text corpora. \u2502\n\u2502 \u2502\n\u2502 5. **Hierarchical Indexing and Summarization**: The use of a hierarchical index and summarization allows for \u2502\n\u2502 efficient processing and summarizing of community summaries into a final global answer, facilitating a \u2502\n\u2502 comprehensive coverage of the underlying graph index and input documents. \u2502\n\u2502 \u2502\n\u2502 6. **Reduced Token Cost**: For situations requiring many global queries over the same dataset, Graph RAG \u2502\n\u2502 achieves competitive performance to other global methods at a fraction of the token cost. 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/rag_azuresearch/#rag-with-azure-ai-search","title":"RAG with Azure AI Search\u00b6","text":""},{"location":"examples/rag_azuresearch/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using Docling for document parsing, Azure AI Search as the vector store, and Azure OpenAI for embeddings and chat completion.
The sample walks through parsing a PDF with Docling, chunking it hierarchically, generating embeddings and indexing them in Azure AI Search, and answering questions grounded in the retrieved chunks. You will need:
an Azure AI Search resource,
an Azure OpenAI resource with a deployed embedding and chat completion model (e.g. text-embedding-3-small and gpt-4o), and
Docling 2.12+ in a Python 3.8+ environment (docling_core is installed automatically).
A GPU-enabled environment is preferred for faster parsing; Docling 2.12 automatically detects and uses a GPU if one is present.
We\u2019ll parse the Microsoft GraphRAG Research Paper (~15 pages). Parsing should be relatively quick, even on CPU, but it will be faster on a GPU or MPS device if available.
(If you prefer a different document, simply provide a different URL or local file path.)
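For reference, a minimal sketch of this parsing step (the arXiv URL below is an assumption for the GraphRAG paper; any URL or local file path works):
from docling.document_converter import DocumentConverter

SOURCE_URL = "https://arxiv.org/pdf/2404.16130"  # assumed arXiv URL of the GraphRAG paper; replace as needed

converter = DocumentConverter()
result = converter.convert(SOURCE_URL)
doc = result.document  # a DoclingDocument

# Quick sanity check: preview the converted content as Markdown
print(doc.export_to_markdown()[:500])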
"},{"location":"examples/rag_azuresearch/#part-2-hierarchical-chunking","title":"Part 2: Hierarchical Chunking\u00b6","text":"We convert the Document
into smaller chunks for embedding and indexing. The built-in HierarchicalChunker
preserves structure.
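A minimal sketch of this chunking step, assuming doc is the DoclingDocument produced by the conversion above:
from docling_core.transforms.chunker import HierarchicalChunker

chunker = HierarchicalChunker()
chunks = list(chunker.chunk(doc))  # doc: the DoclingDocument from the conversion step

print(f"{len(chunks)} chunks")
print(chunks[0].text[:300])  # each chunk carries text plus structural metadata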
We\u2019ll define a vector index in Azure AI Search, then embed each chunk using Azure OpenAI and upload in batches.
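A rough sketch of the embed-and-upload step, assuming the chunks from the previous part and an existing index with id, content, and content_vector fields; all endpoints, keys, and names below are placeholders:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Placeholder resources; replace with your own endpoints, keys, and names
aoai_client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<your-azure-openai-key>",
    api_version="2024-02-01",
)
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="docling-rag-sample",  # assumed index with id/content/content_vector fields
    credential=AzureKeyCredential("<your-search-admin-key>"),
)

def embed(text: str):
    # text-embedding-3-small produces 1536-dimensional vectors
    resp = aoai_client.embeddings.create(input=text, model="text-embedding-3-small")
    return resp.data[0].embedding

docs = [
    {"id": str(i), "content": chunk.text, "content_vector": embed(chunk.text)}
    for i, chunk in enumerate(chunks)
]
# Upload in small batches to stay within request size limits
for start in range(0, len(docs), 50):
    search_client.upload_documents(documents=docs[start : start + 50])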
"},{"location":"examples/rag_azuresearch/#generate-embeddings-and-upload-to-azure-ai-search","title":"Generate Embeddings and Upload to Azure AI Search\u00b6","text":""},{"location":"examples/rag_azuresearch/#part-4-perform-rag-over-pdf","title":"Part 4: Perform RAG over PDF\u00b6","text":"Combine retrieval from Azure AI Search with Azure OpenAI Chat Completions (aka. grounding your LLM)
"},{"location":"examples/rag_haystack/","title":"RAG with Haystack","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 RemoteThis example leverages the Haystack Docling extension, along with Milvus-based document store and retriever instances, as well as sentence-transformers embeddings.
The presented DoclingConverter
component enables you to:
DoclingConverter
supports two different export modes:
ExportType.MARKDOWN
: if you want to capture each input document as a separate Haystack document, orExportType.DOC_CHUNKS
(default): if you want to have each input document chunked and then capture each individual chunk as a separate Haystack document downstream. The example allows exploring both modes via parameter EXPORT_TYPE
; depending on the value set, the ingestion and RAG pipelines are then set up accordingly.
HF_TOKEN
.--no-warn-conflicts
meant for Colab's pre-populated Python env; feel free to remove for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling pymilvus milvus-haystack sentence-transformers python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling pymilvus milvus-haystack sentence-transformers python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom docling_haystack.converter import ExportType\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nPATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nEXPORT_TYPE = ExportType.DOC_CHUNKS\nQUESTION = \"Which are the main AI models in Docling?\"\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nimport os from pathlib import Path from tempfile import mkdtemp from docling_haystack.converter import ExportType from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") PATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" EXPORT_TYPE = ExportType.DOC_CHUNKS QUESTION = \"Which are the main AI models in Docling?\" TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied!
from docling_haystack.converter import DoclingConverter\nfrom haystack import Pipeline\nfrom haystack.components.embedders import (\n SentenceTransformersDocumentEmbedder,\n SentenceTransformersTextEmbedder,\n)\nfrom haystack.components.preprocessors import DocumentSplitter\nfrom haystack.components.writers import DocumentWriter\nfrom milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever\n\nfrom docling.chunking import HybridChunker\n\ndocument_store = MilvusDocumentStore(\n connection_args={\"uri\": MILVUS_URI},\n drop_old=True,\n text_field=\"txt\", # set for preventing conflict with same-name metadata field\n)\n\nidx_pipe = Pipeline()\nidx_pipe.add_component(\n \"converter\",\n DoclingConverter(\n export_type=EXPORT_TYPE,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n ),\n)\nidx_pipe.add_component(\n \"embedder\",\n SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),\n)\nidx_pipe.add_component(\"writer\", DocumentWriter(document_store=document_store))\nif EXPORT_TYPE == ExportType.DOC_CHUNKS:\n idx_pipe.connect(\"converter\", \"embedder\")\nelif EXPORT_TYPE == ExportType.MARKDOWN:\n idx_pipe.add_component(\n \"splitter\",\n DocumentSplitter(split_by=\"sentence\", split_length=1),\n )\n idx_pipe.connect(\"converter.documents\", \"splitter.documents\")\n idx_pipe.connect(\"splitter.documents\", \"embedder.documents\")\nelse:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\nidx_pipe.connect(\"embedder\", \"writer\")\nidx_pipe.run({\"converter\": {\"paths\": PATHS}})\nfrom docling_haystack.converter import DoclingConverter from haystack import Pipeline from haystack.components.embedders import ( SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder, ) from haystack.components.preprocessors import DocumentSplitter from haystack.components.writers import DocumentWriter from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever from docling.chunking import HybridChunker document_store = MilvusDocumentStore( connection_args={\"uri\": MILVUS_URI}, drop_old=True, text_field=\"txt\", # set for preventing conflict with same-name metadata field ) idx_pipe = Pipeline() idx_pipe.add_component( \"converter\", DoclingConverter( export_type=EXPORT_TYPE, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ), ) idx_pipe.add_component( \"embedder\", SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID), ) idx_pipe.add_component(\"writer\", DocumentWriter(document_store=document_store)) if EXPORT_TYPE == ExportType.DOC_CHUNKS: idx_pipe.connect(\"converter\", \"embedder\") elif EXPORT_TYPE == ExportType.MARKDOWN: idx_pipe.add_component( \"splitter\", DocumentSplitter(split_by=\"sentence\", split_length=1), ) idx_pipe.connect(\"converter.documents\", \"splitter.documents\") idx_pipe.connect(\"splitter.documents\", \"embedder.documents\") else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\") idx_pipe.connect(\"embedder\", \"writer\") idx_pipe.run({\"converter\": {\"paths\": PATHS}})
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n
Batches: 0%| | 0/2 [00:00<?, ?it/s]Out[3]:
{'writer': {'documents_written': 54}}In\u00a0[4]: Copied!
from haystack.components.builders import AnswerBuilder\nfrom haystack.components.builders.prompt_builder import PromptBuilder\nfrom haystack.components.generators import HuggingFaceAPIGenerator\nfrom haystack.utils import Secret\n\nprompt_template = \"\"\"\n Given these documents, answer the question.\n Documents:\n {% for doc in documents %}\n {{ doc.content }}\n {% endfor %}\n Question: {{query}}\n Answer:\n \"\"\"\n\nrag_pipe = Pipeline()\nrag_pipe.add_component(\n \"embedder\",\n SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),\n)\nrag_pipe.add_component(\n \"retriever\",\n MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),\n)\nrag_pipe.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template))\nrag_pipe.add_component(\n \"llm\",\n HuggingFaceAPIGenerator(\n api_type=\"serverless_inference_api\",\n api_params={\"model\": GENERATION_MODEL_ID},\n token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,\n ),\n)\nrag_pipe.add_component(\"answer_builder\", AnswerBuilder())\nrag_pipe.connect(\"embedder.embedding\", \"retriever\")\nrag_pipe.connect(\"retriever\", \"prompt_builder.documents\")\nrag_pipe.connect(\"prompt_builder\", \"llm\")\nrag_pipe.connect(\"llm.replies\", \"answer_builder.replies\")\nrag_pipe.connect(\"llm.meta\", \"answer_builder.meta\")\nrag_pipe.connect(\"retriever\", \"answer_builder.documents\")\nrag_res = rag_pipe.run(\n {\n \"embedder\": {\"text\": QUESTION},\n \"prompt_builder\": {\"query\": QUESTION},\n \"answer_builder\": {\"query\": QUESTION},\n }\n)\nfrom haystack.components.builders import AnswerBuilder from haystack.components.builders.prompt_builder import PromptBuilder from haystack.components.generators import HuggingFaceAPIGenerator from haystack.utils import Secret prompt_template = \"\"\" Given these documents, answer the question. Documents: {% for doc in documents %} {{ doc.content }} {% endfor %} Question: {{query}} Answer: \"\"\" rag_pipe = Pipeline() rag_pipe.add_component( \"embedder\", SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID), ) rag_pipe.add_component( \"retriever\", MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K), ) rag_pipe.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template)) rag_pipe.add_component( \"llm\", HuggingFaceAPIGenerator( api_type=\"serverless_inference_api\", api_params={\"model\": GENERATION_MODEL_ID}, token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None, ), ) rag_pipe.add_component(\"answer_builder\", AnswerBuilder()) rag_pipe.connect(\"embedder.embedding\", \"retriever\") rag_pipe.connect(\"retriever\", \"prompt_builder.documents\") rag_pipe.connect(\"prompt_builder\", \"llm\") rag_pipe.connect(\"llm.replies\", \"answer_builder.replies\") rag_pipe.connect(\"llm.meta\", \"answer_builder.meta\") rag_pipe.connect(\"retriever\", \"answer_builder.documents\") rag_res = rag_pipe.run( { \"embedder\": {\"text\": QUESTION}, \"prompt_builder\": {\"query\": QUESTION}, \"answer_builder\": {\"query\": QUESTION}, } )
Batches: 0%| | 0/1 [00:00<?, ?it/s]
/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n warnings.warn(\n
Below we print out the RAG results. If you have used ExportType.DOC_CHUNKS
, notice how the sources contain document-level grounding (e.g. page number or bounding box information):
from docling.chunking import DocChunk\n\nprint(f\"Question:\\n{QUESTION}\\n\")\nprint(f\"Answer:\\n{rag_res['answer_builder']['answers'][0].data.strip()}\\n\")\nprint(\"Sources:\")\nsources = rag_res[\"answer_builder\"][\"answers\"][0].documents\nfor source in sources:\n if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n doc_chunk = DocChunk.model_validate(source.meta[\"dl_meta\"])\n print(f\"- text: {doc_chunk.text!r}\")\n if doc_chunk.meta.origin:\n print(f\" file: {doc_chunk.meta.origin.filename}\")\n if doc_chunk.meta.headings:\n print(f\" section: {' / '.join(doc_chunk.meta.headings)}\")\n bbox = doc_chunk.meta.doc_items[0].prov[0].bbox\n print(\n f\" page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, \"\n f\"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]\"\n )\n elif EXPORT_TYPE == ExportType.MARKDOWN:\n print(repr(source.content))\n else:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\nfrom docling.chunking import DocChunk print(f\"Question:\\n{QUESTION}\\n\") print(f\"Answer:\\n{rag_res['answer_builder']['answers'][0].data.strip()}\\n\") print(\"Sources:\") sources = rag_res[\"answer_builder\"][\"answers\"][0].documents for source in sources: if EXPORT_TYPE == ExportType.DOC_CHUNKS: doc_chunk = DocChunk.model_validate(source.meta[\"dl_meta\"]) print(f\"- text: {doc_chunk.text!r}\") if doc_chunk.meta.origin: print(f\" file: {doc_chunk.meta.origin.filename}\") if doc_chunk.meta.headings: print(f\" section: {' / '.join(doc_chunk.meta.headings)}\") bbox = doc_chunk.meta.doc_items[0].prov[0].bbox print( f\" page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, \" f\"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]\" ) elif EXPORT_TYPE == ExportType.MARKDOWN: print(repr(source.content)) else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")
Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nThe main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, while TableFormer is a state-of-the-art table structure recognition model. These models are provided with pre-trained weights and a separate package for the inference code as docling-ibm-models. They are also used in the open-access deepsearch-experience, a cloud-native service for knowledge exploration tasks. Additionally, Docling plans to extend its model library with a figure-classifier model, an equation-recognition model, a code-recognition model, and more in the future.\n\nSources:\n- text: 'As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.'\n file: 2408.09869v5.pdf\n section: 3.2 AI models\n page: 3, bounding box: [107, 406, 504, 330]\n- text: 'Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.'\n file: 2408.09869v5.pdf\n section: 3 Processing pipeline\n page: 2, bounding box: [107, 273, 504, 176]\n- text: 'Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.\\nWe encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.'\n section: 6 Future work and contributions\n page: 5, bounding box: [106, 323, 504, 258]\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/rag_haystack/#rag-with-haystack","title":"RAG with Haystack\u00b6","text":""},{"location":"examples/rag_haystack/#overview","title":"Overview\u00b6","text":""},{"location":"examples/rag_haystack/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_haystack/#indexing-pipeline","title":"Indexing pipeline\u00b6","text":""},{"location":"examples/rag_haystack/#rag-pipeline","title":"RAG pipeline\u00b6","text":""},{"location":"examples/rag_langchain/","title":"RAG with LangChain","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote
This example leverages the LangChain Docling integration, along with a Milvus vector store, as well as sentence-transformers embeddings.
The presented DoclingLoader
component enables you to:
DoclingLoader
supports two different export modes:
ExportType.MARKDOWN
: if you want to capture each input document as a separate LangChain document, orExportType.DOC_CHUNKS
(default): if you want to have each input document chunked and then capture each individual chunk as a separate LangChain document downstream. The example allows exploring both modes via parameter EXPORT_TYPE
; depending on the value set, the example pipeline is then set up accordingly.
HF_TOKEN
.--no-warn-conflicts
meant for Colab's pre-populated Python env; feel free to remove for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom dotenv import load_dotenv\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_docling.loader import ExportType\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nFILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nEXPORT_TYPE = ExportType.DOC_CHUNKS\nQUESTION = \"Which are the main AI models in Docling?\"\nPROMPT = PromptTemplate.from_template(\n \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n)\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nimport os from pathlib import Path from tempfile import mkdtemp from dotenv import load_dotenv from langchain_core.prompts import PromptTemplate from langchain_docling.loader import ExportType def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") FILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" EXPORT_TYPE = ExportType.DOC_CHUNKS QUESTION = \"Which are the main AI models in Docling?\" PROMPT = PromptTemplate.from_template( \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\", ) TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied!
from langchain_docling import DoclingLoader\n\nfrom docling.chunking import HybridChunker\n\nloader = DoclingLoader(\n file_path=FILE_PATH,\n export_type=EXPORT_TYPE,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n)\n\ndocs = loader.load()\nfrom langchain_docling import DoclingLoader from docling.chunking import HybridChunker loader = DoclingLoader( file_path=FILE_PATH, export_type=EXPORT_TYPE, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ) docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n
Note: a message saying \"Token indices sequence length is longer than the specified maximum sequence length...\"
can be ignored in this case \u2014 details here.
Determining the splits:
In\u00a0[4]: Copied!if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n splits = docs\nelif EXPORT_TYPE == ExportType.MARKDOWN:\n from langchain_text_splitters import MarkdownHeaderTextSplitter\n\n splitter = MarkdownHeaderTextSplitter(\n headers_to_split_on=[\n (\"#\", \"Header_1\"),\n (\"##\", \"Header_2\"),\n (\"###\", \"Header_3\"),\n ],\n )\n splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]\nelse:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\nif EXPORT_TYPE == ExportType.DOC_CHUNKS: splits = docs elif EXPORT_TYPE == ExportType.MARKDOWN: from langchain_text_splitters import MarkdownHeaderTextSplitter splitter = MarkdownHeaderTextSplitter( headers_to_split_on=[ (\"#\", \"Header_1\"), (\"##\", \"Header_2\"), (\"###\", \"Header_3\"), ], ) splits = [split for doc in docs for split in splitter.split_text(doc.page_content)] else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")
Inspecting some sample splits:
In\u00a0[5]: Copied!for d in splits[:3]:\n print(f\"- {d.page_content=}\")\nprint(\"...\")\nfor d in splits[:3]: print(f\"- {d.page_content=}\") print(\"...\")
- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'\n- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R\u00a8uschlikon, Switzerland'\n- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n...\nIn\u00a0[6]: Copied!
import json\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom langchain_huggingface.embeddings import HuggingFaceEmbeddings\nfrom langchain_milvus import Milvus\n\nembedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n\n\nmilvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\nvectorstore = Milvus.from_documents(\n documents=splits,\n embedding=embedding,\n collection_name=\"docling_demo\",\n connection_args={\"uri\": milvus_uri},\n index_params={\"index_type\": \"FLAT\"},\n drop_old=True,\n)\nimport json from pathlib import Path from tempfile import mkdtemp from langchain_huggingface.embeddings import HuggingFaceEmbeddings from langchain_milvus import Milvus embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID) milvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed vectorstore = Milvus.from_documents( documents=splits, embedding=embedding, collection_name=\"docling_demo\", connection_args={\"uri\": milvus_uri}, index_params={\"index_type\": \"FLAT\"}, drop_old=True, ) In\u00a0[7]: Copied!
from langchain.chains import create_retrieval_chain\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain_huggingface import HuggingFaceEndpoint\n\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\nllm = HuggingFaceEndpoint(\n repo_id=GEN_MODEL_ID,\n huggingfacehub_api_token=HF_TOKEN,\n)\n\n\ndef clip_text(text, threshold=100):\n return f\"{text[:threshold]}...\" if len(text) > threshold else text\nfrom langchain.chains import create_retrieval_chain from langchain.chains.combine_documents import create_stuff_documents_chain from langchain_huggingface import HuggingFaceEndpoint retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K}) llm = HuggingFaceEndpoint( repo_id=GEN_MODEL_ID, huggingfacehub_api_token=HF_TOKEN, ) def clip_text(text, threshold=100): return f\"{text[:threshold]}...\" if len(text) > threshold else text
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\nIn\u00a0[8]: Copied!
question_answer_chain = create_stuff_documents_chain(llm, PROMPT)\nrag_chain = create_retrieval_chain(retriever, question_answer_chain)\nresp_dict = rag_chain.invoke({\"input\": QUESTION})\n\nclipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\nprint(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\nfor i, doc in enumerate(resp_dict[\"context\"]):\n print()\n print(f\"Source {i + 1}:\")\n print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n for key in doc.metadata:\n if key != \"pk\":\n val = doc.metadata.get(key)\n clipped_val = clip_text(val) if isinstance(val, str) else val\n print(f\" {key}: {clipped_val}\")\nquestion_answer_chain = create_stuff_documents_chain(llm, PROMPT) rag_chain = create_retrieval_chain(retriever, question_answer_chain) resp_dict = rag_chain.invoke({\"input\": QUESTION}) clipped_answer = clip_text(resp_dict[\"answer\"], threshold=200) print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\") for i, doc in enumerate(resp_dict[\"context\"]): print() print(f\"Source {i + 1}:\") print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\") for key in doc.metadata: if key != \"pk\": val = doc.metadata.get(key) clipped_val = clip_text(val) if isinstance(val, str) else val print(f\" {key}: {clipped_val}\")
Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nDocling initially releases two AI models, a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art tab...\n\nSource 1:\n text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n\nSource 2:\n text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n\nSource 3:\n text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/rag_langchain/#rag-with-langchain","title":"RAG with LangChain\u00b6","text":""},{"location":"examples/rag_langchain/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_langchain/#document-loading","title":"Document loading\u00b6","text":"
Now we can instantiate our loader and load documents.
"},{"location":"examples/rag_langchain/#ingestion","title":"Ingestion\u00b6","text":""},{"location":"examples/rag_langchain/#rag","title":"RAG\u00b6","text":""},{"location":"examples/rag_llamaindex/","title":"RAG with LlamaIndex","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 RemoteThis example leverages the official LlamaIndex Docling extension.
The presented extensions DoclingReader
and DoclingNodeParser
enable you to:
HF_TOKEN
.--no-warn-conflicts
meant for Colab's pre-populated Python env; feel free to remove for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\nfrom warnings import filterwarnings\n\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\nfilterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\nfilterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\nimport os from pathlib import Path from tempfile import mkdtemp from warnings import filterwarnings from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\") filterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\") # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"
We can now define the main parameters:
In\u00a0[3]: Copied!from llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\nEMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nGEN_MODEL = HuggingFaceInferenceAPI(\n token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n)\nSOURCE = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report\nQUERY = \"Which are the main AI models in Docling?\"\n\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI EMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\") MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") GEN_MODEL = HuggingFaceInferenceAPI( token=_get_env_from_colab_or_os(\"HF_TOKEN\"), model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", ) SOURCE = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report QUERY = \"Which are the main AI models in Docling?\" embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))
To create a simple RAG pipeline, we can:
DoclingReader
, which by default exports to Markdown, andMarkdownNodeParser
from llama_index.core import StorageContext, VectorStoreIndex\nfrom llama_index.core.node_parser import MarkdownNodeParser\nfrom llama_index.readers.docling import DoclingReader\nfrom llama_index.vector_stores.milvus import MilvusVectorStore\n\nreader = DoclingReader()\nnode_parser = MarkdownNodeParser()\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\nfrom llama_index.core import StorageContext, VectorStoreIndex from llama_index.core.node_parser import MarkdownNodeParser from llama_index.readers.docling import DoclingReader from llama_index.vector_stores.milvus import MilvusVectorStore reader = DoclingReader() node_parser = MarkdownNodeParser() vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes])
Q: Which are the main AI models in Docling?\nA: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.\n\nSources:\n
[('3.2 AI models\\n\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'Header_2': '3.2 AI models'}),\n (\"5 Applications\\n\\nThanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.\",\n {'Header_2': '5 Applications'})]
To leverage Docling's rich native format, we:
DoclingReader
with JSON export type, andDoclingNodeParser
in order to appropriately parse that Docling format.Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):
In\u00a0[5]: Copied!from llama_index.node_parser.docling import DoclingNodeParser\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\nnode_parser = DoclingNodeParser()\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\nfrom llama_index.node_parser.docling import DoclingNodeParser reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) node_parser = DoclingNodeParser() vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes])
Q: Which are the main AI models in Docling?\nA: The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table structure recognition model.\n\nSources:\n
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/34',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 3,\n 'bbox': {'l': 107.07593536376953,\n 't': 406.1695251464844,\n 'r': 504.1148681640625,\n 'b': 330.2677307128906,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 608]}]}],\n 'headings': ['3.2 AI models'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869v3.pdf'}}),\n ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/9',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 1,\n 'bbox': {'l': 107.0031967163086,\n 't': 136.7283935546875,\n 'r': 504.04998779296875,\n 'b': 83.30133056640625,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 488]}]}],\n 'headings': ['1 Introduction'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869v3.pdf'}})]
To demonstrate this usage pattern, we first set up a test document directory.
In\u00a0[6]: Copied!from pathlib import Path\nfrom tempfile import mkdtemp\n\nimport requests\n\ntmp_dir_path = Path(mkdtemp())\nr = requests.get(SOURCE)\nwith open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n out_file.write(r.content)\nfrom pathlib import Path from tempfile import mkdtemp import requests tmp_dir_path = Path(mkdtemp()) r = requests.get(SOURCE) with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file: out_file.write(r.content)
Using the reader
and node_parser
definitions from any of the above variants, usage with SimpleDirectoryReader
then looks as follows:
from llama_index.core import SimpleDirectoryReader\n\ndir_reader = SimpleDirectoryReader(\n input_dir=tmp_dir_path,\n file_extractor={\".pdf\": reader},\n)\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=dir_reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\nfrom llama_index.core import SimpleDirectoryReader dir_reader = SimpleDirectoryReader( input_dir=tmp_dir_path, file_extractor={\".pdf\": reader}, ) vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=dir_reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes])
Loading files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:11<00:00, 11.27s/file]\n
Q: Which are the main AI models in Docling?\nA: 1. A layout analysis model, an accurate object-detector for page elements. 2. TableFormer, a state-of-the-art table structure recognition model.\n\nSources:\n
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n 'file_name': '2408.09869.pdf',\n 'file_type': 'application/pdf',\n 'file_size': 5566574,\n 'creation_date': '2024-10-28',\n 'last_modified_date': '2024-10-28',\n 'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/34',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 3,\n 'bbox': {'l': 107.07593536376953,\n 't': 406.1695251464844,\n 'r': 504.1148681640625,\n 'b': 330.2677307128906,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 608]}]}],\n 'headings': ['3.2 AI models'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869.pdf'}}),\n ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n 'file_name': '2408.09869.pdf',\n 'file_type': 'application/pdf',\n 'file_size': 5566574,\n 'creation_date': '2024-10-28',\n 'last_modified_date': '2024-10-28',\n 'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/9',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 1,\n 'bbox': {'l': 107.0031967163086,\n 't': 136.7283935546875,\n 'r': 504.04998779296875,\n 'b': 83.30133056640625,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 488]}]}],\n 'headings': ['1 Introduction'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869.pdf'}})]In\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/rag_llamaindex/#rag-with-llamaindex","title":"RAG with LlamaIndex\u00b6","text":""},{"location":"examples/rag_llamaindex/#overview","title":"Overview\u00b6","text":""},{"location":"examples/rag_llamaindex/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_llamaindex/#using-markdown-export","title":"Using Markdown export\u00b6","text":""},{"location":"examples/rag_llamaindex/#using-docling-format","title":"Using Docling format\u00b6","text":""},{"location":"examples/rag_llamaindex/#with-simple-directory-reader","title":"With Simple Directory Reader\u00b6","text":""},{"location":"examples/rag_milvus/","title":"RAG with Milvus","text":"In\u00a0[\u00a0]: Copied!
! pip install --upgrade pymilvus docling openai torch\n! pip install --upgrade pymilvus docling openai torch
If you are using Google Colab, you may need to restart the runtime to enable the newly installed dependencies (click the "Runtime" menu at the top of the screen and select "Restart session" from the dropdown menu).
Part of what makes Docling so remarkable is that it can run on commodity hardware, which means this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with Apple silicon, Docling integrates with Metal Performance Shaders (MPS), which provides out-of-the-box GPU acceleration on macOS, works with PyTorch and TensorFlow, offers energy-efficient performance on Apple silicon, and is compatible with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
In\u00a0[1]: Copied!import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\nimport torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" )
MPS GPU is enabled.\nIn\u00a0[2]: Copied!
import os\n\nos.environ[\"OPENAI_API_KEY\"] = \"sk-***********\"\nimport os os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\" In\u00a0[3]: Copied!
from openai import OpenAI\n\nopenai_client = OpenAI()\nfrom openai import OpenAI openai_client = OpenAI()
Define a function to generate text embeddings using the OpenAI client. We use the text-embedding-3-small model as an example.
In\u00a0[4]: Copied!def emb_text(text):\n return (\n openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\")\n .data[0]\n .embedding\n )\ndef emb_text(text): return ( openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\") .data[0] .embedding )
Generate a test embedding and print its dimension and first few elements.
In\u00a0[5]: Copied!test_embedding = emb_text(\"This is a test\")\nembedding_dim = len(test_embedding)\nprint(embedding_dim)\nprint(test_embedding[:10])\ntest_embedding = emb_text(\"This is a test\") embedding_dim = len(test_embedding) print(embedding_dim) print(test_embedding[:10])
1536\n[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]\n
In this tutorial, we will use a Markdown file (source) as the input. We will process the document using a HierarchicalChunker provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks.
In\u00a0[6]: Copied!from docling_core.transforms.chunker import HierarchicalChunker\n\nfrom docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\nchunker = HierarchicalChunker()\n\n# Convert the input file to Docling Document\nsource = \"https://milvus.io/docs/overview.md\"\ndoc = converter.convert(source).document\n\n# Perform hierarchical chunking\ntexts = [chunk.text for chunk in chunker.chunk(doc)]\nfrom docling_core.transforms.chunker import HierarchicalChunker from docling.document_converter import DocumentConverter converter = DocumentConverter() chunker = HierarchicalChunker() # Convert the input file to Docling Document source = \"https://milvus.io/docs/overview.md\" doc = converter.convert(source).document # Perform hierarchical chunking texts = [chunk.text for chunk in chunker.chunk(doc)] In\u00a0[7]: Copied!
from pymilvus import MilvusClient\n\nmilvus_client = MilvusClient(uri=\"./milvus_demo.db\")\ncollection_name = \"my_rag_collection\"\nfrom pymilvus import MilvusClient milvus_client = MilvusClient(uri=\"./milvus_demo.db\") collection_name = \"my_rag_collection\"
As for the argument of MilvusClient:
Setting the uri as a local file, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
If you have a large scale of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, please use the server address, e.g. http://localhost:19530, as your uri.
If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.
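The following minimal sketch is not part of the original notebook; it only illustrates the three connection options described above. The localhost address and the Zilliz placeholders are assumptions to be replaced with your own values.

from pymilvus import MilvusClient

# Milvus Lite: store all data in a local file (what this tutorial uses)
milvus_client = MilvusClient(uri="./milvus_demo.db")

# Self-hosted Milvus server, e.g. started via Docker or Kubernetes
# milvus_client = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud: pass your cluster's Public Endpoint and API key (placeholders below)
# milvus_client = MilvusClient(uri="<PUBLIC_ENDPOINT>", token="<API_KEY>")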
Check if the collection already exists and drop it if it does.
In\u00a0[8]: Copied!if milvus_client.has_collection(collection_name):\n milvus_client.drop_collection(collection_name)\nif milvus_client.has_collection(collection_name): milvus_client.drop_collection(collection_name)
Create a new collection with specified parameters.
If we don\u2019t specify any field information, Milvus will automatically create a default id
field for the primary key, and a vector
field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.
milvus_client.create_collection(\n collection_name=collection_name,\n dimension=embedding_dim,\n metric_type=\"IP\", # Inner product distance\n consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.\n)\nmilvus_client.create_collection( collection_name=collection_name, dimension=embedding_dim, metric_type=\"IP\", # Inner product distance consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details. ) In\u00a0[10]: Copied!
from tqdm import tqdm\n\ndata = []\n\nfor i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")):\n embedding = emb_text(chunk)\n data.append({\"id\": i, \"vector\": embedding, \"text\": chunk})\n\nmilvus_client.insert(collection_name=collection_name, data=data)\nfrom tqdm import tqdm data = [] for i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")): embedding = emb_text(chunk) data.append({\"id\": i, \"vector\": embedding, \"text\": chunk}) milvus_client.insert(collection_name=collection_name, data=data)
Processing chunks: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 38/38 [00:14<00:00, 2.59it/s]\nOut[10]:
{'insert_count': 38, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], 'cost': 0}In\u00a0[11]: Copied!
question = (\n \"What are the three deployment modes of Milvus, and what are their differences?\"\n)\nquestion = ( \"What are the three deployment modes of Milvus, and what are their differences?\" )
Search for the question in the collection and retrieve the semantic top-3 matches.
In\u00a0[12]: Copied!search_res = milvus_client.search(\n collection_name=collection_name,\n data=[emb_text(question)],\n limit=3,\n search_params={\"metric_type\": \"IP\", \"params\": {}},\n output_fields=[\"text\"],\n)\nsearch_res = milvus_client.search( collection_name=collection_name, data=[emb_text(question)], limit=3, search_params={\"metric_type\": \"IP\", \"params\": {}}, output_fields=[\"text\"], )
Let\u2019s take a look at the search results of the query
In\u00a0[13]: Copied!import json\n\nretrieved_lines_with_distances = [\n (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n]\nprint(json.dumps(retrieved_lines_with_distances, indent=4))\nimport json retrieved_lines_with_distances = [ (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0] ] print(json.dumps(retrieved_lines_with_distances, indent=4))
[\n [\n \"Milvus offers three deployment modes, covering a wide range of data scales\\u2014from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors:\",\n 0.6503315567970276\n ],\n [\n \"Milvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it\\u2019s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more.\",\n 0.6281915903091431\n ],\n [\n \"What is Milvus?\\nUnstructured Data, Embeddings, and Milvus\\nWhat Makes Milvus so Fast\\uff1f\\nWhat Makes Milvus so Scalable\\nTypes of Searches Supported by Milvus\\nComprehensive Feature Set\",\n 0.6117826700210571\n ]\n]\nIn\u00a0[14]: Copied!
context = \"\\n\".join(\n [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n)\ncontext = \"\\n\".join( [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances] )
Define system and user prompts for the Language Model. This prompt is assembled with the documents retrieved from Milvus.
In\u00a0[16]: Copied!SYSTEM_PROMPT = \"\"\"\nHuman: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n\"\"\"\nUSER_PROMPT = f\"\"\"\nUse the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n<context>\n{context}\n</context>\n<question>\n{question}\n</question>\n\"\"\"\nSYSTEM_PROMPT = \"\"\" Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided. \"\"\" USER_PROMPT = f\"\"\" Use the following pieces of information enclosed in tags to provide an answer to the question enclosed in tags. {context} {question} \"\"\"
Use OpenAI ChatGPT to generate a response based on the prompts.
In\u00a0[17]: Copied!response = openai_client.chat.completions.create(\n model=\"gpt-4o\",\n messages=[\n {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n {\"role\": \"user\", \"content\": USER_PROMPT},\n ],\n)\nprint(response.choices[0].message.content)\nresponse = openai_client.chat.completions.create( model=\"gpt-4o\", messages=[ {\"role\": \"system\", \"content\": SYSTEM_PROMPT}, {\"role\": \"user\", \"content\": USER_PROMPT}, ], ) print(response.choices[0].message.content)
The three deployment modes of Milvus are:\n\n1. **Milvus Lite**: This is a Python library that integrates easily into your applications. It's a lightweight version ideal for quick prototyping in Jupyter Notebooks or for running on edge devices with limited resources.\n\n2. **Milvus Standalone**: This mode is a single-machine server deployment where all components are bundled into a single Docker image, making it convenient to deploy.\n\n3. **Milvus Distributed**: This mode is designed for deployment on Kubernetes clusters. It features a cloud-native architecture suited for managing scenarios at a billion-scale or larger, ensuring redundancy in critical components.\n"},{"location":"examples/rag_milvus/#rag-with-milvus","title":"RAG with Milvus\u00b6","text":"Step Tech Execution Embedding OpenAI (text-embedding-3-small) \ud83c\udf10 Remote Vector store Milvus \ud83d\udcbb Local Gen AI OpenAI (gpt-4o) \ud83c\udf10 Remote"},{"location":"examples/rag_milvus/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"
This is a code recipe that uses Milvus, the world's most advanced open-source vector database, to perform RAG over documents parsed by Docling.
In this notebook, we accomplish the following:
Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running this notebook:
To start, install the required dependencies by running the following command:
"},{"location":"examples/rag_milvus/#gpu-checking","title":"GPU Checking\u00b6","text":""},{"location":"examples/rag_milvus/#setting-up-api-keys","title":"Setting Up API Keys\u00b6","text":"We will use OpenAI as the LLM in this example. You should prepare the OPENAI_API_KEY as an environment variable.
"},{"location":"examples/rag_milvus/#prepare-the-llm-and-embedding-model","title":"Prepare the LLM and Embedding Model\u00b6","text":"We initialize the OpenAI client to prepare the embedding model.
"},{"location":"examples/rag_milvus/#process-data-using-docling","title":"Process Data Using Docling\u00b6","text":"Docling can parse various document formats into a unified representation (Docling Document), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to the official documentation.
"},{"location":"examples/rag_milvus/#load-data-into-milvus","title":"Load Data into Milvus\u00b6","text":""},{"location":"examples/rag_milvus/#create-the-collection","title":"Create the collection\u00b6","text":"With data in hand, we can create a MilvusClient
instance and insert the data into a Milvus collection.
Let\u2019s specify a query question about the website we just scraped.
"},{"location":"examples/rag_milvus/#use-llm-to-get-a-rag-response","title":"Use LLM to get a RAG response\u00b6","text":"Convert the retrieved documents into a string format.
"},{"location":"examples/rag_weaviate/","title":"RAG with Weaviate","text":"Step Tech Execution Embedding Open AI \ud83c\udf10 Remote Vector store Weavieate \ud83d\udcbb Local Gen AI Open AI \ud83c\udf10 Remote In\u00a0[\u00a0]: Copied!%%capture\n%pip install docling~=\"2.7.0\"\n%pip install -U weaviate-client~=\"4.9.4\"\n%pip install rich\n%pip install torch\n\nimport logging\nimport warnings\n\nwarnings.filterwarnings(\"ignore\")\n\n# Suppress Weaviate client logs\nlogging.getLogger(\"weaviate\").setLevel(logging.ERROR)\n%%capture %pip install docling~=\"2.7.0\" %pip install -U weaviate-client~=\"4.9.4\" %pip install rich %pip install torch import logging import warnings warnings.filterwarnings(\"ignore\") # Suppress Weaviate client logs logging.getLogger(\"weaviate\").setLevel(logging.ERROR) In\u00a0[2]: Copied!
import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\nimport torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" )
MPS GPU is enabled.\n
Here, we've collected 10 influential machine learning papers published as PDFs on arXiv. Because Docling does not yet have title extraction for PDFs, we manually add the titles in a corresponding list.
Note: Converting all 10 papers should take around 8 minutes with a T4 GPU.
In\u00a0[3]: Copied!# Influential machine learning papers\nsource_urls = [\n \"https://arxiv.org/pdf/1706.03762\",\n \"https://arxiv.org/pdf/1810.04805\",\n \"https://arxiv.org/pdf/1406.2661\",\n \"https://arxiv.org/pdf/1409.0473\",\n \"https://arxiv.org/pdf/1412.6980\",\n \"https://arxiv.org/pdf/1312.6114\",\n \"https://arxiv.org/pdf/1312.5602\",\n \"https://arxiv.org/pdf/1512.03385\",\n \"https://arxiv.org/pdf/1409.3215\",\n \"https://arxiv.org/pdf/1301.3781\",\n]\n\n# And their corresponding titles (because Docling doesn't have title extraction yet!)\nsource_titles = [\n \"Attention Is All You Need\",\n \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\",\n \"Generative Adversarial Nets\",\n \"Neural Machine Translation by Jointly Learning to Align and Translate\",\n \"Adam: A Method for Stochastic Optimization\",\n \"Auto-Encoding Variational Bayes\",\n \"Playing Atari with Deep Reinforcement Learning\",\n \"Deep Residual Learning for Image Recognition\",\n \"Sequence to Sequence Learning with Neural Networks\",\n \"A Neural Probabilistic Language Model\",\n]\n# Influential machine learning papers source_urls = [ \"https://arxiv.org/pdf/1706.03762\", \"https://arxiv.org/pdf/1810.04805\", \"https://arxiv.org/pdf/1406.2661\", \"https://arxiv.org/pdf/1409.0473\", \"https://arxiv.org/pdf/1412.6980\", \"https://arxiv.org/pdf/1312.6114\", \"https://arxiv.org/pdf/1312.5602\", \"https://arxiv.org/pdf/1512.03385\", \"https://arxiv.org/pdf/1409.3215\", \"https://arxiv.org/pdf/1301.3781\", ] # And their corresponding titles (because Docling doesn't have title extraction yet!) source_titles = [ \"Attention Is All You Need\", \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\", \"Generative Adversarial Nets\", \"Neural Machine Translation by Jointly Learning to Align and Translate\", \"Adam: A Method for Stochastic Optimization\", \"Auto-Encoding Variational Bayes\", \"Playing Atari with Deep Reinforcement Learning\", \"Deep Residual Learning for Image Recognition\", \"Sequence to Sequence Learning with Neural Networks\", \"A Neural Probabilistic Language Model\", ] In\u00a0[4]: Copied!
from docling.document_converter import DocumentConverter\n\n# Instantiate the doc converter\ndoc_converter = DocumentConverter()\n\n# Directly pass list of files or streams to `convert_all`\nconv_results_iter = doc_converter.convert_all(source_urls) # previously `convert`\n\n# Iterate over the generator to get a list of Docling documents\ndocs = [result.document for result in conv_results_iter]\nfrom docling.document_converter import DocumentConverter # Instantiate the doc converter doc_converter = DocumentConverter() # Directly pass list of files or streams to `convert_all` conv_results_iter = doc_converter.convert_all(source_urls) # previously `convert` # Iterate over the generator to get a list of Docling documents docs = [result.document for result in conv_results_iter]
Fetching 9 files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 9/9 [00:00<00:00, 84072.91it/s]\n
ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS\nIn\u00a0[5]: Copied!
from docling_core.transforms.chunker import HierarchicalChunker\n\n# Initialize lists for text, and titles\ntexts, titles = [], []\n\nchunker = HierarchicalChunker()\n\n# Process each document in the list\nfor doc, title in zip(docs, source_titles): # Pair each document with its title\n chunks = list(\n chunker.chunk(doc)\n ) # Perform hierarchical chunking and get text from chunks\n for chunk in chunks:\n texts.append(chunk.text)\n titles.append(title)\nfrom docling_core.transforms.chunker import HierarchicalChunker # Initialize lists for text, and titles texts, titles = [], [] chunker = HierarchicalChunker() # Process each document in the list for doc, title in zip(docs, source_titles): # Pair each document with its title chunks = list( chunker.chunk(doc) ) # Perform hierarchical chunking and get text from chunks for chunk in chunks: texts.append(chunk.text) titles.append(title)
Because we're splitting the documents into chunks, we'll concatenate the article title to the beginning of each chunk for additional context.
In\u00a0[6]: Copied!# Concatenate title and text\nfor i in range(len(texts)):\n texts[i] = f\"{titles[i]} {texts[i]}\"\n# Concatenate title and text for i in range(len(texts)): texts[i] = f\"{titles[i]} {texts[i]}\"
We'll be using the OpenAI API both for generating the text embeddings and for the generative model in our RAG pipeline. The code below dynamically fetches your API key depending on whether you're running this notebook in Google Colab or as a regular Jupyter notebook. All you need to do is replace openai_api_key_var
with the name of your environment variable or Colab secret for the API key.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
In\u00a0[7]: Copied!# OpenAI API key variable name\nopenai_api_key_var = \"OPENAI_API_KEY\" # Replace with the name of your secret/env var\n\n# Fetch OpenAI API key\ntry:\n # If running in Colab, fetch API key from Secrets\n import google.colab\n from google.colab import userdata\n\n openai_api_key = userdata.get(openai_api_key_var)\n if not openai_api_key:\n raise ValueError(f\"Secret '{openai_api_key_var}' not found in Colab secrets.\")\nexcept ImportError:\n # If not running in Colab, fetch API key from environment variable\n import os\n\n openai_api_key = os.getenv(openai_api_key_var)\n if not openai_api_key:\n raise OSError(\n f\"Environment variable '{openai_api_key_var}' is not set. \"\n \"Please define it before running this script.\"\n )\n# OpenAI API key variable name openai_api_key_var = \"OPENAI_API_KEY\" # Replace with the name of your secret/env var # Fetch OpenAI API key try: # If running in Colab, fetch API key from Secrets import google.colab from google.colab import userdata openai_api_key = userdata.get(openai_api_key_var) if not openai_api_key: raise ValueError(f\"Secret '{openai_api_key_var}' not found in Colab secrets.\") except ImportError: # If not running in Colab, fetch API key from environment variable import os openai_api_key = os.getenv(openai_api_key_var) if not openai_api_key: raise OSError( f\"Environment variable '{openai_api_key_var}' is not set. \" \"Please define it before running this script.\" )
Embedded Weaviate allows you to spin up a Weaviate instance directly from your application code, without having to use a Docker container. If you're interested in other deployment methods, like using Docker-Compose or Kubernetes, check out this page in the Weaviate docs.
In\u00a0[\u00a0]: Copied!import weaviate\n\n# Connect to Weaviate embedded\nclient = weaviate.connect_to_embedded(headers={\"X-OpenAI-Api-Key\": openai_api_key})\nimport weaviate # Connect to Weaviate embedded client = weaviate.connect_to_embedded(headers={\"X-OpenAI-Api-Key\": openai_api_key}) In\u00a0[\u00a0]: Copied!
import weaviate.classes.config as wc\n\n# Define the collection name\ncollection_name = \"docling\"\n\n# Delete the collection if it already exists\nif client.collections.exists(collection_name):\n client.collections.delete(collection_name)\n\n# Create the collection\ncollection = client.collections.create(\n name=collection_name,\n vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(\n model=\"text-embedding-3-large\", # Specify your embedding model here\n ),\n # Enable generative model from OpenAI\n generative_config=wc.Configure.Generative.openai(\n model=\"gpt-4o\" # Specify your generative model for RAG here\n ),\n # Define properties of metadata\n properties=[\n wc.Property(name=\"text\", data_type=wc.DataType.TEXT),\n wc.Property(name=\"title\", data_type=wc.DataType.TEXT, skip_vectorization=True),\n ],\n)\nimport weaviate.classes.config as wc # Define the collection name collection_name = \"docling\" # Delete the collection if it already exists if client.collections.exists(collection_name): client.collections.delete(collection_name) # Create the collection collection = client.collections.create( name=collection_name, vectorizer_config=wc.Configure.Vectorizer.text2vec_openai( model=\"text-embedding-3-large\", # Specify your embedding model here ), # Enable generative model from OpenAI generative_config=wc.Configure.Generative.openai( model=\"gpt-4o\" # Specify your generative model for RAG here ), # Define properties of metadata properties=[ wc.Property(name=\"text\", data_type=wc.DataType.TEXT), wc.Property(name=\"title\", data_type=wc.DataType.TEXT, skip_vectorization=True), ], ) In\u00a0[10]: Copied!
# Initialize the data object\ndata = []\n\n# Create a dictionary for each row by iterating through the corresponding lists\nfor text, title in zip(texts, titles):\n data_point = {\n \"text\": text,\n \"title\": title,\n }\n data.append(data_point)\n# Initialize the data object data = [] # Create a dictionary for each row by iterating through the corresponding lists for text, title in zip(texts, titles): data_point = { \"text\": text, \"title\": title, } data.append(data_point) In\u00a0[\u00a0]: Copied!
# Insert text chunks and metadata into vector DB collection\nresponse = collection.data.insert_many(data)\n\nif response.has_errors:\n print(response.errors)\nelse:\n print(\"Insert complete.\")\n# Insert text chunks and metadata into vector DB collection response = collection.data.insert_many(data) if response.has_errors: print(response.errors) else: print(\"Insert complete.\") In\u00a0[12]: Copied!
from weaviate.classes.query import MetadataQuery\n\nresponse = collection.query.near_text(\n query=\"bert\",\n limit=2,\n return_metadata=MetadataQuery(distance=True),\n return_properties=[\"text\", \"title\"],\n)\n\nfor o in response.objects:\n print(o.properties)\n print(o.metadata.distance)\nfrom weaviate.classes.query import MetadataQuery response = collection.query.near_text( query=\"bert\", limit=2, return_metadata=MetadataQuery(distance=True), return_properties=[\"text\", \"title\"], ) for o in response.objects: print(o.properties) print(o.metadata.distance)
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding A distinctive feature of BERT is its unified architecture across different tasks. There is mini-', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}\n0.6578550338745117\n{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding We introduce a new language representation model called BERT , which stands for B idirectional E ncoder R epresentations from T ransformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}\n0.6696287989616394\nIn\u00a0[13]: Copied!
from rich.console import Console\nfrom rich.panel import Panel\n\n# Create a prompt where context from the Weaviate collection will be injected\nprompt = \"Explain how {text} works, using only the retrieved context.\"\nquery = \"bert\"\n\nresponse = collection.generate.near_text(\n query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"]\n)\n\n# Prettify the output using Rich\nconsole = Console()\n\nconsole.print(\n Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\")\n)\nconsole.print(\n Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\")\n)\nfrom rich.console import Console from rich.panel import Panel # Create a prompt where context from the Weaviate collection will be injected prompt = \"Explain how {text} works, using only the retrieved context.\" query = \"bert\" response = collection.generate.near_text( query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"] ) # Prettify the output using Rich console = Console() console.print( Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\") ) console.print( Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\") )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Explain how bert works, using only the retrieved context. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation \u2502\n\u2502 model designed to pretrain deep bidirectional representations from unlabeled text. It conditions on both left \u2502\n\u2502 and right context in all layers, unlike traditional left-to-right or right-to-left language models. This \u2502\n\u2502 pre-training involves two unsupervised tasks. The pre-trained BERT model can then be fine-tuned with just one \u2502\n\u2502 additional output layer to create state-of-the-art models for various tasks, such as question answering and \u2502\n\u2502 language inference, without needing substantial task-specific architecture modifications. A distinctive feature \u2502\n\u2502 of BERT is its unified architecture across different tasks. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nIn\u00a0[14]: Copied!
# Create a prompt where context from the Weaviate collection will be injected\nprompt = \"Explain how {text} works, using only the retrieved context.\"\nquery = \"a generative adversarial net\"\n\nresponse = collection.generate.near_text(\n query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"]\n)\n\n# Prettify the output using Rich\nconsole = Console()\n\nconsole.print(\n Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\")\n)\nconsole.print(\n Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\")\n)\n# Create a prompt where context from the Weaviate collection will be injected prompt = \"Explain how {text} works, using only the retrieved context.\" query = \"a generative adversarial net\" response = collection.generate.near_text( query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"] ) # Prettify the output using Rich console = Console() console.print( Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\") ) console.print( Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\") )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Explain how a generative adversarial net works, using only the retrieved context. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Generative Adversarial Nets (GANs) operate within an adversarial framework where two models are trained \u2502\n\u2502 simultaneously: a generative model (G) and a discriminative model (D). The generative model aims to capture the \u2502\n\u2502 data distribution and generate samples that mimic real data, while the discriminative model's task is to \u2502\n\u2502 distinguish between samples from the real data and those generated by G. This setup is akin to a game where the \u2502\n\u2502 generative model acts like counterfeiters trying to produce indistinguishable fake currency, and the \u2502\n\u2502 discriminative model acts like the police trying to detect these counterfeits. \u2502\n\u2502 \u2502\n\u2502 The training process involves a minimax two-player game where G tries to maximize the probability of D making a \u2502\n\u2502 mistake, while D tries to minimize it. When both models are defined by multilayer perceptrons, they can be \u2502\n\u2502 trained using backpropagation without the need for Markov chains or approximate inference networks. The \u2502\n\u2502 ultimate goal is for G to perfectly replicate the training data distribution, making D's output equal to 1/2 \u2502\n\u2502 everywhere, indicating it cannot distinguish between real and generated data. This framework allows for \u2502\n\u2502 specific training algorithms and optimization techniques, such as backpropagation and dropout, to be \u2502\n\u2502 effectively utilized. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
We can see that our RAG pipeline performs relatively well for simple queries, especially given the small size of the dataset. Scaling this method for converting a larger sample of PDFs would require more compute (GPUs) and a more advanced deployment of Weaviate (like Docker, Kubernetes, or Weaviate Cloud). For more information on available Weaviate configurations, check out the documentation.
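As a hedged sketch of such a scaled-up deployment (not part of this notebook), you could connect the v4 Python client to a Dockerized or Weaviate Cloud instance instead of the embedded one; the cluster URL and API key below are placeholders, and openai_api_key is the variable defined earlier in this notebook.

import weaviate
from weaviate.auth import AuthApiKey

# Local Docker/Kubernetes deployment
# client = weaviate.connect_to_local(headers={"X-OpenAI-Api-Key": openai_api_key})

# Weaviate Cloud deployment (placeholders to be replaced with your own values)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="<YOUR_CLUSTER_URL>",
    auth_credentials=AuthApiKey("<YOUR_WEAVIATE_API_KEY>"),
    headers={"X-OpenAI-Api-Key": openai_api_key},
)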
"},{"location":"examples/rag_weaviate/#rag-with-weaviate","title":"RAG with Weaviate\u00b6","text":""},{"location":"examples/rag_weaviate/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"This is a code recipe that uses Weaviate to perform RAG over PDF documents parsed by Docling.
In this notebook, we accomplish the following:
To run this notebook, you'll need:
Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running this notebook:
Note: If Colab prompts you to restart the session after running the cell below, click \"restart\" and proceed with running the rest of the notebook.
"},{"location":"examples/rag_weaviate/#part-1-docling","title":"\ud83d\udc25 Part 1: Docling\u00b6","text":"Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
"},{"location":"examples/rag_weaviate/#convert-pdfs-to-docling-documents","title":"Convert PDFs to Docling documents\u00b6","text":"Here we use Docling's .convert_all()
to parse a batch of PDFs. The result is a list of Docling documents that we can use for text extraction.
Note: Please ignore the ERR#
message.
We use Docling's HierarchicalChunker()
to perform hierarchy-aware chunking of our list of documents. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.
Transform our data from lists to a list of dictionaries for insertion into our Weaviate collection.
"},{"location":"examples/rag_weaviate/#insert-data-into-weaviate-and-generate-embeddings","title":"Insert data into Weaviate and generate embeddings\u00b6","text":"Embeddings will be generated upon insertion to our Weaviate collection.
"},{"location":"examples/rag_weaviate/#query-the-data","title":"Query the data\u00b6","text":"Here, we perform a simple similarity search to return the most similar embedded chunks to our search query.
"},{"location":"examples/rag_weaviate/#perform-rag-on-parsed-articles","title":"Perform RAG on parsed articles\u00b6","text":"Weaviate's generate
module allows you to perform RAG over your embedded data without having to use a separate framework.
We specify a prompt that includes the field we want to search through in the database (in this case it's text
), a query that includes our search term, and the number of retrieved results to use in the generation.
import os\nimport os In\u00a0[\u00a0]: Copied!
from huggingface_hub import snapshot_download\nfrom huggingface_hub import snapshot_download In\u00a0[\u00a0]: Copied!
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions\nfrom docling.document_converter import (\n ConversionResult,\n DocumentConverter,\n InputFormat,\n PdfFormatOption,\n)\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions from docling.document_converter import ( ConversionResult, DocumentConverter, InputFormat, PdfFormatOption, ) In\u00a0[\u00a0]: Copied!
def main():\n # Source document to convert\n source = \"https://arxiv.org/pdf/2408.09869v4\"\n\n # Download RappidOCR models from HuggingFace\n print(\"Downloading RapidOCR models\")\n download_path = snapshot_download(repo_id=\"SWHL/RapidOCR\")\n\n # Setup RapidOcrOptions for english detection\n det_model_path = os.path.join(\n download_path, \"PP-OCRv4\", \"en_PP-OCRv3_det_infer.onnx\"\n )\n rec_model_path = os.path.join(\n download_path, \"PP-OCRv4\", \"ch_PP-OCRv4_rec_server_infer.onnx\"\n )\n cls_model_path = os.path.join(\n download_path, \"PP-OCRv3\", \"ch_ppocr_mobile_v2.0_cls_train.onnx\"\n )\n ocr_options = RapidOcrOptions(\n det_model_path=det_model_path,\n rec_model_path=rec_model_path,\n cls_model_path=cls_model_path,\n )\n\n pipeline_options = PdfPipelineOptions(\n ocr_options=ocr_options,\n )\n\n # Convert the document\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n ),\n },\n )\n\n conversion_result: ConversionResult = converter.convert(source=source)\n doc = conversion_result.document\n md = doc.export_to_markdown()\n print(md)\ndef main(): # Source document to convert source = \"https://arxiv.org/pdf/2408.09869v4\" # Download RappidOCR models from HuggingFace print(\"Downloading RapidOCR models\") download_path = snapshot_download(repo_id=\"SWHL/RapidOCR\") # Setup RapidOcrOptions for english detection det_model_path = os.path.join( download_path, \"PP-OCRv4\", \"en_PP-OCRv3_det_infer.onnx\" ) rec_model_path = os.path.join( download_path, \"PP-OCRv4\", \"ch_PP-OCRv4_rec_server_infer.onnx\" ) cls_model_path = os.path.join( download_path, \"PP-OCRv3\", \"ch_ppocr_mobile_v2.0_cls_train.onnx\" ) ocr_options = RapidOcrOptions( det_model_path=det_model_path, rec_model_path=rec_model_path, cls_model_path=cls_model_path, ) pipeline_options = PdfPipelineOptions( ocr_options=ocr_options, ) # Convert the document converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ), }, ) conversion_result: ConversionResult = converter.convert(source=source) doc = conversion_result.document md = doc.export_to_markdown() print(md) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/retrieval_qdrant/","title":"Retrieval with Qdrant","text":"Step Tech Execution Embedding FastEmbed \ud83d\udcbb Local Vector store Qdrant \ud83d\udcbb Local
This example demonstrates using Docling with Qdrant to perform a hybrid search across your documents using dense and sparse vectors.
We'll chunk the documents using Docling before adding them to a Qdrant collection. By limiting the length of the chunks, we can preserve the meaning in each vector embedding.
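As a hedged illustration of limiting chunk length (this is not part of the example's code), Docling's HybridChunker accepts a tokenizer and a maximum token count; the model name and the 256-token cap below are assumptions chosen to match the dense embedding model configured later in this example.

from docling.chunking import HybridChunker

# Assumed settings: cap each chunk at 256 tokens, counted with the same
# tokenizer as the embedding model set further below.
chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",  # assumption: HF model id
    max_tokens=256,  # assumption: chunk-size cap
)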
Use the fastembed-gpu
package if you've got the hardware to support it.%pip install --no-warn-conflicts -q qdrant-client docling fastembed\n%pip install --no-warn-conflicts -q qdrant-client docling fastembed
Note: you may need to restart the kernel to use updated packages.\n
Let's import all the classes we'll be working with.
In\u00a0[2]: Copied!from qdrant_client import QdrantClient\n\nfrom docling.chunking import HybridChunker\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter\nfrom qdrant_client import QdrantClient from docling.chunking import HybridChunker from docling.datamodel.base_models import InputFormat from docling.document_converter import DocumentConverter
COLLECTION_NAME = \"docling\"\n\ndoc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML])\nclient = QdrantClient(location=\":memory:\")\n# The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI.\n# For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant\n# client = QdrantClient(location=\"http://localhost:6333\")\n\nclient.set_model(\"sentence-transformers/all-MiniLM-L6-v2\")\nclient.set_sparse_model(\"Qdrant/bm25\")\nCOLLECTION_NAME = \"docling\" doc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML]) client = QdrantClient(location=\":memory:\") # The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI. # For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant # client = QdrantClient(location=\"http://localhost:6333\") client.set_model(\"sentence-transformers/all-MiniLM-L6-v2\") client.set_sparse_model(\"Qdrant/bm25\")
/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n warnings.warn(\n
We can now download and chunk the document using Docling. For demonstration, we'll use an article about chunking strategies :)
In\u00a0[4]: Copied!result = doc_converter.convert(\n \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\"\n)\ndocuments, metadatas = [], []\nfor chunk in HybridChunker().chunk(result.document):\n documents.append(chunk.text)\n metadatas.append(chunk.meta.export_json_dict())\nresult = doc_converter.convert( \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\" ) documents, metadatas = [], [] for chunk in HybridChunker().chunk(result.document): documents.append(chunk.text) metadatas.append(chunk.meta.export_json_dict())
Let's now upload the documents to Qdrant.
The add()
method batches the documents and uses FastEmbed to generate vector embeddings on our machine._ = client.add(\n collection_name=COLLECTION_NAME,\n documents=documents,\n metadata=metadatas,\n batch_size=64,\n)\n_ = client.add( collection_name=COLLECTION_NAME, documents=documents, metadata=metadatas, batch_size=64, ) In\u00a0[6]: Copied!
points = client.query(\n collection_name=COLLECTION_NAME,\n query_text=\"Can I split documents?\",\n limit=10,\n)\npoints = client.query( collection_name=COLLECTION_NAME, query_text=\"Can I split documents?\", limit=10, ) In\u00a0[7]: Copied!
for i, point in enumerate(points):\n print(f\"=== {i} ===\")\n print(point.document)\n print()\nfor i, point in enumerate(points): print(f\"=== {i} ===\") print(point.document) print()
=== 0 ===\nHave you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:\n1. We start at the top of the document, treating the first part as a chunk.\n\u00a0\u00a0\u00a02. We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.\n \u00a0\u00a0\u00a03. We keep this up until we reach the end of the document.\nThe ultimate dream? Having an agent do this for you. But slow down! This approach is still being tested and isn't quite ready for the big leagues due to the time it takes to process multiple LLM calls and the cost of those calls. There's no implementation available in public libraries just yet. However, Greg Kamradt has his version available here.\n\n=== 1 ===\nDocument Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.\nDocument Specific Chunking can handle a variety of document formats, such as:\nMarkdown\nHTML\nPython\netc\nHere we\u2019ll take Markdown as our example and use a modified version of our first sample text:\n\u200d\nThe result is the following:\nYou can see here that with a chunk size of 105, the Markdown structure of the document is taken into account, and the chunks thus preserve the semantics of the text!\n\n=== 2 ===\nAnd there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.\nTo put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.\nBy the way, if you're eager to experiment with your own examples using the chunking visualisation tool featured in this blog, feel free to give it a try! You can access it right here. Enjoy, and happy chunking! \ud83d\ude09\n\n=== 3 ===\nRetrieval Augmented Generation (RAG) has been a hot topic in understanding, interpreting, and generating text with AI for the last few months. It's like a wonderful union of retrieval-based and generative models, creating a playground for researchers, data scientists, and natural language processing enthusiasts, like you and me.\nTo truly control the results produced by our RAG, we need to understand chunking strategies and their role in the process of retrieving and generating text. 
Indeed, each chunking strategy enhances RAG's effectiveness in its unique way.\nThe goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.\nLet's explore some chunking strategies together.\nThe methods mentioned in the article you're about to read usually make use of two key parameters. First, we have [chunk_size]\u2014 which controls the size of your text chunks. Then there's [chunk_overlap], which takes care of how much text overlaps between one chunk and the next.\n\n=== 4 ===\nSemantic Chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach ensures the information's integrity during retrieval, leading to a more accurate and contextually appropriate outcome.\nSemantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.\nBy focusing on the text's meaning and context, Semantic Chunking significantly enhances the quality of retrieval. It's a top-notch choice when maintaining the semantic integrity of the text is vital.\nHowever, this method does require more effort and is notably slower than the previous ones.\nOn our example text, since it is quite short and does not expose varied subjects, this method would only generate a single chunk.\n\n=== 5 ===\nLanguage models used in the rest of your possible RAG pipeline have a token limit, which should not be exceeded. When dividing your text into chunks, it's advisable to count the number of tokens. Plenty of tokenizers are available. To ensure accuracy, use the same tokenizer for counting tokens as the one used in the language model.\nConsequently, there are also splitters available for this purpose.\nFor instance, by using the [SpacyTextSplitter] from LangChain, the following chunks are created:\n\u200d\n\n=== 6 ===\nFirst things first, we have Character Chunking. This strategy divides the text into chunks based on a fixed number of characters. Its simplicity makes it a great starting point, but it can sometimes disrupt the text's flow, breaking sentences or words in unexpected places. Despite its limitations, it's a great stepping stone towards more advanced methods.\nNow let\u2019s see that in action with an example. Imagine a text that reads:\nIf we decide to set our chunk size to 100 and no chunk overlap, we'd end up with the following chunks. As you can see, Character Chunking can lead to some intriguing, albeit sometimes nonsensical, results, cutting some of the sentences in their middle.\nBy choosing a smaller chunk size, \u00a0we would obtain more chunks, and by setting a bigger chunk overlap, we could obtain something like this:\n\u200d\nAlso, by default this method creates chunks character by character based on the empty character [\u2019 \u2019]. But you can specify a different one in order to chunk on something else, even a complete word! For instance, by specifying [' '] as the separator, you can avoid cutting words in their middle.\n\n=== 7 ===\nNext, let's take a look at Recursive Character Chunking. 
Based on the basic concept of Character Chunking, this advanced version takes it up a notch by dividing the text into chunks until a certain condition is met, such as reaching a minimum chunk size. This method ensures that the chunking process aligns with the text's structure, preserving more meaning. Its adaptability makes Recursive Character Chunking great for texts with varied structures.\nAgain, let\u2019s use the same example in order to illustrate this method. With a chunk size of 100, and the default settings for the other parameters, we obtain the following chunks:\n\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/retrieval_qdrant/#retrieval-with-qdrant","title":"Retrieval with Qdrant\u00b6","text":""},{"location":"examples/retrieval_qdrant/#overview","title":"Overview\u00b6","text":""},{"location":"examples/retrieval_qdrant/#setup","title":"Setup\u00b6","text":""},{"location":"examples/retrieval_qdrant/#retrieval","title":"Retrieval\u00b6","text":""},{"location":"examples/run_md/","title":"Run md","text":"In\u00a0[\u00a0]: Copied!
import json\nimport logging\nimport os\nfrom pathlib import Path\nimport json import logging import os from pathlib import Path In\u00a0[\u00a0]: Copied!
import yaml\nimport yaml In\u00a0[\u00a0]: Copied!
from docling.backend.md_backend import MarkdownDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import InputDocument\nfrom docling.backend.md_backend import MarkdownDocumentBackend from docling.datamodel.base_models import InputFormat from docling.datamodel.document import InputDocument In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def main():\n input_paths = [Path(\"README.md\")]\n\n for path in input_paths:\n in_doc = InputDocument(\n path_or_stream=path,\n format=InputFormat.PDF,\n backend=MarkdownDocumentBackend,\n )\n mdb = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=path)\n document = mdb.convert()\n\n out_path = Path(\"scratch\")\n print(f\"Document {path} converted.\\nSaved markdown output to: {out_path!s}\")\n\n # Export Docling document format to markdowndoc:\n fn = os.path.basename(path)\n\n with (out_path / f\"{fn}.md\").open(\"w\") as fp:\n fp.write(document.export_to_markdown())\n\n with (out_path / f\"{fn}.json\").open(\"w\") as fp:\n fp.write(json.dumps(document.export_to_dict()))\n\n with (out_path / f\"{fn}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(document.export_to_dict()))\ndef main(): input_paths = [Path(\"README.md\")] for path in input_paths: in_doc = InputDocument( path_or_stream=path, format=InputFormat.PDF, backend=MarkdownDocumentBackend, ) mdb = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=path) document = mdb.convert() out_path = Path(\"scratch\") print(f\"Document {path} converted.\\nSaved markdown output to: {out_path!s}\") # Export Docling document format to markdowndoc: fn = os.path.basename(path) with (out_path / f\"{fn}.md\").open(\"w\") as fp: fp.write(document.export_to_markdown()) with (out_path / f\"{fn}.json\").open(\"w\") as fp: fp.write(json.dumps(document.export_to_dict())) with (out_path / f\"{fn}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(document.export_to_dict())) In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/run_with_accelerator/","title":"Accelerator options","text":"In\u00a0[\u00a0]: Copied!
from pathlib import Path\nfrom pathlib import Path In\u00a0[\u00a0]: Copied!
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n)\nfrom docling.datamodel.settings import settings\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, ) from docling.datamodel.settings import settings from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
def main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n # Explicitly set the accelerator\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.AUTO\n # )\n accelerator_options = AcceleratorOptions(\n num_threads=8, device=AcceleratorDevice.CPU\n )\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.MPS\n # )\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.CUDA\n # )\n\n # easyocr doesnt support cuda:N allocation, defaults to cuda:0\n # accelerator_options = AcceleratorOptions(num_threads=8, device=\"cuda:1\")\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.accelerator_options = accelerator_options\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n # Enable the profiling to measure the time spent\n settings.debug.profile_pipeline_timings = True\n\n # Convert the document\n conversion_result = converter.convert(input_doc_path)\n doc = conversion_result.document\n\n # List with total time per document\n doc_conversion_secs = conversion_result.timings[\"pipeline_total\"].times\n\n md = doc.export_to_markdown()\n print(md)\n print(f\"Conversion secs: {doc_conversion_secs}\")\ndef main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" # Explicitly set the accelerator # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.AUTO # ) accelerator_options = AcceleratorOptions( num_threads=8, device=AcceleratorDevice.CPU ) # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.MPS # ) # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.CUDA # ) # easyocr doesnt support cuda:N allocation, defaults to cuda:0 # accelerator_options = AcceleratorOptions(num_threads=8, device=\"cuda:1\") pipeline_options = PdfPipelineOptions() pipeline_options.accelerator_options = accelerator_options pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) # Enable the profiling to measure the time spent settings.debug.profile_pipeline_timings = True # Convert the document conversion_result = converter.convert(input_doc_path) doc = conversion_result.document # List with total time per document doc_conversion_secs = conversion_result.timings[\"pipeline_total\"].times md = doc.export_to_markdown() print(md) print(f\"Conversion secs: {doc_conversion_secs}\") In\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/run_with_formats/","title":"Multi-format conversion","text":"In\u00a0[\u00a0]: Copied!
import json\nimport logging\nfrom pathlib import Path\nimport json import logging from pathlib import Path In\u00a0[\u00a0]: Copied!
import yaml\nimport yaml In\u00a0[\u00a0]: Copied!
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import (\n DocumentConverter,\n PdfFormatOption,\n WordFormatOption,\n)\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend from docling.datamodel.base_models import InputFormat from docling.document_converter import ( DocumentConverter, PdfFormatOption, WordFormatOption, ) from docling.pipeline.simple_pipeline import SimplePipeline from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def main():\n input_paths = [\n Path(\"README.md\"),\n Path(\"tests/data/html/wiki_duck.html\"),\n Path(\"tests/data/docx/word_sample.docx\"),\n Path(\"tests/data/docx/lorem_ipsum.docx\"),\n Path(\"tests/data/pptx/powerpoint_sample.pptx\"),\n Path(\"tests/data/2305.03393v1-pg9-img.png\"),\n Path(\"tests/data/pdf/2206.01062.pdf\"),\n Path(\"tests/data/asciidoc/test_01.asciidoc\"),\n ]\n\n ## for defaults use:\n # doc_converter = DocumentConverter()\n\n ## to customize use:\n\n doc_converter = (\n DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[\n InputFormat.PDF,\n InputFormat.IMAGE,\n InputFormat.DOCX,\n InputFormat.HTML,\n InputFormat.PPTX,\n InputFormat.ASCIIDOC,\n InputFormat.CSV,\n InputFormat.MD,\n ], # whitelist formats, non-matching files are ignored.\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend\n ),\n InputFormat.DOCX: WordFormatOption(\n pipeline_cls=SimplePipeline # , backend=MsWordDocumentBackend\n ),\n },\n )\n )\n\n conv_results = doc_converter.convert_all(input_paths)\n\n for res in conv_results:\n out_path = Path(\"scratch\")\n print(\n f\"Document {res.input.file.name} converted.\"\n f\"\\nSaved markdown output to: {out_path!s}\"\n )\n _log.debug(res.document._export_to_indented_text(max_text_len=16))\n # Export Docling document format to markdowndoc:\n with (out_path / f\"{res.input.file.stem}.md\").open(\"w\") as fp:\n fp.write(res.document.export_to_markdown())\n\n with (out_path / f\"{res.input.file.stem}.json\").open(\"w\") as fp:\n fp.write(json.dumps(res.document.export_to_dict()))\n\n with (out_path / f\"{res.input.file.stem}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(res.document.export_to_dict()))\ndef main(): input_paths = [ Path(\"README.md\"), Path(\"tests/data/html/wiki_duck.html\"), Path(\"tests/data/docx/word_sample.docx\"), Path(\"tests/data/docx/lorem_ipsum.docx\"), Path(\"tests/data/pptx/powerpoint_sample.pptx\"), Path(\"tests/data/2305.03393v1-pg9-img.png\"), Path(\"tests/data/pdf/2206.01062.pdf\"), Path(\"tests/data/asciidoc/test_01.asciidoc\"), ] ## for defaults use: # doc_converter = DocumentConverter() ## to customize use: doc_converter = ( DocumentConverter( # all of the below is optional, has internal defaults. allowed_formats=[ InputFormat.PDF, InputFormat.IMAGE, InputFormat.DOCX, InputFormat.HTML, InputFormat.PPTX, InputFormat.ASCIIDOC, InputFormat.CSV, InputFormat.MD, ], # whitelist formats, non-matching files are ignored. format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend ), InputFormat.DOCX: WordFormatOption( pipeline_cls=SimplePipeline # , backend=MsWordDocumentBackend ), }, ) ) conv_results = doc_converter.convert_all(input_paths) for res in conv_results: out_path = Path(\"scratch\") print( f\"Document {res.input.file.name} converted.\" f\"\\nSaved markdown output to: {out_path!s}\" ) _log.debug(res.document._export_to_indented_text(max_text_len=16)) # Export Docling document format to markdowndoc: with (out_path / f\"{res.input.file.stem}.md\").open(\"w\") as fp: fp.write(res.document.export_to_markdown()) with (out_path / f\"{res.input.file.stem}.json\").open(\"w\") as fp: fp.write(json.dumps(res.document.export_to_dict())) with (out_path / f\"{res.input.file.stem}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(res.document.export_to_dict())) In\u00a0[\u00a0]: Copied!
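Note that this example writes its Markdown, JSON, and YAML outputs into a scratch/ directory relative to the working directory; creating that directory up front avoids a FileNotFoundError. The preparatory step below is our own addition, not part of the original script.

from pathlib import Path

# Assumed setup step: make sure the output directory exists before running main().
Path("scratch").mkdir(parents=True, exist_ok=True)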
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/serialization/","title":"Serialization","text":"
In this notebook we showcase the usage of Docling serializers.
In\u00a0[1]: Copied!%pip install -qU pip docling docling-core~=2.29 rich\n%pip install -qU pip docling docling-core~=2.29 rich
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
DOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\"\n\n# we set some start-stop cues for defining an excerpt to print\nstart_cue = \"Copyright \u00a9 2024\"\nstop_cue = \"Application of NLP to ESG\"\nDOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\" # we set some start-stop cues for defining an excerpt to print start_cue = \"Copyright \u00a9 2024\" stop_cue = \"Application of NLP to ESG\" In\u00a0[3]: Copied!
from rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console(width=210) # for preventing Markdown table wrapped rendering\n\n\ndef print_in_console(text):\n console.print(Panel(text))\nfrom rich.console import Console from rich.panel import Panel console = Console(width=210) # for preventing Markdown table wrapped rendering def print_in_console(text): console.print(Panel(text))
We first convert the document:
In\u00a0[4]: Copied!from docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\ndoc = converter.convert(source=DOC_SOURCE).document\nfrom docling.document_converter import DocumentConverter converter = DocumentConverter() doc = converter.convert(source=DOC_SOURCE).document
/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n warnings.warn(warn_msg)\n
We can now apply any BaseDocSerializer on the produced document.
\ud83d\udc49 Note that, to keep the shown output brief, we only print an excerpt.
E.g. below we apply an HTMLDocSerializer:
from docling_core.transforms.serializer.html import HTMLDocSerializer\n\nserializer = HTMLDocSerializer(doc=doc)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\n# we here only print an excerpt to keep the output brief:\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.serializer.html import HTMLDocSerializer serializer = HTMLDocSerializer(doc=doc) ser_result = serializer.serialize() ser_text = ser_result.text # we here only print an excerpt to keep the output brief: print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p> \u2502\n\u2502 <table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM \u2502\n\u2502 2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in \u2502\n\u2502 India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con- \u2502\n\u2502 sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate- \u2502\n\u2502 rials?</td><td>According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.</td></tr></tbody></table> \u2502\n\u2502 <p>Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.</p> \u2502\n\u2502 <p>ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response.</p> \u2502\n\u2502 <h2>Related Work</h2> \u2502\n\u2502 <p>The DocQA integrates multiple AI technologies, namely:</p> \u2502\n\u2502 <p>Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric \u2502\n\u2502 layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et \u2502\n\u2502 al. 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-</p> \u2502\n\u2502 <figure><figcaption>Figure 1: System architecture: Simplified sketch of document question-answering pipeline.</figcaption></figure> \u2502\n\u2502 <p>based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).</p> \u2502\n\u2502 <p> \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
In the following example, we use a MarkdownDocSerializer:
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\n\nserializer = MarkdownDocSerializer(doc=doc)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.serializer.markdown import MarkdownDocSerializer serializer = MarkdownDocSerializer(doc=doc) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 | Report | Question | Answer | \u2502\n\u2502 |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| \u2502\n\u2502 | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | \u2502\n\u2502 | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | \u2502\n\u2502 | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | \u2502\n\u2502 | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | \u2502\n\u2502 | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | \u2502\n\u2502 | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 \u2502\n\u2502 <!-- image --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Let's now assume we would like to reconfigure the Markdown serialization such that tables are serialized in a triplet format instead of Markdown tables, and pictures are rendered via a custom placeholder.
Check out the following configuration and notice the serialization differences in the output further below:
In\u00a0[7]: Copied!from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer\nfrom docling_core.transforms.serializer.markdown import MarkdownParams\n\nserializer = MarkdownDocSerializer(\n doc=doc,\n table_serializer=TripletTableSerializer(),\n params=MarkdownParams(\n image_placeholder=\"<!-- demo picture placeholder -->\",\n # ...\n ),\n)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer from docling_core.transforms.serializer.markdown import MarkdownParams serializer = MarkdownDocSerializer( doc=doc, table_serializer=TripletTableSerializer(), params=MarkdownParams( image_placeholder=\"\", # ... ), ) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The \u2502\n\u2502 rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the \u2502\n\u2502 Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption \u2502\n\u2502 in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were \u2502\n\u2502 made from recycled or renewable materials in FY22. \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . 
\u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 \u2502\n\u2502 <!-- demo picture placeholder --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
In the examples above, we were able to reuse existing implementations for our desired serialization strategy. Let's now assume we want to define custom serialization logic: for instance, we would like picture serialization to include any available picture description (captioning) annotations.
To that end, we first need to revisit our conversion and include all pipeline options needed for picture description enrichment.
In\u00a0[8]: Copied!from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionVlmOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions(\n do_picture_description=True,\n picture_description_options=PictureDescriptionVlmOptions(\n repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\",\n prompt=\"Describe this picture in three to five sentences. Be precise and concise.\",\n ),\n generate_picture_images=True,\n images_scale=2,\n)\n\nconverter = DocumentConverter(\n format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\ndoc = converter.convert(source=DOC_SOURCE).document\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PictureDescriptionVlmOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption pipeline_options = PdfPipelineOptions( do_picture_description=True, picture_description_options=PictureDescriptionVlmOptions( repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\", prompt=\"Describe this picture in three to five sentences. Be precise and concise.\", ), generate_picture_images=True, images_scale=2, ) converter = DocumentConverter( format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)} ) doc = converter.convert(source=DOC_SOURCE).document
/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n warnings.warn(warn_msg)\n
We can then define our custom picture serializer:
In\u00a0[9]: Copied!from typing import Any, Optional\n\nfrom docling_core.transforms.serializer.base import (\n BaseDocSerializer,\n SerializationResult,\n)\nfrom docling_core.transforms.serializer.common import create_ser_result\nfrom docling_core.transforms.serializer.markdown import (\n MarkdownParams,\n MarkdownPictureSerializer,\n)\nfrom docling_core.types.doc.document import (\n DoclingDocument,\n ImageRefMode,\n PictureDescriptionData,\n PictureItem,\n)\nfrom typing_extensions import override\n\n\nclass AnnotationPictureSerializer(MarkdownPictureSerializer):\n @override\n def serialize(\n self,\n *,\n item: PictureItem,\n doc_serializer: BaseDocSerializer,\n doc: DoclingDocument,\n separator: Optional[str] = None,\n **kwargs: Any,\n ) -> SerializationResult:\n text_parts: list[str] = []\n\n # reusing the existing result:\n parent_res = super().serialize(\n item=item,\n doc_serializer=doc_serializer,\n doc=doc,\n **kwargs,\n )\n text_parts.append(parent_res.text)\n\n # appending annotations:\n for annotation in item.annotations:\n if isinstance(annotation, PictureDescriptionData):\n text_parts.append(f\"<!-- Picture description: {annotation.text} -->\")\n\n text_res = (separator or \"\\n\").join(text_parts)\n return create_ser_result(text=text_res, span_source=item)\nfrom typing import Any, Optional from docling_core.transforms.serializer.base import ( BaseDocSerializer, SerializationResult, ) from docling_core.transforms.serializer.common import create_ser_result from docling_core.transforms.serializer.markdown import ( MarkdownParams, MarkdownPictureSerializer, ) from docling_core.types.doc.document import ( DoclingDocument, ImageRefMode, PictureDescriptionData, PictureItem, ) from typing_extensions import override class AnnotationPictureSerializer(MarkdownPictureSerializer): @override def serialize( self, *, item: PictureItem, doc_serializer: BaseDocSerializer, doc: DoclingDocument, separator: Optional[str] = None, **kwargs: Any, ) -> SerializationResult: text_parts: list[str] = [] # reusing the existing result: parent_res = super().serialize( item=item, doc_serializer=doc_serializer, doc=doc, **kwargs, ) text_parts.append(parent_res.text) # appending annotations: for annotation in item.annotations: if isinstance(annotation, PictureDescriptionData): text_parts.append(f\"\") text_res = (separator or \"\\n\").join(text_parts) return create_ser_result(text=text_res, span_source=item)
Last but not least, we define a new doc serializer which leverages our custom picture serializer.
Notice the picture description annotations in the output below:
In\u00a0[10]: Copied!serializer = MarkdownDocSerializer(\n doc=doc,\n picture_serializer=AnnotationPictureSerializer(),\n params=MarkdownParams(\n image_mode=ImageRefMode.PLACEHOLDER,\n image_placeholder=\"\",\n ),\n)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nserializer = MarkdownDocSerializer( doc=doc, picture_serializer=AnnotationPictureSerializer(), params=MarkdownParams( image_mode=ImageRefMode.PLACEHOLDER, image_placeholder=\"\", ), ) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 | Report | Question | Answer | \u2502\n\u2502 |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| \u2502\n\u2502 | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | \u2502\n\u2502 | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | \u2502\n\u2502 | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | \u2502\n\u2502 | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | \u2502\n\u2502 | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | \u2502\n\u2502 | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 <!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document \u2502\n\u2502 conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response \u2502\n\u2502 generation step involves generating a response from the information retrieval step. --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/serialization/#serialization","title":"Serialization\u00b6","text":""},{"location":"examples/serialization/#overview","title":"Overview\u00b6","text":""},{"location":"examples/serialization/#setup","title":"Setup\u00b6","text":""},{"location":"examples/serialization/#basic-usage","title":"Basic usage\u00b6","text":""},{"location":"examples/serialization/#configuring-a-serializer","title":"Configuring a serializer\u00b6","text":""},{"location":"examples/serialization/#creating-a-custom-serializer","title":"Creating a custom serializer\u00b6","text":""},{"location":"examples/tesseract_lang_detection/","title":"Automatic OCR language detection with tesseract","text":"In\u00a0[\u00a0]: Copied!
from pathlib import Path\nfrom pathlib import Path In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n TesseractCliOcrOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, TesseractCliOcrOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
def main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n # Set lang=[\"auto\"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions\n # ocr_options = TesseractOcrOptions(lang=[\"auto\"])\n ocr_options = TesseractCliOcrOptions(lang=[\"auto\"])\n\n pipeline_options = PdfPipelineOptions(\n do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options\n )\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n doc = converter.convert(input_doc_path).document\n md = doc.export_to_markdown()\n print(md)\ndef main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" # Set lang=[\"auto\"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions # ocr_options = TesseractOcrOptions(lang=[\"auto\"]) ocr_options = TesseractCliOcrOptions(lang=[\"auto\"]) pipeline_options = PdfPipelineOptions( do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options ) converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(input_doc_path).document md = doc.export_to_markdown() print(md) In\u00a0[\u00a0]: Copied!
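Since TesseractCliOcrOptions shells out to the tesseract executable, the binary (together with its language and script-detection data) must be installed and on PATH. The small pre-flight check below is a sketch of our own, not part of the example.

import shutil

# Fail early if the tesseract CLI is not available on this machine.
if shutil.which("tesseract") is None:
    raise RuntimeError("tesseract CLI not found; install it to run this example")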
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/translate/","title":"Simple translation","text":"In\u00a0[\u00a0]: Copied!
import logging\nfrom pathlib import Path\nimport logging from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling_core.types.doc import ImageRefMode, TableItem, TextItem\nfrom docling_core.types.doc import ImageRefMode, TableItem, TextItem In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
IMAGE_RESOLUTION_SCALE = 2.0\nIMAGE_RESOLUTION_SCALE = 2.0 In\u00a0[\u00a0]: Copied!
# FIXME: put in your favorite translation code ....\ndef translate(text: str, src: str = \"en\", dest: str = \"de\"):\n _log.warning(\"!!! IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!\")\n # from googletrans import Translator\n\n # Initialize the translator\n # translator = Translator()\n\n # Translate text from English to German\n # text = \"Hello, how are you?\"\n # translated = translator.translate(text, src=\"en\", dest=\"de\")\n\n return text\n# FIXME: put in your favorite translation code .... def translate(text: str, src: str = \"en\", dest: str = \"de\"): _log.warning(\"!!! IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!\") # from googletrans import Translator # Initialize the translator # translator = Translator() # Translate text from English to German # text = \"Hello, how are you?\" # translated = translator.translate(text, src=\"en\", dest=\"de\") return text In\u00a0[\u00a0]: Copied!
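As one possible way to fill in the FIXME above, here is a minimal sketch using the Hugging Face transformers translation pipeline. The model choice (Helsinki-NLP/opus-mt-en-de) and the extra transformers/sentencepiece dependencies are our own assumptions, not part of the original example; swap in any translation backend you prefer.

from transformers import pipeline

# Assumed en->de MarianMT checkpoint; src/dest are therefore fixed in this sketch.
_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def translate(text: str, src: str = "en", dest: str = "de"):
    if not text.strip():
        return text
    return _translator(text, max_length=512)[0]["translation_text"]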
def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n # Ensure the output directory exists before saving any files\n output_dir.mkdir(parents=True, exist_ok=True)\n\n # Important: For operating with page images, we must keep them, otherwise the DocumentConverter\n # will drop them to free up memory.\n # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.\n # scale=1 corresponds to a standard 72 DPI image\n # The PdfPipelineOptions.generate_* options select which document elements will be enriched\n # with the image field\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n conv_res = doc_converter.convert(input_doc_path)\n conv_doc = conv_res.document\n doc_filename = conv_res.input.file\n\n # Save markdown with embedded pictures in original text\n md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TextItem):\n element.orig = element.text\n element.text = translate(text=element.text)\n\n elif isinstance(element, TableItem):\n for cell in element.data.table_cells:\n # translate each cell's own text (not the parent table's)\n cell.text = translate(text=cell.text)\n\n # Save markdown with embedded pictures in translated text\n md_filename = output_dir / f\"{doc_filename}-with-images-translated.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n"},{"location":"examples/visual_grounding/","title":"Visual grounding","text":"
| Step | Tech | Execution |
|---|---|---|
| Embedding | Hugging Face / Sentence Transformers | \ud83d\udcbb Local |
| Vector store | Milvus | \ud83d\udcbb Local |
| Gen AI | Hugging Face Inference API | \ud83c\udf10 Remote |
This example showcases Docling's visual grounding capabilities, which can be combined with any agentic AI / RAG framework.
In this instance, we illustrate these capabilities using the LangChain Docling integration, together with a Milvus vector store and sentence-transformers embeddings.
The notebook uses HuggingFace's Inference API; for increased LLM quota, a token can be provided via the environment variable HF_TOKEN.
Requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):
%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain matplotlib python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom dotenv import load_dotenv\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_docling.loader import ExportType\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nSOURCES = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nQUESTION = \"Which are the main AI models in Docling?\"\nPROMPT = PromptTemplate.from_template(\n \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n)\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nimport os from pathlib import Path from tempfile import mkdtemp from dotenv import load_dotenv from langchain_core.prompts import PromptTemplate from langchain_docling.loader import ExportType def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") SOURCES = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" QUESTION = \"Which are the main AI models in Docling?\" PROMPT = PromptTemplate.from_template( \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\", ) TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=PdfPipelineOptions(\n generate_page_images=True,\n images_scale=2.0,\n ),\n )\n }\n)\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=PdfPipelineOptions( generate_page_images=True, images_scale=2.0, ), ) } )
We set up a simple doc store for keeping converted documents, as that is needed for visual grounding further below.
In\u00a0[4]: Copied!doc_store = {}\ndoc_store_root = Path(mkdtemp())\nfor source in SOURCES:\n dl_doc = converter.convert(source=source).document\n file_path = Path(doc_store_root / f\"{dl_doc.origin.binary_hash}.json\")\n dl_doc.save_as_json(file_path)\n doc_store[dl_doc.origin.binary_hash] = file_path\ndoc_store = {} doc_store_root = Path(mkdtemp()) for source in SOURCES: dl_doc = converter.convert(source=source).document file_path = Path(doc_store_root / f\"{dl_doc.origin.binary_hash}.json\") dl_doc.save_as_json(file_path) doc_store[dl_doc.origin.binary_hash] = file_path
Now we can instantiate our loader and load documents.
In\u00a0[5]: Copied!from langchain_docling import DoclingLoader\n\nfrom docling.chunking import HybridChunker\n\nloader = DoclingLoader(\n file_path=SOURCES,\n converter=converter,\n export_type=ExportType.DOC_CHUNKS,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n)\n\ndocs = loader.load()\nfrom langchain_docling import DoclingLoader from docling.chunking import HybridChunker loader = DoclingLoader( file_path=SOURCES, converter=converter, export_type=ExportType.DOC_CHUNKS, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ) docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (648 > 512). Running this sequence through the model will result in indexing errors\n
\ud83d\udc49 NOTE: As you see above, using the HybridChunker can sometimes lead to a warning from the transformers library; however, this is a "false alarm": for details, check here.
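If the (harmless) tokenizer warning shown above clutters your logs, one option is to lower the transformers logging verbosity before loading; this snippet is our own suggestion, not something the notebook itself does.

from transformers.utils import logging as hf_logging

# Optional: suppress the "Token indices sequence length ..." warning emitted
# by the underlying Hugging Face tokenizer.
hf_logging.set_verbosity_error()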
Inspecting some sample splits:
In\u00a0[6]: Copied!for d in docs[:3]:\n print(f\"- {d.page_content=}\")\nprint(\"...\")\nfor d in docs[:3]: print(f\"- {d.page_content=}\") print(\"...\")
- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R\u00a8 uschlikon, Switzerland'\n- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n- d.page_content='1 Introduction\\nConverting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and printing-optimized characteristic, which discards most structural features and metadata. With the advent of LLMs and popular application patterns such as retrieval-augmented generation (RAG), leveraging the rich content embedded in PDFs has become ever more relevant. In the past decade, several powerful document understanding solutions have emerged on the market, most of which are commercial software, cloud offerings [3] and most recently, multi-modal vision-language models. As of today, only a handful of open-source tools cover PDF conversion, leaving a significant feature and quality gap to proprietary solutions.\\nWith Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.\\nHere is what Docling delivers today:\\n\u00b7 Converts PDF documents to JSON or Markdown format, stable and lightning fast\\n\u00b7 Understands detailed page layout, reading order, locates figures and recovers table structures\\n\u00b7 Extracts metadata from the document, such as title, authors, references and language\\n\u00b7 Optionally applies OCR, e.g. for scanned PDFs\\n\u00b7 Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)\\n\u00b7 Can leverage different accelerators (GPU, MPS, etc).'\n...\nIn\u00a0[7]: Copied!
import json\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom langchain_huggingface.embeddings import HuggingFaceEmbeddings\nfrom langchain_milvus import Milvus\n\nembedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n\n\nmilvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\nvectorstore = Milvus.from_documents(\n documents=docs,\n embedding=embedding,\n collection_name=\"docling_demo\",\n connection_args={\"uri\": milvus_uri},\n index_params={\"index_type\": \"FLAT\"},\n drop_old=True,\n)\nimport json from pathlib import Path from tempfile import mkdtemp from langchain_huggingface.embeddings import HuggingFaceEmbeddings from langchain_milvus import Milvus embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID) milvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed vectorstore = Milvus.from_documents( documents=docs, embedding=embedding, collection_name=\"docling_demo\", connection_args={\"uri\": milvus_uri}, index_params={\"index_type\": \"FLAT\"}, drop_old=True, ) In\u00a0[8]: Copied!
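Before wiring up the full RAG chain, the vector store can be queried directly as a quick sanity check; this snippet is our own addition and uses the standard LangChain vector-store API.

# Retrieve the top-k most similar chunks for the question directly from Milvus.
hits = vectorstore.similarity_search(QUESTION, k=TOP_K)
for i, hit in enumerate(hits, start=1):
    print(f"{i}. {hit.page_content[:100]}...")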
from langchain.chains import create_retrieval_chain\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain_huggingface import HuggingFaceEndpoint\n\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\nllm = HuggingFaceEndpoint(\n repo_id=GEN_MODEL_ID,\n huggingfacehub_api_token=HF_TOKEN,\n)\n\n\ndef clip_text(text, threshold=100):\n return f\"{text[:threshold]}...\" if len(text) > threshold else text\nfrom langchain.chains import create_retrieval_chain from langchain.chains.combine_documents import create_stuff_documents_chain from langchain_huggingface import HuggingFaceEndpoint retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K}) llm = HuggingFaceEndpoint( repo_id=GEN_MODEL_ID, huggingfacehub_api_token=HF_TOKEN, ) def clip_text(text, threshold=100): return f\"{text[:threshold]}...\" if len(text) > threshold else text
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\nIn\u00a0[9]: Copied!
from docling.chunking import DocMeta\nfrom docling.datamodel.document import DoclingDocument\n\nquestion_answer_chain = create_stuff_documents_chain(llm, PROMPT)\nrag_chain = create_retrieval_chain(retriever, question_answer_chain)\nresp_dict = rag_chain.invoke({\"input\": QUESTION})\n\nclipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\nprint(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\nfrom docling.chunking import DocMeta from docling.datamodel.document import DoclingDocument question_answer_chain = create_stuff_documents_chain(llm, PROMPT) rag_chain = create_retrieval_chain(retriever, question_answer_chain) resp_dict = rag_chain.invoke({\"input\": QUESTION}) clipped_answer = clip_text(resp_dict[\"answer\"], threshold=200) print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")
/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_deprecation.py:131: FutureWarning: 'post' (from 'huggingface_hub.inference._client') is deprecated and will be removed from version '0.31.0'. Making direct POST requests to the inference server is not supported anymore. Please use task methods instead (e.g. `InferenceClient.chat_completion`). If your use case is not supported, please open an issue in https://github.com/huggingface/huggingface_hub.\n warnings.warn(warning_message, FutureWarning)\n
Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nThe main AI models in Docling are:\n1. A layout analysis model, an accurate object-detector for page elements.\n2. TableFormer, a state-of-the-art table structure recognition model.\nIn\u00a0[10]: Copied!
import matplotlib.pyplot as plt\nfrom PIL import ImageDraw\n\nfor i, doc in enumerate(resp_dict[\"context\"][:]):\n    image_by_page = {}\n    print(f\"Source {i + 1}:\")\n    print(f\"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n    meta = DocMeta.model_validate(doc.metadata[\"dl_meta\"])\n\n    # loading the full DoclingDocument from the document store:\n    dl_doc = DoclingDocument.load_from_json(doc_store.get(meta.origin.binary_hash))\n\n    for doc_item in meta.doc_items:\n        if doc_item.prov:\n            prov = doc_item.prov[0]  # here we only consider the first provenance item\n            page_no = prov.page_no\n            if img := image_by_page.get(page_no):\n                pass\n            else:\n                page = dl_doc.pages[prov.page_no]\n                print(f\"  page: {prov.page_no}\")\n                img = page.image.pil_image\n                image_by_page[page_no] = img\n            bbox = prov.bbox.to_top_left_origin(page_height=page.size.height)\n            bbox = bbox.normalized(page.size)\n            thickness = 2\n            padding = thickness + 2\n            bbox.l = round(bbox.l * img.width - padding)\n            bbox.r = round(bbox.r * img.width + padding)\n            bbox.t = round(bbox.t * img.height - padding)\n            bbox.b = round(bbox.b * img.height + padding)\n            draw = ImageDraw.Draw(img)\n            draw.rectangle(\n                xy=bbox.as_tuple(),\n                outline=\"blue\",\n                width=thickness,\n            )\n    for p in image_by_page:\n        img = image_by_page[p]\n        plt.figure(figsize=[15, 15])\n        plt.imshow(img)\n        plt.axis(\"off\")\n        plt.show()\n
Source 1:\n text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n page: 3\n
Source 2:\n text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n page: 2\n
Source 3:\n text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n page: 5\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/visual_grounding/#visual-grounding","title":"Visual grounding\u00b6","text":""},{"location":"examples/visual_grounding/#setup","title":"Setup\u00b6","text":""},{"location":"examples/visual_grounding/#document-store-setup","title":"Document store setup\u00b6","text":""},{"location":"examples/visual_grounding/#document-loading","title":"Document loading\u00b6","text":"
We first define our converter, in this case including options for keeping page images (for visual grounding).
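For orientation, a converter set up along these lines could look as follows. This is a minimal sketch, assuming the standard PdfPipelineOptions fields generate_page_images and images_scale; the scale value is only illustrative:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Keep a rendered bitmap of every page on the DoclingDocument,
# so that retrieved chunks can later be drawn onto the page image.
pipeline_options = PdfPipelineOptions(
    generate_page_images=True,
    images_scale=2.0,  # illustrative resolution factor for the stored page images
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)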
"},{"location":"examples/visual_grounding/#ingestion","title":"Ingestion\u00b6","text":""},{"location":"examples/visual_grounding/#rag","title":"RAG\u00b6","text":""},{"location":"examples/visual_grounding/#visual-grounding","title":"Visual grounding\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/","title":"VLM pipeline with remote model","text":"In\u00a0[\u00a0]: Copied!import logging\nimport os\nfrom pathlib import Path\nfrom typing import Optional\nimport logging import os from pathlib import Path from typing import Optional In\u00a0[\u00a0]: Copied!
import requests\nfrom docling_core.types.doc.page import SegmentedPage\nfrom dotenv import load_dotenv\nIn\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n    VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nIn\u00a0[\u00a0]: Copied!
def lms_vlm_options(model: str, prompt: str, format: ResponseFormat):\n    options = ApiVlmOptions(\n        url=\"http://localhost:1234/v1/chat/completions\",  # the default LM Studio endpoint\n        params=dict(\n            model=model,\n        ),\n        prompt=prompt,\n        timeout=90,\n        scale=1.0,\n        response_format=format,\n    )\n    return options\nIn\u00a0[\u00a0]: Copied!
def lms_olmocr_vlm_options(model: str):\n    def _dynamic_olmocr_prompt(page: Optional[SegmentedPage]):\n        if page is None:\n            return (\n                \"Below is the image of one page of a document. Just return the plain text\"\n                \" representation of this document as if you were reading it naturally.\\n\"\n                \"Do not hallucinate.\\n\"\n            )\n\n        anchor = [\n            f\"Page dimensions: {int(page.dimension.width)}x{int(page.dimension.height)}\"\n        ]\n\n        for text_cell in page.textline_cells:\n            if not text_cell.text.strip():\n                continue\n            bbox = text_cell.rect.to_bounding_box().to_bottom_left_origin(\n                page.dimension.height\n            )\n            anchor.append(f\"[{int(bbox.l)}x{int(bbox.b)}] {text_cell.text}\")\n\n        for image_cell in page.bitmap_resources:\n            bbox = image_cell.rect.to_bounding_box().to_bottom_left_origin(\n                page.dimension.height\n            )\n            anchor.append(\n                f\"[Image {int(bbox.l)}x{int(bbox.b)} to {int(bbox.r)}x{int(bbox.t)}]\"\n            )\n\n        if len(anchor) == 1:\n            anchor.append(\n                f\"[Image 0x0 to {int(page.dimension.width)}x{int(page.dimension.height)}]\"\n            )\n\n        # Original prompt uses cells sorting. We are skipping it in this demo.\n\n        base_text = \"\\n\".join(anchor)\n\n        return (\n            f\"Below is the image of one page of a document, as well as some raw textual\"\n            f\" content that was previously extracted for it. Just return the plain text\"\n            f\" representation of this document as if you were reading it naturally.\\n\"\n            f\"Do not hallucinate.\\n\"\n            f\"RAW_TEXT_START\\n{base_text}\\nRAW_TEXT_END\"\n        )\n\n    options = ApiVlmOptions(\n        url=\"http://localhost:1234/v1/chat/completions\",\n        params=dict(\n            model=model,\n        ),\n        prompt=_dynamic_olmocr_prompt,\n        timeout=90,\n        scale=1.0,\n        max_size=1024,  # from OlmOcr pipeline\n        response_format=ResponseFormat.MARKDOWN,\n    )\n    return options\nIn\u00a0[\u00a0]: Copied!
def ollama_vlm_options(model: str, prompt: str):\n    options = ApiVlmOptions(\n        url=\"http://localhost:11434/v1/chat/completions\",  # the default Ollama endpoint\n        params=dict(\n            model=model,\n        ),\n        prompt=prompt,\n        timeout=90,\n        scale=1.0,\n        response_format=ResponseFormat.MARKDOWN,\n    )\n    return options\nIn\u00a0[\u00a0]: Copied!
def watsonx_vlm_options(model: str, prompt: str):\n    load_dotenv()\n    api_key = os.environ.get(\"WX_API_KEY\")\n    project_id = os.environ.get(\"WX_PROJECT_ID\")\n\n    def _get_iam_access_token(api_key: str) -> str:\n        res = requests.post(\n            url=\"https://iam.cloud.ibm.com/identity/token\",\n            headers={\n                \"Content-Type\": \"application/x-www-form-urlencoded\",\n            },\n            data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\",\n        )\n        res.raise_for_status()\n        api_out = res.json()\n        print(f\"{api_out=}\")\n        return api_out[\"access_token\"]\n\n    options = ApiVlmOptions(\n        url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\",\n        params=dict(\n            model_id=model,\n            project_id=project_id,\n            parameters=dict(\n                max_new_tokens=400,\n            ),\n        ),\n        headers={\n            \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key),\n        },\n        prompt=prompt,\n        timeout=60,\n        response_format=ResponseFormat.MARKDOWN,\n    )\n    return options\nIn\u00a0[\u00a0]: Copied!
def main():\n    logging.basicConfig(level=logging.INFO)\n\n    data_folder = Path(__file__).parent / \"../../tests/data\"\n    input_doc_path = data_folder / \"pdf/2305.03393v1-pg9.pdf\"\n\n    pipeline_options = VlmPipelineOptions(\n        enable_remote_services=True  # <-- this is required!\n    )\n\n    # ApiVlmOptions() allows interfacing with APIs that support the\n    # multi-modal chat interface. Below are a few examples of how to configure them.\n\n    # One possibility is self-hosting a model, e.g. via LM Studio, Ollama or others.\n\n    # Example using the SmolDocling model with LM Studio:\n    # (enabled by default in this example)\n    pipeline_options.vlm_options = lms_vlm_options(\n        model=\"smoldocling-256m-preview-mlx-docling-snap\",\n        prompt=\"Convert this page to docling.\",\n        format=ResponseFormat.DOCTAGS,\n    )\n\n    # Example using the Granite Vision model with LM Studio:\n    # (uncomment the following lines)\n    # pipeline_options.vlm_options = lms_vlm_options(\n    #     model=\"granite-vision-3.2-2b\",\n    #     prompt=\"OCR the full page to markdown.\",\n    #     format=ResponseFormat.MARKDOWN,\n    # )\n\n    # Example using the OlmOcr (dynamic prompt) model with LM Studio:\n    # (uncomment the following lines)\n    # pipeline_options.vlm_options = lms_olmocr_vlm_options(\n    #     model=\"hf.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF\",\n    # )\n\n    # Example using the Granite Vision model with Ollama:\n    # (uncomment the following lines)\n    # pipeline_options.vlm_options = ollama_vlm_options(\n    #     model=\"granite3.2-vision:2b\",\n    #     prompt=\"OCR the full page to markdown.\",\n    # )\n\n    # Another possibility is using online services, e.g. watsonx.ai.\n    # Using it requires setting the env variables WX_API_KEY and WX_PROJECT_ID.\n    # (uncomment the following lines)\n    # pipeline_options.vlm_options = watsonx_vlm_options(\n    #     model=\"ibm/granite-vision-3-2-2b\", prompt=\"OCR the full page to markdown.\"\n    # )\n\n    # Create the DocumentConverter and launch the conversion.\n    doc_converter = DocumentConverter(\n        format_options={\n            InputFormat.PDF: PdfFormatOption(\n                pipeline_options=pipeline_options,\n                pipeline_cls=VlmPipeline,\n            )\n        }\n    )\n    result = doc_converter.convert(input_doc_path)\n    print(result.document.export_to_markdown())\nIn\u00a0[\u00a0]: Copied!
if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/vlm_pipeline_api_model/#example-of-apivlmoptions-definitions","title":"Example of ApiVlmOptions definitions\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#using-lm-studio","title":"Using LM Studio\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#using-lm-studio-with-olmocr-model","title":"Using LM Studio with OlmOcr model\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#using-ollama","title":"Using Ollama\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#using-a-cloud-service-like-ibm-watsonxai","title":"Using a cloud service like IBM watsonx.ai\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#usage-and-conversion","title":"Usage and conversion\u00b6","text":""},{"location":"faq/","title":"FAQ","text":"
This is a collection of FAQ items gathered from user questions on https://github.com/docling-project/docling/discussions.
Is Python 3.13 supported? Install conflicts with numpy (python 3.13) Is macOS x86_64 supported? Are text styles (bold, underline, etc) supported? How do I run completely offline? Which model weights are needed to run Docling? SSL error downloading model weights Which OCR languages are supported? Some images are missing from MS Word and Powerpoint HybridChunker
triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model' How to use flash attention?"},{"location":"faq/#is-python-313-supported","title":"Is Python 3.13 supported?","text":"Python 3.13 is supported from Docling 2.18.0.
"},{"location":"faq/#install-conflicts-with-numpy-python-313","title":"Install conflicts with numpy (python 3.13)","text":"When using docling-ibm-models>=2.0.7
and deepsearch-glm>=0.26.2
these issues should not show up anymore. Docling supports numpy versions >=1.24.4,<3.0.0
which should match all usages.
For older versions
This has been observed installing docling and langchain via poetry.
...\nThus, docling (>=2.7.0,<3.0.0) requires numpy (>=1.26.4,<2.0.0).\nSo, because ... depends on both numpy (>=2.0.2,<3.0.0) and docling (^2.7.0), version solving failed.\n
NumPy added Python 3.13 support only in its 2.x series. To prepare for 3.13, Docling depends on a NumPy 2.x version when running on Python 3.13, and on a 1.x version otherwise. If your pyproject.toml allows Python 3.13, Poetry will try to reconcile Docling's NumPy requirement for 3.13 (some 2.x version) with LangChain's requirement for the same (some 1.x version), leading to the error above.
Check if Python 3.13 is among the Python versions allowed by your pyproject.toml and if so, remove it and try again. E.g., if you have python = \"^3.10\", use python = \">=3.10,<3.13\" instead.
If you want to retain compatibility with python 3.9-3.13, you can also use a selector in pyproject.toml similar to the following
numpy = [\n { version = \"^2.1.0\", markers = 'python_version >= \"3.13\"' },\n { version = \"^1.24.4\", markers = 'python_version < \"3.13\"' },\n]\n
Source: Issue #283
"},{"location":"faq/#is-macos-x86_64-supported","title":"Is macOS x86_64 supported?","text":"Yes, Docling (still) supports running the standard pipeline on macOS x86_64.
However, users might run into incompatible dependencies on a fresh install: Docling depends on PyTorch, which dropped support for macOS x86_64 after the 2.2.2 release, and that older PyTorch version works only with NumPy 1.x, so users must ensure the correct NumPy version is installed.
pip install docling \"numpy<2.0.0\"\n
Source: Issue #1694.
"},{"location":"faq/#are-text-styles-bold-underline-etc-supported","title":"Are text styles (bold, underline, etc) supported?","text":"Currently text styles are not supported in the DoclingDocument
format. If you are interested in contributing this feature, please open a discussion topic to brainstorm on the design.
Note: this is not a simple topic.
"},{"location":"faq/#how-do-i-run-completely-offline","title":"How do I run completely offline?","text":"Docling is not using any remote service, hence it can run in completely isolated air-gapped environments.
The only requirement is pointing the Docling runtime to the location where the model artifacts have been stored.
For example
pipeline_options = PdfPipelineOptions(artifacts_path=\"your location\")\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n
Source: Issue #326
"},{"location":"faq/#which-model-weights-are-needed-to-run-docling","title":"Which model weights are needed to run Docling?","text":"Model weights are needed for the AI models used in the PDF pipeline. Other document types (docx, pptx, etc) do not have any such requirement.
For processing PDF documents, Docling requires the model weights from https://huggingface.co/ds4sd/docling-models.
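One way to pre-fetch these weights into a local folder (which can then be passed via artifacts_path, as in the offline example above) is a plain Hugging Face snapshot download. A minimal sketch with an illustrative target directory follows; note that the exact folder layout expected by artifacts_path may depend on the Docling version:
from huggingface_hub import snapshot_download

# Download the Docling model weights once, e.g. on a machine with network access.
# The target directory is illustrative; point artifacts_path at it afterwards.
snapshot_download(repo_id="ds4sd/docling-models", local_dir="./docling-artifacts")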
When OCR is enabled, some engines also require model artifacts. One example is EasyOCR, for which Docling provides dedicated pipeline options to control its runtime behavior.
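As a hedged sketch of running EasyOCR fully offline, assuming the EasyOcrOptions fields download_enabled and model_storage_directory and illustrative local paths:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(artifacts_path="./docling-artifacts")  # illustrative path
pipeline_options.ocr_options = EasyOcrOptions(
    download_enabled=False,                       # do not fetch OCR weights at runtime
    model_storage_directory="./easyocr-models",   # illustrative pre-populated folder
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)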
"},{"location":"faq/#ssl-error-downloading-model-weights","title":"SSL error downloading model weights","text":"URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)>\n
Similar SSL download errors have been observed by some users. This happens when model weights are fetched from Hugging Face. The error could happen when the python environment doesn't have an up-to-date list of trusted certificates.
Possible solutions are:
Run pip install --upgrade certifi to refresh the certifi package.
Set the environment variables SSL_CERT_FILE and REQUESTS_CA_BUNDLE to the value of python -m certifi
: CERT_PATH=$(python -m certifi)\nexport SSL_CERT_FILE=${CERT_PATH}\nexport REQUESTS_CA_BUNDLE=${CERT_PATH}\n
Docling supports multiple OCR engines, each of which has its own list of supported languages. Here is a collection of links to the original OCR engines' documentation listing the supported OCR languages.
Setting the OCR language in Docling is done via the OCR pipeline options:
from docling.datamodel.pipeline_options import PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.ocr_options.lang = [\"fr\", \"de\", \"es\", \"en\"] # example of languages for EasyOCR\n
"},{"location":"faq/#some-images-are-missing-from-ms-word-and-powerpoint","title":"Some images are missing from MS Word and Powerpoint","text":"The image processing library used by Docling is able to handle embedded WMF images only on Windows platform. If you are on other operating systems, these images will be ignored.
"},{"location":"faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model","title":"HybridChunker
triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model'","text":"TLDR: In the context of the HybridChunker
, this is a known & ancitipated \"false alarm\".
Details:
Using the HybridChunker
often triggers a warning like this:
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
This warning is emitted by transformers and refers to what would happen if the offending sequence were actually run through the (embedding) model, i.e. the problematic case only arises if that particular sequence is indeed passed to the model.
In our case, though, it is a \"false alarm\": the warning is triggered by the chunker tokenizing longer intermediate texts internally while determining the splits, whereas what ultimately gets embedded are the resulting chunks.
What is important is the actual token length of the produced chunks. The snippet below can be used for getting the actual maximum chunk size (for users wanting to confirm that this does not exceed the model limit):
chunk_max_len = 0\nfor i, chunk in enumerate(chunks):\n ser_txt = chunker.serialize(chunk=chunk)\n ser_tokens = len(tokenizer.tokenize(ser_txt))\n if ser_tokens > chunk_max_len:\n chunk_max_len = ser_tokens\n print(f\"{i}\\t{ser_tokens}\\t{repr(ser_txt[:100])}...\")\nprint(f\"Longest chunk yielded: {chunk_max_len} tokens\")\nprint(f\"Model max length: {tokenizer.model_max_length}\")\n
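For context, the chunker, tokenizer and chunks referenced in this snippet can be set up roughly as follows; this is a sketch only, and the embedding model ID and source URL are just examples:
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from transformers import AutoTokenizer

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # example model ID

doc = DocumentConverter().convert("https://arxiv.org/pdf/2408.09869").document
tokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)
chunker = HybridChunker(tokenizer=tokenizer)  # may emit the warning discussed above
chunks = list(chunker.chunk(dl_doc=doc))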
Also see docling#725.
Source: Issue docling-core#119
"},{"location":"faq/#how-to-use-flash-attention","title":"How to use flash attention?","text":"When running models in Docling on CUDA devices, you can enable the usage of the Flash Attention2 library.
Using environment variables:
DOCLING_CUDA_USE_FLASH_ATTENTION2=1\n
Using code:
from docling.datamodel.accelerator_options import (\n AcceleratorOptions,\n)\n\npipeline_options = VlmPipelineOptions(\n accelerator_options=AcceleratorOptions(cuda_use_flash_attention2=True)\n)\n
This requires having the flash-attn package installed. Below are two alternative ways for installing it:
# Building from sources (requires the CUDA dev environment)\npip install flash-attn\n\n# Using pre-built wheels (not available in all possible setups)\nFLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn\n
"},{"location":"installation/","title":"Installation","text":"To use Docling, simply install docling
from your Python package manager, e.g. pip:
pip install docling\n
Works on macOS, Linux, and Windows, with support for both x86_64 and arm64 architectures.
Alternative PyTorch distributions The Docling models depend on the PyTorch library. Depending on your architecture, you might want to use a different distribution of torch.
For example, you might want support for a different accelerator or a CPU-only version. All the different ways of installing torch are listed on the PyTorch website https://pytorch.org/.
One common situation is installation on Linux systems with CPU-only support. In this case, we suggest installing Docling with the following options:
# Example for installing on the Linux cpu-only version\npip install docling --extra-index-url https://download.pytorch.org/whl/cpu\n
Alternative OCR engines Docling supports multiple OCR engines for processing scanned documents. The current version provides the following engines.
Engine Installation Usage EasyOCR Default in Docling or via pip install easyocr
. EasyOcrOptions
Tesseract System dependency. See description for Tesseract and Tesserocr below. TesseractOcrOptions
Tesseract CLI System dependency. See description below. TesseractCliOcrOptions
OcrMac System dependency. See description below. OcrMacOptions
RapidOCR Extra feature not included in the default Docling installation; can be installed via pip install rapidocr_onnxruntime
RapidOcrOptions
OnnxTR Can be installed via the plugin system pip install \"docling-ocr-onnxtr[cpu]\"
. Please take a look at docling-OCR-OnnxTR. OnnxtrOcrOptions
The Docling DocumentConverter
allows choosing the OCR engine via the ocr_options
setting. For example:
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions, TesseractOcrOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_ocr = True\npipeline_options.ocr_options = TesseractOcrOptions()  # Use Tesseract (or EasyOcrOptions(), etc.)\n\ndoc_converter = DocumentConverter(\n    format_options={\n        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n    }\n)\n
Tesseract installation
Tesseract is a popular OCR engine which is available on most operating systems. For using this engine with Docling, Tesseract must be installed on your system, using the packaging tool of your choice. Below we provide example commands. After installing Tesseract you are expected to provide the path to its language files using the TESSDATA_PREFIX
environment variable (note that it must terminate with a slash /
).
brew install tesseract leptonica pkg-config\nTESSDATA_PREFIX=/opt/homebrew/share/tessdata/\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config\nTESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n
dnf install tesseract tesseract-devel tesseract-langpack-eng tesseract-osd leptonica-devel\nTESSDATA_PREFIX=/usr/share/tesseract/tessdata/\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n
Linking to Tesseract The most efficient usage of the Tesseract library is via linking; Docling uses the Tesserocr package for this.
If you run into installation issues with Tesserocr, we suggest using the following installation options:
pip uninstall tesserocr\npip install --no-binary :all: tesserocr\n
ocrmac installation
ocrmac uses Apple's Vision (or LiveText) framework as its OCR backend. To use this engine with Docling, ocrmac must be installed on your system. It only works on newer macOS versions (10.15+).
pip install ocrmac\n
Installation on macOS Intel (x86_64) When installing Docling on macOS with Intel processors, you might encounter errors with PyTorch compatibility. This happens because newer PyTorch versions (2.6.0+) no longer provide wheels for Intel-based Macs.
If you're using an Intel Mac, install Docling with a compatible PyTorch version. Note: PyTorch 2.2.2 requires Python 3.12 or lower, so make sure you're not using Python 3.13+.
# For uv users\nuv add torch==2.2.2 torchvision==0.17.2 docling\n\n# For pip users\npip install \"docling[mac_intel]\"\n\n# For Poetry users\npoetry add docling\n
"},{"location":"installation/#development-setup","title":"Development setup","text":"To develop Docling features, bugfixes etc., install as follows from your local clone's root dir:
uv sync --all-extras\n
"},{"location":"integrations/","title":"Integrations","text":"Use the navigation on the left to browse through Docling integrations with popular frameworks and tools.
"},{"location":"integrations/apify/","title":"Apify","text":"
You can run Docling in the cloud without installation using the Docling Actor on Apify platform. Simply provide a document URL and get the processed result:
apify call vancura/docling -i '{\n \"options\": {\n \"to_formats\": [\"md\", \"json\", \"html\", \"text\", \"doctags\"]\n },\n \"http_sources\": [\n {\"url\": \"https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf\"},\n {\"url\": \"https://arxiv.org/pdf/2408.09869\"}\n ]\n}'\n
The Actor stores results in:
OUTPUT_RESULT
DOCLING_LOG
Read more about the Docling Actor, including how to use it via the Apify API and CLI.
Docling is available as an extraction backend in the Bee framework.
Docling is available in Cloudera through the RAG Studio Accelerator for Machine Learning Projects (AMP).
Docling is available in CrewAI as the CrewDoclingSource
knowledge source.
Docling is used by the Data Prep Kit open-source toolkit for preparing unstructured data for LLM application development ranging from laptop scale to datacenter scale.
"},{"location":"integrations/data_prep_kit/#components","title":"Components","text":""},{"location":"integrations/data_prep_kit/#pdf-ingestion-to-parquet","title":"PDF ingestion to Parquet","text":"Docling is available as a file conversion method in DocETL:
Docling is available as a converter in Haystack:
Docling is powering document processing in InstructLab, enabling users to unlock the knowledge hidden in documents and present it to InstructLab's fine-tuning for aligning AI models to the user's specific data.
More details can be found in this blog post.
Docling is available in Kotaemon as the DoclingReader
loader:
Docling is available as an official LangChain extension.
To get started, check out the step-by-step guide in LangChain.
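As a quick orientation, a minimal sketch using the langchain-docling package (assuming its DoclingLoader API; the source URL is just an example):
from langchain_docling import DoclingLoader

loader = DoclingLoader(file_path="https://arxiv.org/pdf/2408.09869")  # example source
docs = loader.load()
print(docs[0].page_content[:200])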
Docling is available as an official LlamaIndex extension.
To get started, check out the step-by-step guide in LlamaIndex.
"},{"location":"integrations/llamaindex/#components","title":"Components","text":""},{"location":"integrations/llamaindex/#docling-reader","title":"Docling Reader","text":"Reads document files and uses Docling to populate LlamaIndex Document
objects \u2014 either serializing Docling's data model (losslessly, e.g. as JSON) or exporting to a simplified format (lossily, e.g. as Markdown).
Reads LlamaIndex Document
objects populated in Docling's format by Docling Reader and, using its knowledge of the Docling format, parses them to LlamaIndex Node
objects for downstream usage in LlamaIndex applications, e.g. as chunks for embedding.
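A minimal sketch combining the two components, assuming the llama-index-readers-docling and llama-index-node-parser-docling packages (the source URL is just an example):
from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)  # lossless Docling format
documents = reader.load_data("https://arxiv.org/pdf/2408.09869")
nodes = DoclingNodeParser().get_nodes_from_documents(documents)
print(f"{len(nodes)} nodes parsed")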
Docling is powering the NVIDIA PDF to Podcast agentic AI blueprint:
Docling is available as an ingestion engine for OpenContracts, allowing you to use Docling's OCR engine(s), chunker(s), labels, etc. and load the results into a platform supporting bulk data extraction, text annotation, and question-answering:
Docling is available as a plugin for Open WebUI.
Docling is available in Prodigy as a Prodigy-PDF plugin recipe.
More details can be found in this blog post.
Docling is powering document processing in Red Hat Enterprise Linux AI (RHEL AI), enabling users to unlock the knowledge hidden in documents and present it to InstructLab's fine-tuning for aligning AI models to the user's specific data.
Docling is available in spaCy as the spaCy Layout plugin.
More details can be found in this blog post.
Docling is available as a text extraction backend for txtai.
Docling is available as a document parser in Vectara.
This page provides documentation for our command line tools.
"},{"location":"reference/cli/#docling","title":"docling","text":"Usage:
docling [OPTIONS] source\n
Options:
Name Type Description Default --from
choice (docx
| pptx
| html
| image
| pdf
| asciidoc
| md
| csv
| xlsx
| xml_uspto
| xml_jats
| json_docling
| audio
) Specify input formats to convert from. Defaults to all formats. None --to
choice (md
| json
| html
| html_split_page
| text
| doctags
) Specify output formats. Defaults to Markdown. None --show-layout
/ --no-show-layout
boolean If enabled, the page images will show the bounding-boxes of the items. False
--headers
text Specify http request headers used when fetching url input sources in the form of a JSON string None --image-export-mode
choice (placeholder
| embedded
| referenced
) Image export mode for the document (only in case of JSON, Markdown or HTML). With placeholder
, only the position of the image is marked in the output. In embedded
mode, the image is embedded as base64 encoded string. In referenced
mode, the image is exported in PNG format and referenced from the main exported document. ImageRefMode.EMBEDDED
--pipeline
choice (standard
| vlm
| asr
) Choose the pipeline to process PDF or image files. ProcessingPipeline.STANDARD
--vlm-model
choice (smoldocling
| granite_vision
| granite_vision_ollama
) Choose the VLM model to use with PDF or image files. VlmModelType.SMOLDOCLING
--asr-model
choice (whisper_tiny
| whisper_small
| whisper_medium
| whisper_base
| whisper_large
| whisper_turbo
) Choose the ASR model to use with audio/video files. AsrModelType.WHISPER_TINY
--ocr
/ --no-ocr
boolean If enabled, the bitmap content will be processed using OCR. True
--force-ocr
/ --no-force-ocr
boolean Replace any existing text with OCR generated text over the full content. False
--ocr-engine
text The OCR engine to use. When --allow-external-plugins is not set, the available values are: easyocr, ocrmac, rapidocr, tesserocr, tesseract. Use the option --show-external-plugins to see the options allowed with external plugins. easyocr
--ocr-lang
text Provide a comma-separated list of languages used by the OCR engine. Note that each OCR engine has different values for the language names. None --pdf-backend
choice (pypdfium2
| dlparse_v1
| dlparse_v2
| dlparse_v4
) The PDF backend to use. PdfBackend.DLPARSE_V2
--table-mode
choice (fast
| accurate
) The mode to use in the table structure model. TableFormerMode.ACCURATE
--enrich-code
/ --no-enrich-code
boolean Enable the code enrichment model in the pipeline. False
--enrich-formula
/ --no-enrich-formula
boolean Enable the formula enrichment model in the pipeline. False
--enrich-picture-classes
/ --no-enrich-picture-classes
boolean Enable the picture classification enrichment model in the pipeline. False
--enrich-picture-description
/ --no-enrich-picture-description
boolean Enable the picture description model in the pipeline. False
--artifacts-path
path If provided, the location of the model artifacts. None --enable-remote-services
/ --no-enable-remote-services
boolean Must be enabled when using models connecting to remote services. False
--allow-external-plugins
/ --no-allow-external-plugins
boolean Must be enabled for loading modules from third-party plugins. False
--show-external-plugins
/ --no-show-external-plugins
boolean List the third-party plugins which are available when the option --allow-external-plugins is set. False
--abort-on-error
/ --no-abort-on-error
boolean If enabled, the processing will be aborted when the first error is encountered. False
--output
path Output directory where results are saved. .
--verbose
, -v
integer Set the verbosity level. -v for info logging, -vv for debug logging. 0
--debug-visualize-cells
/ --no-debug-visualize-cells
boolean Enable debug output which visualizes the PDF cells False
--debug-visualize-ocr
/ --no-debug-visualize-ocr
boolean Enable debug output which visualizes the OCR cells False
--debug-visualize-layout
/ --no-debug-visualize-layout
boolean Enable debug output which visualizes the layout clusters False
--debug-visualize-tables
/ --no-debug-visualize-tables
boolean Enable debug output which visualizes the table cells False
--version
boolean Show version information. None --document-timeout
float The timeout for processing each document, in seconds. None --num-threads
integer Number of threads 4
--device
choice (auto
| cpu
| cuda
| mps
) Accelerator device AcceleratorDevice.AUTO
--logo
boolean Docling logo None --help
boolean Show this message and exit. False
"},{"location":"reference/docling_document/","title":"Docling Document","text":"This is an automatic generated API reference of the DoclingDocument type.
"},{"location":"reference/docling_document/#docling_core.types.doc","title":"doc","text":"Package for models defined by the Document type.
Classes:
DoclingDocument
\u2013 DoclingDocument.
DocumentOrigin
\u2013 FileSource.
DocItem
\u2013 DocItem.
DocItemLabel
\u2013 DocItemLabel.
ProvenanceItem
\u2013 ProvenanceItem.
GroupItem
\u2013 GroupItem.
GroupLabel
\u2013 GroupLabel.
NodeItem
\u2013 NodeItem.
PageItem
\u2013 PageItem.
FloatingItem
\u2013 FloatingItem.
TextItem
\u2013 TextItem.
TableItem
\u2013 TableItem.
TableCell
\u2013 TableCell.
TableData
\u2013 BaseTableData.
TableCellLabel
\u2013 TableCellLabel.
KeyValueItem
\u2013 KeyValueItem.
SectionHeaderItem
\u2013 SectionItem.
PictureItem
\u2013 PictureItem.
ImageRef
\u2013 ImageRef.
PictureClassificationClass
\u2013 PictureClassificationData.
PictureClassificationData
\u2013 PictureClassificationData.
RefItem
\u2013 RefItem.
BoundingBox
\u2013 BoundingBox.
CoordOrigin
\u2013 CoordOrigin.
ImageRefMode
\u2013 ImageRefMode.
Size
\u2013 Size.
Bases: BaseModel
DoclingDocument.
Methods:
add_code
\u2013 add_code.
add_document
\u2013 Adds the content from the body of a DoclingDocument to this document under a specific parent.
add_form
\u2013 add_form.
add_formula
\u2013 add_formula.
add_group
\u2013 add_group.
add_heading
\u2013 add_heading.
add_inline_group
\u2013 add_inline_group.
add_key_values
\u2013 add_key_values.
add_list_group
\u2013 add_list_group.
add_list_item
\u2013 add_list_item.
add_node_items
\u2013 Adds multiple NodeItems and their children under a parent in this document.
add_ordered_list
\u2013 add_ordered_list.
add_page
\u2013 add_page.
add_picture
\u2013 add_picture.
add_table
\u2013 add_table.
add_text
\u2013 add_text.
add_title
\u2013 add_title.
add_unordered_list
\u2013 add_unordered_list.
append_child_item
\u2013 Adds an item.
check_version_is_compatible
\u2013 Check if this document version is compatible with SDK schema version.
delete_items
\u2013 Deletes an item, given its instance or ref, and any children it has.
delete_items_range
\u2013 Deletes all NodeItems and their children in the range from the start NodeItem to the end NodeItem.
export_to_dict
\u2013 Export to dict.
export_to_doctags
\u2013 Exports the document content to a DocumentToken format.
export_to_document_tokens
\u2013 Export to DocTags format.
export_to_element_tree
\u2013 Export_to_element_tree.
export_to_html
\u2013 Serialize to HTML.
export_to_markdown
\u2013 Serialize to Markdown.
export_to_text
\u2013 export_to_text.
extract_items_range
\u2013 Extracts NodeItems and children in the range from the start NodeItem to the end as a new DoclingDocument.
get_visualization
\u2013 Get visualization of the document as images by page.
insert_code
\u2013 Creates a new CodeItem item and inserts it into the document.
insert_document
\u2013 Inserts the content from the body of a DoclingDocument into this document at a specific position.
insert_form
\u2013 Creates a new FormItem item and inserts it into the document.
insert_formula
\u2013 Creates a new FormulaItem item and inserts it into the document.
insert_group
\u2013 Creates a new GroupItem item and inserts it into the document.
insert_heading
\u2013 Creates a new SectionHeaderItem item and inserts it into the document.
insert_inline_group
\u2013 Creates a new InlineGroup item and inserts it into the document.
insert_item_after_sibling
\u2013 Inserts an item, given its node_item instance, after other as a sibling.
insert_item_before_sibling
\u2013 Inserts an item, given its node_item instance, before other as a sibling.
insert_key_values
\u2013 Creates a new KeyValueItem item and inserts it into the document.
insert_list_group
\u2013 Creates a new ListGroup item and inserts it into the document.
insert_list_item
\u2013 Creates a new ListItem item and inserts it into the document.
insert_node_items
\u2013 Insert multiple NodeItems and their children at a specific position in the document.
insert_picture
\u2013 Creates a new PictureItem item and inserts it into the document.
insert_table
\u2013 Creates a new TableItem item and inserts it into the document.
insert_text
\u2013 Creates a new TextItem item and inserts it into the document.
insert_title
\u2013 Creates a new TitleItem item and inserts it into the document.
iterate_items
\u2013 Iterate elements with level.
load_from_doctags
\u2013 Load Docling document from lists of DocTags and Images.
load_from_json
\u2013 load_from_json.
load_from_yaml
\u2013 load_from_yaml.
num_pages
\u2013 num_pages.
print_element_tree
\u2013 Print_element_tree.
replace_item
\u2013 Replace item with new item.
save_as_doctags
\u2013 Save the document content to DocTags format.
save_as_document_tokens
\u2013 Save the document content to a DocumentToken format.
save_as_html
\u2013 Save to HTML.
save_as_json
\u2013 Save as json.
save_as_markdown
\u2013 Save to markdown.
save_as_yaml
\u2013 Save as yaml.
transform_to_content_layer
\u2013 transform_to_content_layer.
validate_document
\u2013 validate_document.
validate_misplaced_list_items
\u2013 validate_misplaced_list_items.
validate_tree
\u2013 validate_tree.
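Purely as orientation for the constructor and export methods listed above, a minimal sketch of building and exporting a document (illustrative content only):
from docling_core.types.doc import DocItemLabel, DoclingDocument

doc = DoclingDocument(name="demo")
doc.add_title(text="Quarterly report")
doc.add_heading(text="Overview", level=1)
doc.add_text(label=DocItemLabel.TEXT, text="Revenue grew in the third quarter.")
print(doc.export_to_markdown())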
Attributes:
body
(GroupItem
) \u2013 form_items
(List[FormItem]
) \u2013 furniture
(Annotated[GroupItem, Field(deprecated=True)]
) \u2013 groups
(List[Union[ListGroup, InlineGroup, GroupItem]]
) \u2013 key_value_items
(List[KeyValueItem]
) \u2013 name
(str
) \u2013 origin
(Optional[DocumentOrigin]
) \u2013 pages
(Dict[int, PageItem]
) \u2013 pictures
(List[PictureItem]
) \u2013 schema_name
(Literal['DoclingDocument']
) \u2013 tables
(List[TableItem]
) \u2013 texts
(List[Union[TitleItem, SectionHeaderItem, ListItem, CodeItem, FormulaItem, TextItem]]
) \u2013 version
(Annotated[str, StringConstraints(pattern=VERSION_PATTERN, strict=True)]
) \u2013 body: GroupItem = GroupItem(name='_root_', self_ref='#/body')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.form_items","title":"form_items","text":"form_items: List[FormItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.furniture","title":"furniture","text":"furniture: Annotated[GroupItem, Field(deprecated=True)] = GroupItem(name='_root_', self_ref='#/furniture', content_layer=FURNITURE)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.groups","title":"groups","text":"groups: List[Union[ListGroup, InlineGroup, GroupItem]] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.key_value_items","title":"key_value_items","text":"key_value_items: List[KeyValueItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.name","title":"name","text":"name: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.origin","title":"origin","text":"origin: Optional[DocumentOrigin] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.pages","title":"pages","text":"pages: Dict[int, PageItem] = {}\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.pictures","title":"pictures","text":"pictures: List[PictureItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.schema_name","title":"schema_name","text":"schema_name: Literal['DoclingDocument'] = 'DoclingDocument'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.tables","title":"tables","text":"tables: List[TableItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.texts","title":"texts","text":"texts: List[Union[TitleItem, SectionHeaderItem, ListItem, CodeItem, FormulaItem, TextItem]] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.version","title":"version","text":"version: Annotated[str, StringConstraints(pattern=VERSION_PATTERN, strict=True)] = CURRENT_VERSION\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_code","title":"add_code","text":"add_code(text: str, code_language: Optional[CodeLanguageLabel] = None, orig: Optional[str] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n
add_code.
Parameters:
text
(str
) \u2013 str:
code_language
(Optional[CodeLanguageLabel]
, default: None
) \u2013 Optional[str]: (Default value = None)
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
caption
(Optional[Union[TextItem, RefItem]]
, default: None
) \u2013 Optional[Union[TextItem:
RefItem]]
\u2013 (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_document(doc: DoclingDocument, parent: Optional[NodeItem] = None) -> None\n
Adds the content from the body of a DoclingDocument to this document under a specific parent.
Parameters:
doc
(DoclingDocument
) \u2013 DoclingDocument: The document whose content will be added
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: The parent NodeItem under which new items are added (Default value = None)
Returns:
None
\u2013 None
add_form(graph: GraphData, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None)\n
add_form.
Parameters:
graph
(GraphData
) \u2013 GraphData:
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_formula(text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n
add_formula.
Parameters:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
level
\u2013 LevelNumber: (Default value = 1)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_group(label: Optional[GroupLabel] = None, name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n
add_group.
Parameters:
label
(Optional[GroupLabel]
, default: None
) \u2013 Optional[GroupLabel]: (Default value = None)
name
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_heading(text: str, orig: Optional[str] = None, level: LevelNumber = 1, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n
add_heading.
Parameters:
label
\u2013 DocItemLabel:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
level
(LevelNumber
, default: 1
) \u2013 LevelNumber: (Default value = 1)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_inline_group(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> InlineGroup\n
add_inline_group.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_key_values","title":"add_key_values","text":"add_key_values(graph: GraphData, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None)\n
add_key_values.
Parameters:
graph
(GraphData
) \u2013 GraphData:
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_list_group(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> ListGroup\n
add_list_group.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_list_item","title":"add_list_item","text":"add_list_item(text: str, enumerated: bool = False, marker: Optional[str] = None, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n
add_list_item.
Parameters:
label
\u2013 str:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_node_items(node_items: List[NodeItem], doc: DoclingDocument, parent: Optional[NodeItem] = None) -> None\n
Adds multiple NodeItems and their children under a parent in this document.
Parameters:
node_items
(List[NodeItem]
) \u2013 list[NodeItem]: The NodeItems to be added
doc
(DoclingDocument
) \u2013 DoclingDocument: The document to which the NodeItems and their children belong
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: The parent NodeItem under which new items are added (Default value = None)
Returns:
None
\u2013 None
add_ordered_list(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n
add_ordered_list.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_page","title":"add_page","text":"add_page(page_no: int, size: Size, image: Optional[ImageRef] = None) -> PageItem\n
add_page.
Parameters:
page_no
(int
) \u2013 int:
size
(Size
) \u2013 Size:
add_picture(annotations: Optional[List[PictureDataType]] = None, image: Optional[ImageRef] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None)\n
add_picture.
Parameters:
data
\u2013 Optional[List[PictureData]]: (Default value = None)
caption
(Optional[Union[TextItem, RefItem]]
, default: None
) \u2013 Optional[Union[TextItem:
RefItem]]
\u2013 (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_table(data: TableData, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, label: DocItemLabel = TABLE, content_layer: Optional[ContentLayer] = None, annotations: Optional[list[TableAnnotationType]] = None)\n
add_table.
Parameters:
data
(TableData
) \u2013 TableData:
caption
(Optional[Union[TextItem, RefItem]]
, default: None
) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
label
(DocItemLabel
, default: TABLE
) \u2013 DocItemLabel: (Default value = DocItemLabel.TABLE)
add_text(label: DocItemLabel, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n
add_text.
Parameters:
label
(DocItemLabel
) \u2013 str:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_title(text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n
add_title.
Parameters:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
level
\u2013 LevelNumber: (Default value = 1)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent
(Optional[NodeItem]
, default: None
) \u2013 Optional[NodeItem]: (Default value = None)
add_unordered_list(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n
add_unordered_list.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.append_child_item","title":"append_child_item","text":"append_child_item(*, child: NodeItem, parent: Optional[NodeItem] = None) -> None\n
Adds an item as a child of the given parent node, or of the document body if no parent is provided.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.check_version_is_compatible","title":"check_version_is_compatible","text":"check_version_is_compatible(v: str) -> str\n
Check if this document version is compatible with SDK schema version.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.delete_items","title":"delete_items","text":"delete_items(*, node_items: List[NodeItem]) -> None\n
Deletes the given items, and any children they have.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.delete_items_range","title":"delete_items_range","text":"delete_items_range(*, start: NodeItem, end: NodeItem, start_inclusive: bool = True, end_inclusive: bool = True) -> None\n
Deletes all NodeItems and their children in the range from the start NodeItem to the end NodeItem.
Parameters:
start
(NodeItem
) \u2013 NodeItem: The starting NodeItem of the range
end
(NodeItem
) \u2013 NodeItem: The ending NodeItem of the range
start_inclusive
(bool
, default: True
) \u2013 bool: (Default value = True): If True, the start NodeItem will also be deleted
end_inclusive
(bool
, default: True
) \u2013 bool: (Default value = True): If True, the end NodeItem will also be deleted
Returns:
None
\u2013 None
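A rough sketch of deleting a contiguous slice of top-level body items; doc is an existing DoclingDocument and the chosen indices are illustrative:
# Resolve the body's child references into NodeItem instances.
body_items = [ref.resolve(doc) for ref in doc.body.children]
# Remove the 2nd through 4th top-level items, inclusive on both ends.
doc.delete_items_range(start=body_items[1], end=body_items[3])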
export_to_dict(mode: str = 'json', by_alias: bool = True, exclude_none: bool = True, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None) -> Dict[str, Any]\n
Export to dict.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(delim: str = '', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True, add_page_index: bool = True, add_table_cell_location: bool = False, add_table_cell_text: bool = True, minified: bool = False, pages: Optional[set[int]] = None) -> str\n
Exports the document content to a DocumentToken format.
Operates on a slice of the document's body as defined through arguments from_element and to_element; defaulting to the whole main_text.
Parameters:
delim
(str
, default: ''
) \u2013 str: (Default value = \"\") Deprecated
from_element
(int
, default: 0
) \u2013 int: (Default value = 0)
to_element
(int
, default: maxsize
) \u2013 int: (Default value = maxsize)
labels
(Optional[set[DocItemLabel]]
, default: None
) \u2013 set[DocItemLabel]
xsize
(int
, default: 500
) \u2013 int: (Default value = 500)
ysize
(int
, default: 500
) \u2013 int: (Default value = 500)
add_location
(bool
, default: True
) \u2013 bool: (Default value = True)
add_content
(bool
, default: True
) \u2013 bool: (Default value = True)
add_page_index
(bool
, default: True
) \u2013 bool: (Default value = True)
add_table_cell_location
(bool
, default: False
) \u2013 bool: (Default value = False)
add_table_cell_text
(bool
, default: True
) \u2013 bool: (Default value = True)
minified
(bool
, default: False
) \u2013 bool: (Default value = False)
pages
(Optional[set[int]]
, default: None
) \u2013 set[int]: (Default value = None)
Returns:
str
\u2013 The content of the document formatted as a DocTags string.
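For instance, serializing an already-built document to a DocTags string (arguments as in the signature above):
doctags = doc.export_to_doctags(add_page_index=True, minified=False)
print(doctags[:200])  # preview the first characters of the DocTags output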
export_to_document_tokens(*args, **kwargs)\n
Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_element_tree","title":"export_to_element_tree","text":"export_to_element_tree() -> str\n
Exports the document element tree as a string.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_html","title":"export_to_html","text":"export_to_html(from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, enable_chart_tables: bool = True, image_mode: ImageRefMode = PLACEHOLDER, formula_to_mathml: bool = True, page_no: Optional[int] = None, html_lang: str = 'en', html_head: str = 'null', included_content_layers: Optional[set[ContentLayer]] = None, split_page_view: bool = False, include_annotations: bool = True) -> str\n
Serialize to HTML.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(delim: str = '\\n\\n', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, strict_text: bool = False, escape_underscores: bool = True, image_placeholder: str = '<!-- image -->', enable_chart_tables: bool = True, image_mode: ImageRefMode = PLACEHOLDER, indent: int = 4, text_width: int = -1, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, page_break_placeholder: Optional[str] = None, include_annotations: bool = True, mark_annotations: bool = False) -> str\n
Serialize to Markdown.
Operates on a slice of the document's body as defined through arguments from_element and to_element; defaulting to the whole document.
Parameters:
delim
(str
, default: '\\n\\n'
) \u2013 Deprecated.
from_element
(int
, default: 0
) \u2013 Body slicing start index (inclusive). (Default value = 0).
to_element
(int
, default: maxsize
) \u2013 Body slicing stop index (exclusive). (Default value = maxsize).
labels
(Optional[set[DocItemLabel]]
, default: None
) \u2013 The set of document labels to include in the export. None falls back to the system-defined default.
strict_text
(bool
, default: False
) \u2013 Deprecated.
escape_underscores
(bool
, default: True
) \u2013 bool: Whether to escape underscores in the text content of the document. (Default value = True).
image_placeholder
(str
, default: '<!-- image -->'
) \u2013 The placeholder used to position images in the markdown. (Default value = \"<!-- image -->\").
image_mode
(ImageRefMode
, default: PLACEHOLDER
) \u2013 The mode to use for including images in the markdown. (Default value = ImageRefMode.PLACEHOLDER).
indent
(int
, default: 4
) \u2013 The indent in spaces of the nested lists. (Default value = 4).
included_content_layers
(Optional[set[ContentLayer]]
, default: None
) \u2013 The set of content layers to include in the export. None falls back to the system-defined default.
page_break_placeholder
(Optional[str]
, default: None
) \u2013 The placeholder to include for marking page breaks. None means no page break placeholder will be used.
include_annotations
(bool
, default: True
) \u2013 bool: Whether to include annotations in the export. (Default value = True).
mark_annotations
(bool
, default: False
) \u2013 bool: Whether to mark annotations in the export; only relevant if include_annotations is True. (Default value = False).
Returns:
str
\u2013 The exported Markdown representation.
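As a usage sketch with the Docling converter (the input path is a placeholder):
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("report.pdf")  # placeholder input file
markdown = result.document.export_to_markdown(
    image_placeholder="(image omitted)",
    page_break_placeholder="<!-- page break -->",
)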
export_to_text(delim: str = '\\n\\n', from_element: int = 0, to_element: int = 1000000, labels: Optional[set[DocItemLabel]] = None) -> str\n
Exports the document to plain text.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.extract_items_range","title":"extract_items_range","text":"extract_items_range(*, start: NodeItem, end: NodeItem, start_inclusive: bool = True, end_inclusive: bool = True, delete: bool = False) -> DoclingDocument\n
Extracts NodeItems and children in the range from the start NodeItem to the end as a new DoclingDocument.
Parameters:
start
(NodeItem
) \u2013 NodeItem: The starting NodeItem of the range (must be a direct child of the document body)
end
(NodeItem
) \u2013 NodeItem: The ending NodeItem of the range (must be a direct child of the document body)
start_inclusive
(bool
, default: True
) \u2013 bool: (Default value = True): If True, the start NodeItem will also be extracted
end_inclusive
(bool
, default: True
) \u2013 bool: (Default value = True): If True, the end NodeItem will also be extracted
delete
(bool
, default: False
) \u2013 bool: (Default value = False): If True, extracted items are deleted in the original document
Returns:
DoclingDocument
\u2013 DoclingDocument: A new document containing the extracted NodeItems and their children
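A rough sketch extracting the first few top-level items into a new document while leaving the original intact (indices illustrative):
body_items = [ref.resolve(doc) for ref in doc.body.children]
excerpt = doc.extract_items_range(start=body_items[0], end=body_items[2], delete=False)
excerpt.save_as_markdown("excerpt.md")  # the new DoclingDocument supports the usual exports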
get_visualization(show_label: bool = True, show_branch_numbering: bool = False) -> dict[Optional[int], Image]\n
Get visualization of the document as images by page.
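For example, assuming page images are available on the document, the per-page visualizations can be written out like this (sketch):
for page_no, image in doc.get_visualization().items():
    if page_no is not None:  # skip a possible None key (see the return type)
        image.save(f"page_{page_no}.png")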
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_code","title":"insert_code","text":"insert_code(sibling: NodeItem, text: str, code_language: Optional[CodeLanguageLabel] = None, orig: Optional[str] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> CodeItem\n
Creates a new CodeItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
text
(str
) \u2013 str:
code_language
(Optional[CodeLanguageLabel]
, default: None
) \u2013 Optional[str]: (Default value = None)
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
caption
(Optional[Union[TextItem, RefItem]]
, default: None
) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
formatting
(Optional[Formatting]
, default: None
) \u2013 Optional[Formatting]: (Default value = None)
hyperlink
(Optional[Union[AnyUrl, Path]]
, default: None
) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
CodeItem
\u2013 CodeItem: The newly created CodeItem item.
insert_document(doc: DoclingDocument, sibling: NodeItem, after: bool = True) -> None\n
Inserts the content from the body of a DoclingDocument into this document at a specific position.
Parameters:
doc
(DoclingDocument
) \u2013 DoclingDocument: The document whose content will be inserted
sibling
(NodeItem
) \u2013 NodeItem: The NodeItem after/before which the new items will be inserted
after
(bool
, default: True
) \u2013 bool: If True, insert after the sibling; if False, insert before (Default value = True)
Returns:
None
\u2013 None
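A rough sketch splicing one document into another; main_doc and appendix_doc are hypothetical, already-built DoclingDocument instances:
anchor = [ref.resolve(main_doc) for ref in main_doc.body.children][-1]  # last top-level item
main_doc.insert_document(doc=appendix_doc, sibling=anchor, after=True)  # append the appendix body after it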
insert_form(sibling: NodeItem, graph: GraphData, prov: Optional[ProvenanceItem] = None, after: bool = True) -> FormItem\n
Creates a new FormItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
graph
(GraphData
) \u2013 GraphData:
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
FormItem
\u2013 FormItem: The newly created FormItem item.
insert_formula(sibling: NodeItem, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> FormulaItem\n
Creates a new FormulaItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
formatting
(Optional[Formatting]
, default: None
) \u2013 Optional[Formatting]: (Default value = None)
hyperlink
(Optional[Union[AnyUrl, Path]]
, default: None
) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
FormulaItem
\u2013 FormulaItem: The newly created FormulaItem item.
insert_group(sibling: NodeItem, label: Optional[GroupLabel] = None, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> GroupItem\n
Creates a new GroupItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
label
(Optional[GroupLabel]
, default: None
) \u2013 Optional[GroupLabel]: (Default value = None)
name
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
GroupItem
\u2013 GroupItem: The newly created GroupItem.
insert_heading(sibling: NodeItem, text: str, orig: Optional[str] = None, level: LevelNumber = 1, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> SectionHeaderItem\n
Creates a new SectionHeaderItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
level
(LevelNumber
, default: 1
) \u2013 LevelNumber: (Default value = 1)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
formatting
(Optional[Formatting]
, default: None
) \u2013 Optional[Formatting]: (Default value = None)
hyperlink
(Optional[Union[AnyUrl, Path]]
, default: None
) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
SectionHeaderItem
\u2013 SectionHeaderItem: The newly created SectionHeaderItem item.
insert_inline_group(sibling: NodeItem, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> InlineGroup\n
Creates a new InlineGroup item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
name
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
InlineGroup
\u2013 InlineGroup: The newly created InlineGroup item.
insert_item_after_sibling(*, new_item: NodeItem, sibling: NodeItem) -> None\n
Inserts an item, given its node_item instance, directly after the given sibling.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_item_before_sibling","title":"insert_item_before_sibling","text":"insert_item_before_sibling(*, new_item: NodeItem, sibling: NodeItem) -> None\n
Inserts an item, given its node_item instance, directly before the given sibling.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_key_values","title":"insert_key_values","text":"insert_key_values(sibling: NodeItem, graph: GraphData, prov: Optional[ProvenanceItem] = None, after: bool = True) -> KeyValueItem\n
Creates a new KeyValueItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
graph
(GraphData
) \u2013 GraphData:
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
KeyValueItem
\u2013 KeyValueItem: The newly created KeyValueItem item.
insert_list_group(sibling: NodeItem, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> ListGroup\n
Creates a new ListGroup item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
name
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
ListGroup
\u2013 ListGroup: The newly created ListGroup item.
insert_list_item(sibling: NodeItem, text: str, enumerated: bool = False, marker: Optional[str] = None, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> ListItem\n
Creates a new ListItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
text
(str
) \u2013 str:
enumerated
(bool
, default: False
) \u2013 bool: (Default value = False)
marker
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
formatting
(Optional[Formatting]
, default: None
) \u2013 Optional[Formatting]: (Default value = None)
hyperlink
(Optional[Union[AnyUrl, Path]]
, default: None
) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
ListItem
\u2013 ListItem: The newly created ListItem item.
insert_node_items(sibling: NodeItem, node_items: List[NodeItem], doc: DoclingDocument, after: bool = True) -> None\n
Insert multiple NodeItems and their children at a specific position in the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem: The NodeItem after/before which the new items will be inserted
node_items
(List[NodeItem]
) \u2013 list[NodeItem]: The NodeItems to be inserted
doc
(DoclingDocument
) \u2013 DoclingDocument: The document to which the NodeItems and their children belong
after
(bool
, default: True
) \u2013 bool: If True, insert after the sibling; if False, insert before (Default value = True)
Returns:
None
\u2013 None
insert_picture(sibling: NodeItem, annotations: Optional[List[PictureDataType]] = None, image: Optional[ImageRef] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> PictureItem\n
Creates a new PictureItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
annotations
(Optional[List[PictureDataType]]
, default: None
) \u2013 Optional[List[PictureDataType]]: (Default value = None)
image
(Optional[ImageRef]
, default: None
) \u2013 Optional[ImageRef]: (Default value = None)
caption
(Optional[Union[TextItem, RefItem]]
, default: None
) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
PictureItem
\u2013 PictureItem: The newly created PictureItem item.
insert_table(sibling: NodeItem, data: TableData, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, label: DocItemLabel = TABLE, content_layer: Optional[ContentLayer] = None, annotations: Optional[list[TableAnnotationType]] = None, after: bool = True) -> TableItem\n
Creates a new TableItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
data
(TableData
) \u2013 TableData:
caption
(Optional[Union[TextItem, RefItem]]
, default: None
) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
label
(DocItemLabel
, default: TABLE
) \u2013 DocItemLabel: (Default value = DocItemLabel.TABLE)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
annotations
(Optional[list[TableAnnotationType]]
, default: None
) \u2013 Optional[List[TableAnnotationType]]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
TableItem
\u2013 TableItem: The newly created TableItem item.
insert_text(sibling: NodeItem, label: DocItemLabel, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> TextItem\n
Creates a new TextItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
label
(DocItemLabel
) \u2013 DocItemLabel:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
formatting
(Optional[Formatting]
, default: None
) \u2013 Optional[Formatting]: (Default value = None)
hyperlink
(Optional[Union[AnyUrl, Path]]
, default: None
) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
TextItem
\u2013 TextItem: The newly created TextItem item.
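For example, a sketch inserting a paragraph next to an existing item (anchor is an illustrative NodeItem already present in doc):
from docling_core.types.doc import DocItemLabel

new_item = doc.insert_text(
    sibling=anchor,
    label=DocItemLabel.TEXT,
    text="Text inserted right after the anchor item.",
    after=True,
)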
insert_title(sibling: NodeItem, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> TitleItem\n
Creates a new TitleItem item and inserts it into the document.
Parameters:
sibling
(NodeItem
) \u2013 NodeItem:
text
(str
) \u2013 str:
orig
(Optional[str]
, default: None
) \u2013 Optional[str]: (Default value = None)
prov
(Optional[ProvenanceItem]
, default: None
) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer
(Optional[ContentLayer]
, default: None
) \u2013 Optional[ContentLayer]: (Default value = None)
formatting
(Optional[Formatting]
, default: None
) \u2013 Optional[Formatting]: (Default value = None)
hyperlink
(Optional[Union[AnyUrl, Path]]
, default: None
) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after
(bool
, default: True
) \u2013 bool: (Default value = True)
Returns:
TitleItem
\u2013 TitleItem: The newly created TitleItem item.
iterate_items(root: Optional[NodeItem] = None, with_groups: bool = False, traverse_pictures: bool = False, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, _level: int = 0) -> Iterable[Tuple[NodeItem, int]]\n
Iterate elements with level.
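A small sketch walking the body and printing text items with their nesting level (doc is an existing DoclingDocument):
from docling_core.types.doc import TextItem

for item, level in doc.iterate_items():
    if isinstance(item, TextItem):
        print("  " * level + item.text)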
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_doctags","title":"load_from_doctags","text":"load_from_doctags(doctag_document: DocTagsDocument, document_name: str = 'Document') -> DoclingDocument\n
Load Docling document from lists of DocTags and Images.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_json","title":"load_from_json","text":"load_from_json(filename: Union[str, Path]) -> DoclingDocument\n
Loads a DoclingDocument from a JSON file.
Parameters:
filename
(Union[str, Path]
) \u2013 The path of the JSON file holding a saved DoclingDocument.
Returns:
DoclingDocument
\u2013 The loaded DoclingDocument.
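For example, a JSON round trip (file name illustrative):
from docling_core.types.doc import DoclingDocument

doc.save_as_json("doc.json")
restored = DoclingDocument.load_from_json("doc.json")
assert restored.name == doc.name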
load_from_yaml(filename: Union[str, Path]) -> DoclingDocument\n
Loads a DoclingDocument from a YAML file.
Parameters:
filename
(Union[str, Path]
) \u2013 The path of a YAML-serialized DoclingDocument to load.
Returns:
DoclingDocument
\u2013 The loaded DoclingDocument.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.num_pages","title":"num_pages","text":"num_pages()\n
Returns the number of pages in the document.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.print_element_tree","title":"print_element_tree","text":"print_element_tree()\n
Prints the document element tree.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.replace_item","title":"replace_item","text":"replace_item(*, new_item: NodeItem, old_item: NodeItem) -> None\n
Replace item with new item.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_doctags","title":"save_as_doctags","text":"save_as_doctags(filename: Union[str, Path], delim: str = '', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True, add_page_index: bool = True, add_table_cell_location: bool = False, add_table_cell_text: bool = True, minified: bool = False)\n
Save the document content to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_document_tokens","title":"save_as_document_tokens","text":"save_as_document_tokens(*args, **kwargs)\n
Save the document content to a DocumentToken format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_html","title":"save_as_html","text":"save_as_html(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, image_mode: ImageRefMode = PLACEHOLDER, formula_to_mathml: bool = True, page_no: Optional[int] = None, html_lang: str = 'en', html_head: str = 'null', included_content_layers: Optional[set[ContentLayer]] = None, split_page_view: bool = False, include_annotations: bool = True)\n
Save to HTML.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_json","title":"save_as_json","text":"save_as_json(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, image_mode: ImageRefMode = EMBEDDED, indent: int = 2, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None)\n
Save as json.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_markdown","title":"save_as_markdown","text":"save_as_markdown(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, delim: str = '\\n\\n', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, strict_text: bool = False, escaping_underscores: bool = True, image_placeholder: str = '<!-- image -->', image_mode: ImageRefMode = PLACEHOLDER, indent: int = 4, text_width: int = -1, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, page_break_placeholder: Optional[str] = None, include_annotations: bool = True)\n
Save to markdown.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_yaml","title":"save_as_yaml","text":"save_as_yaml(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, image_mode: ImageRefMode = EMBEDDED, default_flow_style: bool = False, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None)\n
Save as yaml.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.transform_to_content_layer","title":"transform_to_content_layer","text":"transform_to_content_layer(data: dict) -> dict\n
transform_to_content_layer.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_document","title":"validate_document","text":"validate_document(d: DoclingDocument)\n
Validates a DoclingDocument instance.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_misplaced_list_items","title":"validate_misplaced_list_items","text":"validate_misplaced_list_items()\n
validate_misplaced_list_items.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_tree","title":"validate_tree","text":"validate_tree(root) -> bool\n
Validates the document tree starting from the given root node.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin","title":"DocumentOrigin","text":" Bases: BaseModel
Describes the file source of a document: filename, mimetype, binary hash, and optional URI.
Methods:
parse_hex_string
\u2013 parse_hex_string.
validate_mimetype
\u2013 validate_mimetype.
Attributes:
binary_hash
(Uint64
) \u2013 filename
(str
) \u2013 mimetype
(str
) \u2013 uri
(Optional[AnyUrl]
) \u2013 binary_hash: Uint64\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.filename","title":"filename","text":"filename: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.mimetype","title":"mimetype","text":"mimetype: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.uri","title":"uri","text":"uri: Optional[AnyUrl] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.parse_hex_string","title":"parse_hex_string","text":"parse_hex_string(value)\n
parse_hex_string.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.validate_mimetype","title":"validate_mimetype","text":"validate_mimetype(v)\n
validate_mimetype.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem","title":"DocItem","text":" Bases: NodeItem
DocItem.
Methods:
get_annotations
\u2013 Get the annotations of this DocItem.
get_image
\u2013 Returns the image of this DocItem.
get_location_tokens
\u2013 Get the location string for the BaseCell.
get_ref
\u2013 get_ref.
Attributes:
children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 label
(DocItemLabel
) \u2013 model_config
\u2013 parent
(Optional[RefItem]
) \u2013 prov
(List[ProvenanceItem]
) \u2013 self_ref
(str
) \u2013 children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.label","title":"label","text":"label: DocItemLabel\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n
Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n
Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
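As a sketch, assuming the document was converted with page and picture images kept, the crops of all picture items can be exported like this:
for idx, picture in enumerate(doc.pictures):
    image = picture.get_image(doc)  # None if there is no valid provenance or page image
    if image is not None:
        image.save(f"picture_{idx}.png")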
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n
Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel","title":"DocItemLabel","text":" Bases: str
, Enum
DocItemLabel.
Methods:
get_color
\u2013 Return the RGB color associated with a given label.
Attributes:
CAPTION
\u2013 CHART
\u2013 CHECKBOX_SELECTED
\u2013 CHECKBOX_UNSELECTED
\u2013 CODE
\u2013 DOCUMENT_INDEX
\u2013 EMPTY_VALUE
\u2013 FOOTNOTE
\u2013 FORM
\u2013 FORMULA
\u2013 GRADING_SCALE
\u2013 HANDWRITTEN_TEXT
\u2013 KEY_VALUE_REGION
\u2013 LIST_ITEM
\u2013 PAGE_FOOTER
\u2013 PAGE_HEADER
\u2013 PARAGRAPH
\u2013 PICTURE
\u2013 REFERENCE
\u2013 SECTION_HEADER
\u2013 TABLE
\u2013 TEXT
\u2013 TITLE
\u2013 CAPTION = 'caption'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHART","title":"CHART","text":"CHART = 'chart'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHECKBOX_SELECTED","title":"CHECKBOX_SELECTED","text":"CHECKBOX_SELECTED = 'checkbox_selected'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHECKBOX_UNSELECTED","title":"CHECKBOX_UNSELECTED","text":"CHECKBOX_UNSELECTED = 'checkbox_unselected'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CODE","title":"CODE","text":"CODE = 'code'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.DOCUMENT_INDEX","title":"DOCUMENT_INDEX","text":"DOCUMENT_INDEX = 'document_index'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.EMPTY_VALUE","title":"EMPTY_VALUE","text":"EMPTY_VALUE = 'empty_value'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FOOTNOTE","title":"FOOTNOTE","text":"FOOTNOTE = 'footnote'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FORM","title":"FORM","text":"FORM = 'form'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FORMULA","title":"FORMULA","text":"FORMULA = 'formula'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.GRADING_SCALE","title":"GRADING_SCALE","text":"GRADING_SCALE = 'grading_scale'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.HANDWRITTEN_TEXT","title":"HANDWRITTEN_TEXT","text":"HANDWRITTEN_TEXT = 'handwritten_text'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.KEY_VALUE_REGION","title":"KEY_VALUE_REGION","text":"KEY_VALUE_REGION = 'key_value_region'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.LIST_ITEM","title":"LIST_ITEM","text":"LIST_ITEM = 'list_item'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PAGE_FOOTER","title":"PAGE_FOOTER","text":"PAGE_FOOTER = 'page_footer'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PAGE_HEADER","title":"PAGE_HEADER","text":"PAGE_HEADER = 'page_header'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PARAGRAPH","title":"PARAGRAPH","text":"PARAGRAPH = 'paragraph'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PICTURE","title":"PICTURE","text":"PICTURE = 'picture'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.REFERENCE","title":"REFERENCE","text":"REFERENCE = 'reference'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.SECTION_HEADER","title":"SECTION_HEADER","text":"SECTION_HEADER = 'section_header'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TABLE","title":"TABLE","text":"TABLE = 'table'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TEXT","title":"TEXT","text":"TEXT = 'text'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TITLE","title":"TITLE","text":"TITLE = 'title'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.get_color","title":"get_color","text":"get_color(label: DocItemLabel) -> Tuple[int, int, int]\n
Return the RGB color associated with a given label.
"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem","title":"ProvenanceItem","text":" Bases: BaseModel
ProvenanceItem.
Attributes:
bbox
(BoundingBox
) \u2013 charspan
(Tuple[int, int]
) \u2013 page_no
(int
) \u2013 bbox: BoundingBox\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.charspan","title":"charspan","text":"charspan: Tuple[int, int]\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.page_no","title":"page_no","text":"page_no: int\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem","title":"GroupItem","text":" Bases: NodeItem
GroupItem.
Methods:
get_ref
\u2013 get_ref.
Attributes:
children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 label
(GroupLabel
) \u2013 model_config
\u2013 name
(str
) \u2013 parent
(Optional[RefItem]
) \u2013 self_ref
(str
) \u2013 children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.label","title":"label","text":"label: GroupLabel = UNSPECIFIED\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.name","title":"name","text":"name: str = 'group'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel","title":"GroupLabel","text":" Bases: str
, Enum
GroupLabel.
Attributes:
CHAPTER
\u2013 COMMENT_SECTION
\u2013 FORM_AREA
\u2013 INLINE
\u2013 KEY_VALUE_AREA
\u2013 LIST
\u2013 ORDERED_LIST
\u2013 PICTURE_AREA
\u2013 SECTION
\u2013 SHEET
\u2013 SLIDE
\u2013 UNSPECIFIED
\u2013 CHAPTER = 'chapter'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.COMMENT_SECTION","title":"COMMENT_SECTION","text":"COMMENT_SECTION = 'comment_section'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.FORM_AREA","title":"FORM_AREA","text":"FORM_AREA = 'form_area'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.INLINE","title":"INLINE","text":"INLINE = 'inline'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.KEY_VALUE_AREA","title":"KEY_VALUE_AREA","text":"KEY_VALUE_AREA = 'key_value_area'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.LIST","title":"LIST","text":"LIST = 'list'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.ORDERED_LIST","title":"ORDERED_LIST","text":"ORDERED_LIST = 'ordered_list'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.PICTURE_AREA","title":"PICTURE_AREA","text":"PICTURE_AREA = 'picture_area'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SECTION","title":"SECTION","text":"SECTION = 'section'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SHEET","title":"SHEET","text":"SHEET = 'sheet'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SLIDE","title":"SLIDE","text":"SLIDE = 'slide'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.UNSPECIFIED","title":"UNSPECIFIED","text":"UNSPECIFIED = 'unspecified'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem","title":"NodeItem","text":" Bases: BaseModel
NodeItem.
Methods:
get_ref
\u2013 get_ref.
Attributes:
children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 model_config
\u2013 parent
(Optional[RefItem]
) \u2013 self_ref
(str
) \u2013 children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem","title":"PageItem","text":" Bases: BaseModel
PageItem.
Attributes:
image
(Optional[ImageRef]
) \u2013 page_no
(int
) \u2013 size
(Size
) \u2013 image: Optional[ImageRef] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.page_no","title":"page_no","text":"page_no: int\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.size","title":"size","text":"size: Size\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem","title":"FloatingItem","text":" Bases: DocItem
FloatingItem.
Methods:
caption_text
\u2013 Computes the caption as a single text.
get_annotations
\u2013 Get the annotations of this DocItem.
get_image
\u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens
\u2013 Get the location string for the BaseCell.
get_ref
\u2013 get_ref.
Attributes:
captions
(List[RefItem]
) \u2013 children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 footnotes
(List[RefItem]
) \u2013 image
(Optional[ImageRef]
) \u2013 label
(DocItemLabel
) \u2013 model_config
\u2013 parent
(Optional[RefItem]
) \u2013 prov
(List[ProvenanceItem]
) \u2013 references
(List[RefItem]
) \u2013 self_ref
(str
) \u2013 captions: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.children","title":"children","text":"children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.image","title":"image","text":"image: Optional[ImageRef] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.label","title":"label","text":"label: DocItemLabel\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.references","title":"references","text":"references: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n
Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n
Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n
Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n
Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem","title":"TextItem","text":" Bases: DocItem
TextItem.
Methods:
export_to_doctags
\u2013 Export text element to document tokens format.
export_to_document_tokens
\u2013 Export to DocTags format.
get_annotations
\u2013 Get the annotations of this DocItem.
get_image
\u2013 Returns the image of this DocItem.
get_location_tokens
\u2013 Get the location string for the BaseCell.
get_ref
\u2013 get_ref.
Attributes:
children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 formatting
(Optional[Formatting]
) \u2013 hyperlink
(Optional[Union[AnyUrl, Path]]
) \u2013 label
(Literal[CAPTION, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FOOTNOTE, PAGE_FOOTER, PAGE_HEADER, PARAGRAPH, REFERENCE, TEXT, EMPTY_VALUE]
) \u2013 model_config
\u2013 orig
(str
) \u2013 parent
(Optional[RefItem]
) \u2013 prov
(List[ProvenanceItem]
) \u2013 self_ref
(str
) \u2013 text
(str
) \u2013 children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.formatting","title":"formatting","text":"formatting: Optional[Formatting] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.hyperlink","title":"hyperlink","text":"hyperlink: Optional[Union[AnyUrl, Path]] = Field(union_mode='left_to_right', default=None)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.label","title":"label","text":"label: Literal[CAPTION, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FOOTNOTE, PAGE_FOOTER, PAGE_HEADER, PARAGRAPH, REFERENCE, TEXT, EMPTY_VALUE]\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.orig","title":"orig","text":"orig: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.text","title":"text","text":"text: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n
Export text element to document tokens format.
Parameters:
doc
(DoclingDocument
) \u2013 \"DoclingDocument\":
new_line
(str
, default: ''
) \u2013 str (Default value = \"\") Deprecated
xsize
(int
, default: 500
) \u2013 int: (Default value = 500)
ysize
(int
, default: 500
) \u2013 int: (Default value = 500)
add_location
(bool
, default: True
) \u2013 bool: (Default value = True)
add_content
(bool
, default: True
) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n
Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n
Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n
Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n
Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem","title":"TableItem","text":" Bases: FloatingItem
TableItem.
Methods:
add_annotation
\u2013 Add an annotation to the table.
caption_text
\u2013 Computes the caption as a single text.
export_to_dataframe
\u2013 Export the table as a Pandas DataFrame.
export_to_doctags
\u2013 Export table to document tokens format.
export_to_document_tokens
\u2013 Export to DocTags format.
export_to_html
\u2013 Export the table as html.
export_to_markdown
\u2013 Export the table as markdown.
export_to_otsl
\u2013 Export the table as OTSL.
get_annotations
\u2013 Get the annotations of this TableItem.
get_image
\u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens
\u2013 Get the location string for the BaseCell.
get_ref
\u2013 get_ref.
Attributes:
annotations
(List[TableAnnotationType]
) \u2013 captions
(List[RefItem]
) \u2013 children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 data
(TableData
) \u2013 footnotes
(List[RefItem]
) \u2013 image
(Optional[ImageRef]
) \u2013 label
(Literal[DOCUMENT_INDEX, TABLE]
) \u2013 model_config
\u2013 parent
(Optional[RefItem]
) \u2013 prov
(List[ProvenanceItem]
) \u2013 references
(List[RefItem]
) \u2013 self_ref
(str
) \u2013 annotations: List[TableAnnotationType] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.captions","title":"captions","text":"captions: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.children","title":"children","text":"children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.data","title":"data","text":"data: TableData\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.image","title":"image","text":"image: Optional[ImageRef] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.label","title":"label","text":"label: Literal[DOCUMENT_INDEX, TABLE] = TABLE\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.references","title":"references","text":"references: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.add_annotation","title":"add_annotation","text":"add_annotation(annotation: TableAnnotationType) -> None\n
Add an annotation to the table.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n
Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_dataframe","title":"export_to_dataframe","text":"export_to_dataframe() -> DataFrame\n
Export the table as a Pandas DataFrame.
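For instance, to dump every table of a converted document to CSV (pandas is pulled in by the DataFrame export):
for idx, table in enumerate(doc.tables):
    table.export_to_dataframe().to_csv(f"table_{idx}.csv", index=False)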
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_cell_location: bool = True, add_cell_text: bool = True, add_caption: bool = True)\n
Export table to document tokens format.
Parameters:
doc
(DoclingDocument
) \u2013 \"DoclingDocument\":
new_line
(str
, default: ''
) \u2013 str (Default value = \"\") Deprecated
xsize
(int
, default: 500
) \u2013 int: (Default value = 500)
ysize
(int
, default: 500
) \u2013 int: (Default value = 500)
add_location
(bool
, default: True
) \u2013 bool: (Default value = True)
add_cell_location
(bool
, default: True
) \u2013 bool: (Default value = True)
add_cell_text
(bool
, default: True
) \u2013 bool: (Default value = True)
add_caption
(bool
, default: True
) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n
Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_html","title":"export_to_html","text":"export_to_html(doc: Optional[DoclingDocument] = None, add_caption: bool = True) -> str\n
Export the table as html.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(doc: Optional[DoclingDocument] = None) -> str\n
Export the table as markdown.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_otsl","title":"export_to_otsl","text":"export_to_otsl(doc: DoclingDocument, add_cell_location: bool = True, add_cell_text: bool = True, xsize: int = 500, ysize: int = 500) -> str\n
Export the table as OTSL.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n
Get the annotations of this TableItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n
Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n
Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell","title":"TableCell","text":" Bases: BaseModel
TableCell.
Methods:
from_dict_format
\u2013 from_dict_format.
Attributes:
bbox
(Optional[BoundingBox]
) \u2013 col_span
(int
) \u2013 column_header
(bool
) \u2013 end_col_offset_idx
(int
) \u2013 end_row_offset_idx
(int
) \u2013 row_header
(bool
) \u2013 row_section
(bool
) \u2013 row_span
(int
) \u2013 start_col_offset_idx
(int
) \u2013 start_row_offset_idx
(int
) \u2013 text
(str
) \u2013 bbox: Optional[BoundingBox] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.col_span","title":"col_span","text":"col_span: int = 1\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.column_header","title":"column_header","text":"column_header: bool = False\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.end_col_offset_idx","title":"end_col_offset_idx","text":"end_col_offset_idx: int\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.end_row_offset_idx","title":"end_row_offset_idx","text":"end_row_offset_idx: int\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_header","title":"row_header","text":"row_header: bool = False\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_section","title":"row_section","text":"row_section: bool = False\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_span","title":"row_span","text":"row_span: int = 1\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.start_col_offset_idx","title":"start_col_offset_idx","text":"start_col_offset_idx: int\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.start_row_offset_idx","title":"start_row_offset_idx","text":"start_row_offset_idx: int\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.text","title":"text","text":"text: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.from_dict_format","title":"from_dict_format","text":"from_dict_format(data: Any) -> Any\n
from_dict_format.
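A minimal construction sketch; the import path follows the `docling_core.types.doc` locations in this reference, only the offset indices and the text are required here, and the remaining fields keep the defaults listed above:

from docling_core.types.doc import TableCell

cell = TableCell(
    text="Revenue",
    start_row_offset_idx=0, end_row_offset_idx=1,   # occupies row 0
    start_col_offset_idx=1, end_col_offset_idx=2,   # occupies column 1
    column_header=True,
)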
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData","title":"TableData","text":" Bases: BaseModel
TableData.
Methods:
add_row
\u2013 Add a new row to the table from a list of strings.
add_rows
\u2013 Add multiple new rows to the table from a list of lists of strings.
get_column_bounding_boxes
\u2013 Get the minimal bounding box for each column in the table.
get_row_bounding_boxes
\u2013 Get the minimal bounding box for each row in the table.
insert_row
\u2013 Insert a new row from a list of strings before/after a specific index in the table.
insert_rows
\u2013 Insert multiple new rows from a list of lists of strings before/after a specific index in the table.
pop_row
\u2013 Remove and return the last row from the table.
remove_row
\u2013 Remove a row from the table by its index.
remove_rows
\u2013 Remove rows from the table by their indices.
Attributes:
grid
(List[List[TableCell]]
) \u2013 grid.
num_cols
(int
) \u2013 num_rows
(int
) \u2013 table_cells
(List[TableCell]
) \u2013 grid: List[List[TableCell]]\n
grid.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.num_cols","title":"num_cols","text":"num_cols: int = 0\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.num_rows","title":"num_rows","text":"num_rows: int = 0\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.table_cells","title":"table_cells","text":"table_cells: List[TableCell] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.add_row","title":"add_row","text":"add_row(row: List[str]) -> None\n
Add a new row to the table from a list of strings.
Parameters:
row
(List[str]
) \u2013 List[str]: A list of strings representing the content of the new row.
Returns:
None
\u2013 None
add_rows(rows: List[List[str]]) -> None\n
Add multiple new rows to the table from a list of lists of strings.
Parameters:
rows
(List[List[str]]
) \u2013 List[List[str]]: A list of lists, where each inner list represents the content of a new row.
Returns:
None
\u2013 None
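A short sketch of both methods on a freshly created TableData (the import path follows the module location above; starting from the defaults listed for num_rows, num_cols and table_cells is an assumption of this sketch):

from docling_core.types.doc import TableData

data = TableData()                               # empty table with the defaults above
data.add_row(["Name", "Unit", "Value"])          # one row from plain strings
data.add_rows([
    ["height", "mm", "210"],
    ["width", "mm", "297"],
])                                               # several rows at once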
get_column_bounding_boxes() -> dict[int, BoundingBox]\n
Get the minimal bounding box for each column in the table.
Returns: dict[int, BoundingBox]: A mapping from column index to the minimal bounding box that encompasses all cells with bounding boxes in that column.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.get_row_bounding_boxes","title":"get_row_bounding_boxes","text":"get_row_bounding_boxes() -> dict[int, BoundingBox]\n
Get the minimal bounding box for each row in the table.
Returns: dict[int, BoundingBox]: A mapping from row index to the minimal bounding box that encompasses all cells with bounding boxes in that row.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.insert_row","title":"insert_row","text":"insert_row(row_index: int, row: List[str], after: bool = False) -> None\n
Insert a new row from a list of strings before/after a specific index in the table.
Parameters:
row_index
(int
) \u2013 int: The index at which to insert the new row. (Starting from 0)
row
(List[str]
) \u2013 List[str]: A list of strings representing the content of the new row.
after
(bool
, default: False
) \u2013 bool: If True, insert the row after the specified index, otherwise before it. (Default is False)
Returns:
None
\u2013 None
insert_rows(row_index: int, rows: List[List[str]], after: bool = False) -> None\n
Insert multiple new rows from a list of lists of strings before/after a specific index in the table.
Parameters:
row_index
(int
) \u2013 int: The index at which to insert the new rows. (Starting from 0)
rows
(List[List[str]]
) \u2013 List[List[str]]: A list of lists, where each inner list represents the content of a new row.
after
(bool
, default: False
) \u2013 bool: If True, insert the rows after the specified index, otherwise before it. (Default is False)
Returns:
None
\u2013 None
pop_row() -> List[TableCell]\n
Remove and return the last row from the table.
Returns:
List[TableCell]
\u2013 List[TableCell]: A list of TableCell objects representing the popped row.
remove_row(row_index: int) -> List[TableCell]\n
Remove a row from the table by its index.
Parameters:
row_index
(int
) \u2013 int: The index of the row to remove. (Starting from 0)
Returns:
List[TableCell]
\u2013 List[TableCell]: A list of TableCell objects representing the removed row.
remove_rows(indices: List[int]) -> List[List[TableCell]]\n
Remove rows from the table by their indices.
Parameters:
indices
(List[int]
) \u2013 List[int]: A list of indices of the rows to remove. (Starting from 0)
Returns:
List[List[TableCell]]
\u2013 List[List[TableCell]]: A list representation of the removed rows as lists of TableCell objects.
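A combined sketch of the index-based editing methods, under the same assumptions as the add_rows example above; the row contents are arbitrary:

from docling_core.types.doc import TableData

data = TableData()
data.add_rows([["a", "b"], ["c", "d"], ["e", "f"]])
data.insert_row(0, ["col 1", "col 2"])           # insert before index 0
data.insert_row(0, ["-", "-"], after=True)       # insert after index 0
last_row = data.pop_row()                        # remove and return the last row
removed = data.remove_rows([1, 2])               # remove rows 1 and 2 as lists of TableCell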
Bases: str
, Enum
TableCellLabel.
Methods:
get_color
\u2013 Return the RGB color associated with a given label.
Attributes:
BODY
\u2013 COLUMN_HEADER
\u2013 ROW_HEADER
\u2013 ROW_SECTION
\u2013 BODY = 'body'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.COLUMN_HEADER","title":"COLUMN_HEADER","text":"COLUMN_HEADER = 'col_header'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.ROW_HEADER","title":"ROW_HEADER","text":"ROW_HEADER = 'row_header'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.ROW_SECTION","title":"ROW_SECTION","text":"ROW_SECTION = 'row_section'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.get_color","title":"get_color","text":"get_color(label: TableCellLabel) -> Tuple[int, int, int]\n
Return the RGB color associated with a given label.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem","title":"KeyValueItem","text":" Bases: FloatingItem
KeyValueItem.
Methods:
caption_text
\u2013 Computes the caption as a single text.
export_to_document_tokens
\u2013 Export key value item to document tokens format.
get_annotations
\u2013 Get the annotations of this DocItem.
get_image
\u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens
\u2013 Get the location string for the BaseCell.
get_ref
\u2013 get_ref.
Attributes:
captions
(List[RefItem]
) \u2013 children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 footnotes
(List[RefItem]
) \u2013 graph
(GraphData
) \u2013 image
(Optional[ImageRef]
) \u2013 label
(Literal[KEY_VALUE_REGION]
) \u2013 model_config
\u2013 parent
(Optional[RefItem]
) \u2013 prov
(List[ProvenanceItem]
) \u2013 references
(List[RefItem]
) \u2013 self_ref
(str
) \u2013 captions: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.children","title":"children","text":"children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.graph","title":"graph","text":"graph: GraphData\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.image","title":"image","text":"image: Optional[ImageRef] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.label","title":"label","text":"label: Literal[KEY_VALUE_REGION] = KEY_VALUE_REGION\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.references","title":"references","text":"references: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n
Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.export_to_document_tokens","title":"export_to_document_tokens","text":"export_to_document_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n
Export key value item to document tokens format.
Parameters:
doc
(DoclingDocument
) \u2013 \"DoclingDocument\":
new_line
(str
, default: ''
) \u2013 str (Default value = \"\") Deprecated
xsize
(int
, default: 500
) \u2013 int: (Default value = 500)
ysize
(int
, default: 500
) \u2013 int: (Default value = 500)
add_location
(bool
, default: True
) \u2013 bool: (Default value = True)
add_content
(bool
, default: True
) \u2013 bool: (Default value = True)
get_annotations() -> Sequence[BaseAnnotation]\n
Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n
Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n
Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem","title":"SectionHeaderItem","text":" Bases: TextItem
SectionHeaderItem.
Methods:
export_to_doctags
\u2013 Export text element to document tokens format.
export_to_document_tokens
\u2013 Export to DocTags format.
get_annotations
\u2013 Get the annotations of this DocItem.
get_image
\u2013 Returns the image of this DocItem.
get_location_tokens
\u2013 Get the location string for the BaseCell.
get_ref
\u2013 get_ref.
Attributes:
children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 formatting
(Optional[Formatting]
) \u2013 hyperlink
(Optional[Union[AnyUrl, Path]]
) \u2013 label
(Literal[SECTION_HEADER]
) \u2013 level
(LevelNumber
) \u2013 model_config
\u2013 orig
(str
) \u2013 parent
(Optional[RefItem]
) \u2013 prov
(List[ProvenanceItem]
) \u2013 self_ref
(str
) \u2013 text
(str
) \u2013 children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.formatting","title":"formatting","text":"formatting: Optional[Formatting] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.hyperlink","title":"hyperlink","text":"hyperlink: Optional[Union[AnyUrl, Path]] = Field(union_mode='left_to_right', default=None)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.label","title":"label","text":"label: Literal[SECTION_HEADER] = SECTION_HEADER\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.level","title":"level","text":"level: LevelNumber = 1\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.orig","title":"orig","text":"orig: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.text","title":"text","text":"text: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n
Export text element to document tokens format.
Parameters:
doc
(DoclingDocument
) \u2013 \"DoclingDocument\":
new_line
(str
, default: ''
) \u2013 str (Default value = \"\") Deprecated
xsize
(int
, default: 500
) \u2013 int: (Default value = 500)
ysize
(int
, default: 500
) \u2013 int: (Default value = 500)
add_location
(bool
, default: True
) \u2013 bool: (Default value = True)
add_content
(bool
, default: True
) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n
Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n
Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n
Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n
Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem","title":"PictureItem","text":" Bases: FloatingItem
PictureItem.
Methods:
caption_text
\u2013 Computes the caption as a single text.
export_to_doctags
\u2013 Export picture to document tokens format.
export_to_document_tokens
\u2013 Export to DocTags format.
export_to_html
\u2013 Export picture to HTML format.
export_to_markdown
\u2013 Export picture to Markdown format.
get_annotations
\u2013 Get the annotations of this PictureItem.
get_image
\u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens
\u2013 Get the location string for the BaseCell.
get_ref
\u2013 get_ref.
Attributes:
annotations
(List[PictureDataType]
) \u2013 captions
(List[RefItem]
) \u2013 children
(List[RefItem]
) \u2013 content_layer
(ContentLayer
) \u2013 footnotes
(List[RefItem]
) \u2013 image
(Optional[ImageRef]
) \u2013 label
(Literal[PICTURE, CHART]
) \u2013 model_config
\u2013 parent
(Optional[RefItem]
) \u2013 prov
(List[ProvenanceItem]
) \u2013 references
(List[RefItem]
) \u2013 self_ref
(str
) \u2013 annotations: List[PictureDataType] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.captions","title":"captions","text":"captions: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.children","title":"children","text":"children: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.image","title":"image","text":"image: Optional[ImageRef] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.label","title":"label","text":"label: Literal[PICTURE, CHART] = PICTURE\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.references","title":"references","text":"references: List[RefItem] = []\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n
Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_caption: bool = True, add_content: bool = True)\n
Export picture to document tokens format.
Parameters:
doc
(DoclingDocument
) \u2013 \"DoclingDocument\":
new_line
(str
, default: ''
) \u2013 str (Default value = \"\") Deprecated
xsize
(int
, default: 500
) \u2013 int: (Default value = 500)
ysize
(int
, default: 500
) \u2013 int: (Default value = 500)
add_location
(bool
, default: True
) \u2013 bool: (Default value = True)
add_caption
(bool
, default: True
) \u2013 bool: (Default value = True)
add_content
(bool
, default: True
) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n
Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_html","title":"export_to_html","text":"export_to_html(doc: DoclingDocument, add_caption: bool = True, image_mode: ImageRefMode = PLACEHOLDER) -> str\n
Export picture to HTML format.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(doc: DoclingDocument, add_caption: bool = True, image_mode: ImageRefMode = EMBEDDED, image_placeholder: str = '<!-- image -->') -> str\n
Export picture to Markdown format.
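For example, a sketch assuming `doc` is a DoclingDocument whose `pictures` list is populated and whose picture images were kept during conversion; the import path follows the module locations above:

from docling_core.types.doc import ImageRefMode

pic = doc.pictures[0]
md_embedded = pic.export_to_markdown(doc, image_mode=ImageRefMode.EMBEDDED)     # image data embedded inline
md_marker = pic.export_to_markdown(doc, image_mode=ImageRefMode.PLACEHOLDER,
                                   image_placeholder="<!-- image -->")          # placeholder comment only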
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n
Get the annotations of this PictureItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n
Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n
Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef","title":"ImageRef","text":" Bases: BaseModel
ImageRef.
Methods:
from_pil
\u2013 Construct ImageRef from a PIL Image.
validate_mimetype
\u2013 validate_mimetype.
Attributes:
dpi
(int
) \u2013 mimetype
(str
) \u2013 pil_image
(Optional[Image]
) \u2013 Return the PIL Image.
size
(Size
) \u2013 uri
(Union[AnyUrl, Path]
) \u2013 dpi: int\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.mimetype","title":"mimetype","text":"mimetype: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.pil_image","title":"pil_image","text":"pil_image: Optional[Image]\n
Return the PIL Image.
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.size","title":"size","text":"size: Size\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.uri","title":"uri","text":"uri: Union[AnyUrl, Path] = Field(union_mode='left_to_right')\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.from_pil","title":"from_pil","text":"from_pil(image: Image, dpi: int) -> Self\n
Construct ImageRef from a PIL Image.
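A minimal sketch; "figure.png" is a placeholder path, the DPI value is arbitrary, and the import paths follow the module locations above:

from PIL import Image as PILImage
from docling_core.types.doc import ImageRef

pil_img = PILImage.open("figure.png")
ref = ImageRef.from_pil(image=pil_img, dpi=144)   # size and mimetype derived from the PIL image
print(ref.size, ref.mimetype)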
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.validate_mimetype","title":"validate_mimetype","text":"validate_mimetype(v)\n
validate_mimetype.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass","title":"PictureClassificationClass","text":" Bases: BaseModel
PictureClassificationClass.
Attributes:
class_name
(str
) \u2013 confidence
(float
) \u2013 class_name: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass.confidence","title":"confidence","text":"confidence: float\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData","title":"PictureClassificationData","text":" Bases: BaseAnnotation
PictureClassificationData.
Attributes:
kind
(Literal['classification']
) \u2013 predicted_classes
(List[PictureClassificationClass]
) \u2013 provenance
(str
) \u2013 kind: Literal['classification'] = 'classification'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.predicted_classes","title":"predicted_classes","text":"predicted_classes: List[PictureClassificationClass]\n
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.provenance","title":"provenance","text":"provenance: str\n
"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem","title":"RefItem","text":" Bases: BaseModel
RefItem.
Methods:
get_ref
\u2013 get_ref.
resolve
\u2013 Resolve the path in the document.
Attributes:
cref
(str
) \u2013 model_config
\u2013 cref: str = Field(alias='$ref', pattern=_JSON_POINTER_REGEX)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.model_config","title":"model_config","text":"model_config = ConfigDict(populate_by_name=True)\n
"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.get_ref","title":"get_ref","text":"get_ref()\n
get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.resolve","title":"resolve","text":"resolve(doc: DoclingDocument)\n
Resolve the path in the document.
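A short sketch of following a reference pointer back to its target item, assuming `doc` is a DoclingDocument with at least one child under `doc.body`:

ref = doc.body.children[0]       # a RefItem, e.g. pointing at "#/texts/0"
item = ref.resolve(doc)          # the item the JSON pointer refers to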
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox","title":"BoundingBox","text":" Bases: BaseModel
BoundingBox.
Methods:
area
\u2013 area.
as_tuple
\u2013 as_tuple.
enclosing_bbox
\u2013 Create a bounding box that covers all of the given boxes.
expand_by_scale
\u2013 expand_by_scale.
from_tuple
\u2013 from_tuple.
intersection_area_with
\u2013 Calculate the intersection area with another bounding box.
intersection_over_self
\u2013 intersection_over_self.
intersection_over_union
\u2013 intersection_over_union.
is_above
\u2013 is_above.
is_horizontally_connected
\u2013 is_horizontally_connected.
is_left_of
\u2013 is_left_of.
is_strictly_above
\u2013 is_strictly_above.
is_strictly_left_of
\u2013 is_strictly_left_of.
normalized
\u2013 normalized.
overlaps
\u2013 overlaps.
overlaps_horizontally
\u2013 Check if two bounding boxes overlap horizontally.
overlaps_vertically
\u2013 Check if two bounding boxes overlap vertically.
overlaps_vertically_with_iou
\u2013 overlaps_vertically_with_iou.
resize_by_scale
\u2013 resize_by_scale.
scale_to_size
\u2013 scale_to_size.
scaled
\u2013 scaled.
to_bottom_left_origin
\u2013 to_bottom_left_origin.
to_top_left_origin
\u2013 to_top_left_origin.
union_area_with
\u2013 Calculates the union area with another bounding box.
x_overlap_with
\u2013 Calculates the horizontal overlap with another bounding box.
x_union_with
\u2013 Calculates the horizontal union dimension with another bounding box.
y_overlap_with
\u2013 Calculates the vertical overlap with another bounding box, respecting coordinate origin.
y_union_with
\u2013 Calculates the vertical union dimension with another bounding box, respecting coordinate origin.
Attributes:
b
(float
) \u2013 coord_origin
(CoordOrigin
) \u2013 height
\u2013 height.
l
(float
) \u2013 r
(float
) \u2013 t
(float
) \u2013 width
\u2013 width.
b: float\n
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.coord_origin","title":"coord_origin","text":"coord_origin: CoordOrigin = TOPLEFT\n
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.height","title":"height","text":"height\n
height.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.l","title":"l","text":"l: float\n
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.r","title":"r","text":"r: float\n
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.t","title":"t","text":"t: float\n
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.width","title":"width","text":"width\n
width.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.area","title":"area","text":"area() -> float\n
area.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.as_tuple","title":"as_tuple","text":"as_tuple() -> Tuple[float, float, float, float]\n
as_tuple.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.enclosing_bbox","title":"enclosing_bbox","text":"enclosing_bbox(boxes: List[BoundingBox]) -> BoundingBox\n
Create a bounding box that covers all of the given boxes.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.expand_by_scale","title":"expand_by_scale","text":"expand_by_scale(x_scale: float, y_scale: float) -> BoundingBox\n
expand_by_scale.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.from_tuple","title":"from_tuple","text":"from_tuple(coord: Tuple[float, ...], origin: CoordOrigin)\n
from_tuple.
Parameters:
coord
(Tuple[float, ...]
) \u2013 Tuple[float, ...]:
origin
(CoordOrigin
) \u2013 CoordOrigin:
intersection_area_with(other: BoundingBox) -> float\n
Calculate the intersection area with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_over_self","title":"intersection_over_self","text":"intersection_over_self(other: BoundingBox, eps: float = 1e-06) -> float\n
intersection_over_self.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_over_union","title":"intersection_over_union","text":"intersection_over_union(other: BoundingBox, eps: float = 1e-06) -> float\n
intersection_over_union.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_above","title":"is_above","text":"is_above(other: BoundingBox) -> bool\n
is_above.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_horizontally_connected","title":"is_horizontally_connected","text":"is_horizontally_connected(elem_i: BoundingBox, elem_j: BoundingBox) -> bool\n
is_horizontally_connected.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_left_of","title":"is_left_of","text":"is_left_of(other: BoundingBox) -> bool\n
is_left_of.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_strictly_above","title":"is_strictly_above","text":"is_strictly_above(other: BoundingBox, eps: float = 0.001) -> bool\n
is_strictly_above.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_strictly_left_of","title":"is_strictly_left_of","text":"is_strictly_left_of(other: BoundingBox, eps: float = 0.001) -> bool\n
is_strictly_left_of.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.normalized","title":"normalized","text":"normalized(page_size: Size)\n
normalized.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps","title":"overlaps","text":"overlaps(other: BoundingBox) -> bool\n
overlaps.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_horizontally","title":"overlaps_horizontally","text":"overlaps_horizontally(other: BoundingBox) -> bool\n
Check if two bounding boxes overlap horizontally.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_vertically","title":"overlaps_vertically","text":"overlaps_vertically(other: BoundingBox) -> bool\n
Check if two bounding boxes overlap vertically.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_vertically_with_iou","title":"overlaps_vertically_with_iou","text":"overlaps_vertically_with_iou(other: BoundingBox, iou: float) -> bool\n
overlaps_vertically_with_iou.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.resize_by_scale","title":"resize_by_scale","text":"resize_by_scale(x_scale: float, y_scale: float)\n
resize_by_scale.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.scale_to_size","title":"scale_to_size","text":"scale_to_size(old_size: Size, new_size: Size)\n
scale_to_size.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.scaled","title":"scaled","text":"scaled(scale: float)\n
scaled.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.to_bottom_left_origin","title":"to_bottom_left_origin","text":"to_bottom_left_origin(page_height: float) -> BoundingBox\n
to_bottom_left_origin.
Parameters:
page_height
(float
) \u2013 to_top_left_origin(page_height: float) -> BoundingBox\n
to_top_left_origin.
Parameters:
page_height
(float
) \u2013 union_area_with(other: BoundingBox) -> float\n
Calculates the union area with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.x_overlap_with","title":"x_overlap_with","text":"x_overlap_with(other: BoundingBox) -> float\n
Calculates the horizontal overlap with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.x_union_with","title":"x_union_with","text":"x_union_with(other: BoundingBox) -> float\n
Calculates the horizontal union dimension with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.y_overlap_with","title":"y_overlap_with","text":"y_overlap_with(other: BoundingBox) -> float\n
Calculates the vertical overlap with another bounding box, respecting coordinate origin.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.y_union_with","title":"y_union_with","text":"y_union_with(other: BoundingBox) -> float\n
Calculates the vertical union dimension with another bounding box, respecting coordinate origin.
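To illustrate a few of these helpers together (a sketch; coordinates and page size are arbitrary, and the tuple passed to from_tuple is assumed to be in (l, t, r, b) order for a top-left origin):

from docling_core.types.doc import BoundingBox, CoordOrigin, Size

a = BoundingBox(l=10, t=20, r=110, b=70, coord_origin=CoordOrigin.TOPLEFT)
b = BoundingBox.from_tuple((50, 40, 150, 90), origin=CoordOrigin.TOPLEFT)

print(a.area())                         # 100 * 50 = 5000.0
print(a.intersection_over_union(b))     # IoU in [0, 1]
print(a.overlaps_horizontally(b))       # True, the x ranges overlap
flipped = a.to_bottom_left_origin(page_height=842)          # same box, BOTTOMLEFT coordinates
norm = a.normalized(page_size=Size(width=595, height=842))  # coordinates relative to the page size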
"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin","title":"CoordOrigin","text":" Bases: str
, Enum
CoordOrigin.
Attributes:
BOTTOMLEFT
\u2013 TOPLEFT
\u2013 BOTTOMLEFT = 'BOTTOMLEFT'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin.TOPLEFT","title":"TOPLEFT","text":"TOPLEFT = 'TOPLEFT'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode","title":"ImageRefMode","text":" Bases: str
, Enum
ImageRefMode.
Attributes:
EMBEDDED
\u2013 PLACEHOLDER
\u2013 REFERENCED
\u2013 EMBEDDED = 'embedded'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.PLACEHOLDER","title":"PLACEHOLDER","text":"PLACEHOLDER = 'placeholder'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.REFERENCED","title":"REFERENCED","text":"REFERENCED = 'referenced'\n
"},{"location":"reference/docling_document/#docling_core.types.doc.Size","title":"Size","text":" Bases: BaseModel
Size.
Methods:
as_tuple
\u2013 as_tuple.
Attributes:
height
(float
) \u2013 width
(float
) \u2013 height: float = 0.0\n
"},{"location":"reference/docling_document/#docling_core.types.doc.Size.width","title":"width","text":"width: float = 0.0\n
"},{"location":"reference/docling_document/#docling_core.types.doc.Size.as_tuple","title":"as_tuple","text":"as_tuple()\n
as_tuple.
"},{"location":"reference/document_converter/","title":"Document converter","text":"This is an automatic generated API reference of the main components of Docling.
"},{"location":"reference/document_converter/#docling.document_converter","title":"document_converter","text":"Classes:
DocumentConverter
\u2013 ConversionResult
\u2013 ConversionStatus
\u2013 FormatOption
\u2013 InputFormat
\u2013 A document format supported by document backend parsers.
PdfFormatOption
\u2013 ImageFormatOption
\u2013 StandardPdfPipeline
\u2013 WordFormatOption
\u2013 PowerpointFormatOption
\u2013 MarkdownFormatOption
\u2013 AsciiDocFormatOption
\u2013 HTMLFormatOption
\u2013 SimplePipeline
\u2013 SimplePipeline.
DocumentConverter(allowed_formats: Optional[List[InputFormat]] = None, format_options: Optional[Dict[InputFormat, FormatOption]] = None)\n
Methods:
convert
\u2013 convert_all
\u2013 initialize_pipeline
\u2013 Initialize the conversion pipeline for the selected format.
Attributes:
allowed_formats
\u2013 format_to_options
\u2013 initialized_pipelines
(Dict[Tuple[Type[BasePipeline], str], BasePipeline]
) \u2013 instance-attribute
","text":"allowed_formats = allowed_formats if allowed_formats is not None else list(InputFormat)\n
"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.format_to_options","title":"format_to_options instance-attribute
","text":"format_to_options = {format: (_get_default_option(format=format) if (custom_option := (get(format))) is None else custom_option)for format in (allowed_formats)}\n
"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.initialized_pipelines","title":"initialized_pipelines instance-attribute
","text":"initialized_pipelines: Dict[Tuple[Type[BasePipeline], str], BasePipeline] = {}\n
"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert","title":"convert","text":"convert(source: Union[Path, str, DocumentStream], headers: Optional[Dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> ConversionResult\n
"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert_all","title":"convert_all","text":"convert_all(source: Iterable[Union[Path, str, DocumentStream]], headers: Optional[Dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> Iterator[ConversionResult]\n
"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.initialize_pipeline","title":"initialize_pipeline","text":"initialize_pipeline(format: InputFormat)\n
Initialize the conversion pipeline for the selected format.
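A minimal usage sketch of the conversion entry points above; "report.pdf" is a placeholder path and the page cap is arbitrary:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf", max_num_pages=50)   # single source, capped at 50 pages
print(result.status)
print(result.document.export_to_markdown())                  # serialize the resulting DoclingDocument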
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult","title":"ConversionResult","text":" Bases: BaseModel
Attributes:
assembled
(AssembledUnit
) \u2013 confidence
(ConfidenceReport
) \u2013 document
(DoclingDocument
) \u2013 errors
(List[ErrorItem]
) \u2013 input
(InputDocument
) \u2013 legacy_document
\u2013 pages
(List[Page]
) \u2013 status
(ConversionStatus
) \u2013 timings
(Dict[str, ProfilingItem]
) \u2013 class-attribute
instance-attribute
","text":"assembled: AssembledUnit = AssembledUnit()\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.confidence","title":"confidence class-attribute
instance-attribute
","text":"confidence: ConfidenceReport = Field(default_factory=ConfidenceReport)\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.document","title":"document class-attribute
instance-attribute
","text":"document: DoclingDocument = _EMPTY_DOCLING_DOC\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.errors","title":"errors class-attribute
instance-attribute
","text":"errors: List[ErrorItem] = []\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.input","title":"input instance-attribute
","text":"input: InputDocument\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.legacy_document","title":"legacy_document property
","text":"legacy_document\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.pages","title":"pages class-attribute
instance-attribute
","text":"pages: List[Page] = []\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.status","title":"status class-attribute
instance-attribute
","text":"status: ConversionStatus = PENDING\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.timings","title":"timings class-attribute
instance-attribute
","text":"timings: Dict[str, ProfilingItem] = {}\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus","title":"ConversionStatus","text":" Bases: str
, Enum
Attributes:
FAILURE
\u2013 PARTIAL_SUCCESS
\u2013 PENDING
\u2013 SKIPPED
\u2013 STARTED
\u2013 SUCCESS
\u2013 class-attribute
instance-attribute
","text":"FAILURE = 'failure'\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.PARTIAL_SUCCESS","title":"PARTIAL_SUCCESS class-attribute
instance-attribute
","text":"PARTIAL_SUCCESS = 'partial_success'\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.PENDING","title":"PENDING class-attribute
instance-attribute
","text":"PENDING = 'pending'\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.SKIPPED","title":"SKIPPED class-attribute
instance-attribute
","text":"SKIPPED = 'skipped'\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.STARTED","title":"STARTED class-attribute
instance-attribute
","text":"STARTED = 'started'\n
"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.SUCCESS","title":"SUCCESS class-attribute
instance-attribute
","text":"SUCCESS = 'success'\n
"},{"location":"reference/document_converter/#docling.document_converter.FormatOption","title":"FormatOption","text":" Bases: BaseModel
Methods:
set_optional_field_default
\u2013 Attributes:
backend
(Type[AbstractDocumentBackend]
) \u2013 model_config
\u2013 pipeline_cls
(Type[BasePipeline]
) \u2013 pipeline_options
(Optional[PipelineOptions]
) \u2013 instance-attribute
","text":"backend: Type[AbstractDocumentBackend]\n
"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n
"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.pipeline_cls","title":"pipeline_cls instance-attribute
","text":"pipeline_cls: Type[BasePipeline]\n
"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.pipeline_options","title":"pipeline_options class-attribute
instance-attribute
","text":"pipeline_options: Optional[PipelineOptions] = None\n
"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> FormatOption\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat","title":"InputFormat","text":" Bases: str
, Enum
A document format supported by document backend parsers.
Attributes:
ASCIIDOC
\u2013 AUDIO
\u2013 CSV
\u2013 DOCX
\u2013 HTML
\u2013 IMAGE
\u2013 JSON_DOCLING
\u2013 MD
\u2013 PDF
\u2013 PPTX
\u2013 XLSX
\u2013 XML_JATS
\u2013 XML_USPTO
\u2013 class-attribute
instance-attribute
","text":"ASCIIDOC = 'asciidoc'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.AUDIO","title":"AUDIO class-attribute
instance-attribute
","text":"AUDIO = 'audio'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.CSV","title":"CSV class-attribute
instance-attribute
","text":"CSV = 'csv'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.DOCX","title":"DOCX class-attribute
instance-attribute
","text":"DOCX = 'docx'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.HTML","title":"HTML class-attribute
instance-attribute
","text":"HTML = 'html'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.IMAGE","title":"IMAGE class-attribute
instance-attribute
","text":"IMAGE = 'image'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.JSON_DOCLING","title":"JSON_DOCLING class-attribute
instance-attribute
","text":"JSON_DOCLING = 'json_docling'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.MD","title":"MD class-attribute
instance-attribute
","text":"MD = 'md'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.PDF","title":"PDF class-attribute
instance-attribute
","text":"PDF = 'pdf'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.PPTX","title":"PPTX class-attribute
instance-attribute
","text":"PPTX = 'pptx'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XLSX","title":"XLSX class-attribute
instance-attribute
","text":"XLSX = 'xlsx'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XML_JATS","title":"XML_JATS class-attribute
instance-attribute
","text":"XML_JATS = 'xml_jats'\n
"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XML_USPTO","title":"XML_USPTO class-attribute
instance-attribute
","text":"XML_USPTO = 'xml_uspto'\n
"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption","title":"PdfFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default
\u2013 Attributes:
backend
(Type[AbstractDocumentBackend]
) \u2013 model_config
\u2013 pipeline_cls
(Type
) \u2013 pipeline_options
(Optional[PipelineOptions]
) \u2013 class-attribute
instance-attribute
","text":"backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend\n
"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n
"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.pipeline_cls","title":"pipeline_cls class-attribute
instance-attribute
","text":"pipeline_cls: Type = StandardPdfPipeline\n
"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.pipeline_options","title":"pipeline_options class-attribute
instance-attribute
","text":"pipeline_options: Optional[PipelineOptions] = None\n
"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> FormatOption\n
"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption","title":"ImageFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default
\u2013 Attributes:
backend
(Type[AbstractDocumentBackend]
) \u2013 model_config
\u2013 pipeline_cls
(Type
) \u2013 pipeline_options
(Optional[PipelineOptions]
) \u2013 class-attribute
instance-attribute
","text":"backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend\n
"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n
"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.pipeline_cls","title":"pipeline_cls class-attribute
instance-attribute
","text":"pipeline_cls: Type = StandardPdfPipeline\n
"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.pipeline_options","title":"pipeline_options class-attribute
instance-attribute
","text":"pipeline_options: Optional[PipelineOptions] = None\n
"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> FormatOption\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline","title":"StandardPdfPipeline","text":"StandardPdfPipeline(pipeline_options: PdfPipelineOptions)\n
Bases: PaginatedPipeline
Methods:
download_models_hf
\u2013 execute
\u2013 get_default_options
\u2013 get_ocr_model
\u2013 get_picture_description_model
\u2013 initialize_page
\u2013 is_backend_supported
\u2013 Attributes:
build_pipe
\u2013 enrichment_pipe
\u2013 keep_backend
\u2013 keep_images
\u2013 pipeline_options
(PdfPipelineOptions
) \u2013 reading_order_model
\u2013 instance-attribute
","text":"build_pipe = [PagePreprocessingModel(options=PagePreprocessingOptions(images_scale=images_scale)), ocr_model, LayoutModel(artifacts_path=artifacts_path, accelerator_options=accelerator_options, options=layout_options), TableStructureModel(enabled=do_table_structure, artifacts_path=artifacts_path, options=table_structure_options, accelerator_options=accelerator_options), PageAssembleModel(options=PageAssembleOptions())]\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.enrichment_pipe","title":"enrichment_pipe instance-attribute
","text":"enrichment_pipe = [CodeFormulaModel(enabled=do_code_enrichment or do_formula_enrichment, artifacts_path=artifacts_path, options=CodeFormulaModelOptions(do_code_enrichment=do_code_enrichment, do_formula_enrichment=do_formula_enrichment), accelerator_options=accelerator_options), DocumentPictureClassifier(enabled=do_picture_classification, artifacts_path=artifacts_path, options=DocumentPictureClassifierOptions(), accelerator_options=accelerator_options), picture_description_model]\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.keep_backend","title":"keep_backend instance-attribute
","text":"keep_backend = True\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.keep_images","title":"keep_images instance-attribute
","text":"keep_images = generate_page_images or generate_picture_images or generate_table_images\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.pipeline_options","title":"pipeline_options instance-attribute
","text":"pipeline_options: PdfPipelineOptions\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.reading_order_model","title":"reading_order_model instance-attribute
","text":"reading_order_model = ReadingOrderModel(options=ReadingOrderOptions())\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.download_models_hf","title":"download_models_hf staticmethod
","text":"download_models_hf(local_dir: Optional[Path] = None, force: bool = False) -> Path\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.execute","title":"execute","text":"execute(in_doc: InputDocument, raises_on_error: bool) -> ConversionResult\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.get_default_options","title":"get_default_options classmethod
","text":"get_default_options() -> PdfPipelineOptions\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.get_ocr_model","title":"get_ocr_model","text":"get_ocr_model(artifacts_path: Optional[Path] = None) -> BaseOcrModel\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.get_picture_description_model","title":"get_picture_description_model","text":"get_picture_description_model(artifacts_path: Optional[Path] = None) -> Optional[PictureDescriptionBaseModel]\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.initialize_page","title":"initialize_page","text":"initialize_page(conv_res: ConversionResult, page: Page) -> Page\n
"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.is_backend_supported","title":"is_backend_supported classmethod
","text":"is_backend_supported(backend: AbstractDocumentBackend)\n
"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption","title":"WordFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default
\u2013 Attributes:
backend
(Type[AbstractDocumentBackend]
) \u2013 model_config
\u2013 pipeline_cls
(Type
) \u2013 pipeline_options
(Optional[PipelineOptions]
) \u2013 class-attribute
instance-attribute
","text":"backend: Type[AbstractDocumentBackend] = MsWordDocumentBackend\n
"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n
"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.pipeline_cls","title":"pipeline_cls class-attribute
instance-attribute
","text":"pipeline_cls: Type = SimplePipeline\n
"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.pipeline_options","title":"pipeline_options class-attribute
instance-attribute
","text":"pipeline_options: Optional[PipelineOptions] = None\n
"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> FormatOption\n
"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption","title":"PowerpointFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default
\u2013 Attributes:
backend
(Type[AbstractDocumentBackend]
) \u2013 model_config
\u2013 pipeline_cls
(Type
) \u2013 pipeline_options
(Optional[PipelineOptions]
) \u2013 class-attribute
instance-attribute
","text":"backend: Type[AbstractDocumentBackend] = MsPowerpointDocumentBackend\n
"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n
"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.pipeline_cls","title":"pipeline_cls class-attribute
instance-attribute
","text":"pipeline_cls: Type = SimplePipeline\n
"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.pipeline_options","title":"pipeline_options class-attribute
instance-attribute
","text":"pipeline_options: Optional[PipelineOptions] = None\n
"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> FormatOption\n
"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption","title":"MarkdownFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default
\u2013 Attributes:
backend
(Type[AbstractDocumentBackend]
) \u2013 model_config
\u2013 pipeline_cls
(Type
) \u2013 pipeline_options
(Optional[PipelineOptions]
) \u2013 class-attribute
instance-attribute
","text":"backend: Type[AbstractDocumentBackend] = MarkdownDocumentBackend\n
"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n
"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.pipeline_cls","title":"pipeline_cls class-attribute
instance-attribute
","text":"pipeline_cls: Type = SimplePipeline\n
"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.pipeline_options","title":"pipeline_options class-attribute
instance-attribute
","text":"pipeline_options: Optional[PipelineOptions] = None\n
"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> FormatOption\n
"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption","title":"AsciiDocFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default
\u2013 Attributes:
backend
(Type[AbstractDocumentBackend]
) \u2013 model_config
\u2013 pipeline_cls
(Type
) \u2013 pipeline_options
(Optional[PipelineOptions]
) \u2013 class-attribute
instance-attribute
","text":"backend: Type[AbstractDocumentBackend] = AsciiDocBackend\n
"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n
"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.pipeline_cls","title":"pipeline_cls class-attribute
instance-attribute
","text":"pipeline_cls: Type = SimplePipeline\n
"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.pipeline_options","title":"pipeline_options class-attribute
instance-attribute
","text":"pipeline_options: Optional[PipelineOptions] = None\n
"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> FormatOption\n
"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption","title":"HTMLFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default
\u2013 Attributes:
backend
(Type[AbstractDocumentBackend]
) \u2013 model_config
\u2013 pipeline_cls
(Type
) \u2013 pipeline_options
(Optional[PipelineOptions]
) \u2013 class-attribute
instance-attribute
","text":"backend: Type[AbstractDocumentBackend] = HTMLDocumentBackend\n
"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n
"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.pipeline_cls","title":"pipeline_cls class-attribute
instance-attribute
","text":"pipeline_cls: Type = SimplePipeline\n
"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.pipeline_options","title":"pipeline_options class-attribute
instance-attribute
","text":"pipeline_options: Optional[PipelineOptions] = None\n
"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> FormatOption\n
"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline","title":"SimplePipeline","text":"SimplePipeline(pipeline_options: PipelineOptions)\n
Bases: BasePipeline
SimpleModelPipeline.
This class is used at the moment for formats / backends which produce straight DoclingDocument output.
Methods:
execute
\u2013 get_default_options
\u2013 is_backend_supported
\u2013 Attributes:
build_pipe
(List[Callable]
) \u2013 enrichment_pipe
(List[GenericEnrichmentModel[Any]]
) \u2013 keep_images
\u2013 pipeline_options
\u2013 instance-attribute
","text":"build_pipe: List[Callable] = []\n
"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.enrichment_pipe","title":"enrichment_pipe instance-attribute
","text":"enrichment_pipe: List[GenericEnrichmentModel[Any]] = []\n
"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.keep_images","title":"keep_images instance-attribute
","text":"keep_images = False\n
"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.pipeline_options","title":"pipeline_options instance-attribute
","text":"pipeline_options = pipeline_options\n
"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.execute","title":"execute","text":"execute(in_doc: InputDocument, raises_on_error: bool) -> ConversionResult\n
"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.get_default_options","title":"get_default_options classmethod
","text":"get_default_options() -> PipelineOptions\n
"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.is_backend_supported","title":"is_backend_supported classmethod
","text":"is_backend_supported(backend: AbstractDocumentBackend)\n
"},{"location":"reference/pipeline_options/","title":"Pipeline options","text":"Pipeline options allow to customize the execution of the models during the conversion pipeline. This includes options for the OCR engines, the table model as well as enrichment options which can be enabled with do_xyz = True
.
This is an automatically generated API reference of all the pipeline options available in Docling.
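For instance, a minimal sketch combining a few of the options documented below (choosing an OCR engine, switching the TableFormer mode and enabling one enrichment flag):
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, TesseractCliOcrOptions\n\n# Choose the Tesseract CLI OCR engine, use the faster table model and enable formula enrichment.\npipeline_options = PdfPipelineOptions()\npipeline_options.ocr_options = TesseractCliOcrOptions(lang=[\"eng\"])\npipeline_options.table_structure_options.mode = TableFormerMode.FAST\npipeline_options.do_formula_enrichment = True\n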
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options","title":"pipeline_options","text":"Classes:
AsrPipelineOptions
\u2013 BaseOptions
\u2013 Base class for options.
EasyOcrOptions
\u2013 Options for the EasyOCR engine.
LayoutOptions
\u2013 Options for layout processing.
OcrEngine
\u2013 Enum of valid OCR engines.
OcrMacOptions
\u2013 Options for the Mac OCR engine.
OcrOptions
\u2013 OCR options.
PaginatedPipelineOptions
\u2013 PdfBackend
\u2013 Enum of valid PDF backends.
PdfPipelineOptions
\u2013 Options for the PDF pipeline.
PictureDescriptionApiOptions
\u2013 PictureDescriptionBaseOptions
\u2013 PictureDescriptionVlmOptions
\u2013 PipelineOptions
\u2013 Base pipeline options.
ProcessingPipeline
\u2013 RapidOcrOptions
\u2013 Options for the RapidOCR engine.
TableFormerMode
\u2013 Modes for the TableFormer model.
TableStructureOptions
\u2013 Options for the table structure.
TesseractCliOcrOptions
\u2013 Options for the TesseractCli engine.
TesseractOcrOptions
\u2013 Options for the Tesseract engine.
VlmPipelineOptions
\u2013 Attributes:
granite_picture_description
\u2013 smolvlm_picture_description
\u2013 module-attribute
","text":"granite_picture_description = PictureDescriptionVlmOptions(repo_id='ibm-granite/granite-vision-3.3-2b', prompt='What is shown in this image?')\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.smolvlm_picture_description","title":"smolvlm_picture_description module-attribute
","text":"smolvlm_picture_description = PictureDescriptionVlmOptions(repo_id='HuggingFaceTB/SmolVLM-256M-Instruct')\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions","title":"AsrPipelineOptions","text":" Bases: PipelineOptions
Attributes:
accelerator_options
(AcceleratorOptions
) \u2013 allow_external_plugins
(bool
) \u2013 artifacts_path
(Optional[Union[Path, str]]
) \u2013 asr_options
(Union[InlineAsrOptions]
) \u2013 create_legacy_output
(bool
) \u2013 document_timeout
(Optional[float]
) \u2013 enable_remote_services
(bool
) \u2013 class-attribute
instance-attribute
","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute
instance-attribute
","text":"allow_external_plugins: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.artifacts_path","title":"artifacts_path class-attribute
instance-attribute
","text":"artifacts_path: Optional[Union[Path, str]] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.asr_options","title":"asr_options class-attribute
instance-attribute
","text":"asr_options: Union[InlineAsrOptions] = WHISPER_TINY\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.create_legacy_output","title":"create_legacy_output class-attribute
instance-attribute
","text":"create_legacy_output: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.document_timeout","title":"document_timeout class-attribute
instance-attribute
","text":"document_timeout: Optional[float] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute
instance-attribute
","text":"enable_remote_services: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseOptions","title":"BaseOptions","text":" Bases: BaseModel
Base class for options.
Attributes:
kind
(str
) \u2013 class-attribute
","text":"kind: str\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions","title":"EasyOcrOptions","text":" Bases: OcrOptions
Options for the EasyOCR engine.
Attributes:
bitmap_area_threshold
(float
) \u2013 confidence_threshold
(float
) \u2013 download_enabled
(bool
) \u2013 force_full_page_ocr
(bool
) \u2013 kind
(Literal['easyocr']
) \u2013 lang
(List[str]
) \u2013 model_config
\u2013 model_storage_directory
(Optional[str]
) \u2013 recog_network
(Optional[str]
) \u2013 use_gpu
(Optional[bool]
) \u2013 class-attribute
instance-attribute
","text":"bitmap_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.confidence_threshold","title":"confidence_threshold class-attribute
instance-attribute
","text":"confidence_threshold: float = 0.5\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.download_enabled","title":"download_enabled class-attribute
instance-attribute
","text":"download_enabled: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute
instance-attribute
","text":"force_full_page_ocr: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.kind","title":"kind class-attribute
","text":"kind: Literal['easyocr'] = 'easyocr'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.lang","title":"lang class-attribute
instance-attribute
","text":"lang: List[str] = ['fr', 'de', 'es', 'en']\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='forbid', protected_namespaces=())\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.model_storage_directory","title":"model_storage_directory class-attribute
instance-attribute
","text":"model_storage_directory: Optional[str] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.recog_network","title":"recog_network class-attribute
instance-attribute
","text":"recog_network: Optional[str] = 'standard'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.use_gpu","title":"use_gpu class-attribute
instance-attribute
","text":"use_gpu: Optional[bool] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions","title":"LayoutOptions","text":" Bases: BaseModel
Options for layout processing.
Attributes:
create_orphan_clusters
(bool
) \u2013 keep_empty_clusters
(bool
) \u2013 model_spec
(LayoutModelConfig
) \u2013 class-attribute
instance-attribute
","text":"create_orphan_clusters: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.keep_empty_clusters","title":"keep_empty_clusters class-attribute
instance-attribute
","text":"keep_empty_clusters: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.model_spec","title":"model_spec class-attribute
instance-attribute
","text":"model_spec: LayoutModelConfig = DOCLING_LAYOUT_V2\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine","title":"OcrEngine","text":" Bases: str
, Enum
Enum of valid OCR engines.
Attributes:
EASYOCR
\u2013 OCRMAC
\u2013 RAPIDOCR
\u2013 TESSERACT
\u2013 TESSERACT_CLI
\u2013 class-attribute
instance-attribute
","text":"EASYOCR = 'easyocr'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.OCRMAC","title":"OCRMAC class-attribute
instance-attribute
","text":"OCRMAC = 'ocrmac'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.RAPIDOCR","title":"RAPIDOCR class-attribute
instance-attribute
","text":"RAPIDOCR = 'rapidocr'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.TESSERACT","title":"TESSERACT class-attribute
instance-attribute
","text":"TESSERACT = 'tesseract'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.TESSERACT_CLI","title":"TESSERACT_CLI class-attribute
instance-attribute
","text":"TESSERACT_CLI = 'tesseract_cli'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions","title":"OcrMacOptions","text":" Bases: OcrOptions
Options for the Mac OCR engine.
Attributes:
bitmap_area_threshold
(float
) \u2013 force_full_page_ocr
(bool
) \u2013 framework
(str
) \u2013 kind
(Literal['ocrmac']
) \u2013 lang
(List[str]
) \u2013 model_config
\u2013 recognition
(str
) \u2013 class-attribute
instance-attribute
","text":"bitmap_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute
instance-attribute
","text":"force_full_page_ocr: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.framework","title":"framework class-attribute
instance-attribute
","text":"framework: str = 'vision'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.kind","title":"kind class-attribute
","text":"kind: Literal['ocrmac'] = 'ocrmac'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.lang","title":"lang class-attribute
instance-attribute
","text":"lang: List[str] = ['fr-FR', 'de-DE', 'es-ES', 'en-US']\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.recognition","title":"recognition class-attribute
instance-attribute
","text":"recognition: str = 'accurate'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions","title":"OcrOptions","text":" Bases: BaseOptions
OCR options.
Attributes:
bitmap_area_threshold
(float
) \u2013 force_full_page_ocr
(bool
) \u2013 kind
(str
) \u2013 lang
(List[str]
) \u2013 class-attribute
instance-attribute
","text":"bitmap_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute
instance-attribute
","text":"force_full_page_ocr: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.kind","title":"kind class-attribute
","text":"kind: str\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.lang","title":"lang instance-attribute
","text":"lang: List[str]\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions","title":"PaginatedPipelineOptions","text":" Bases: PipelineOptions
Attributes:
accelerator_options
(AcceleratorOptions
) \u2013 allow_external_plugins
(bool
) \u2013 artifacts_path
(Optional[Union[Path, str]]
) \u2013 create_legacy_output
(bool
) \u2013 document_timeout
(Optional[float]
) \u2013 enable_remote_services
(bool
) \u2013 generate_page_images
(bool
) \u2013 generate_picture_images
(bool
) \u2013 images_scale
(float
) \u2013 class-attribute
instance-attribute
","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute
instance-attribute
","text":"allow_external_plugins: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.artifacts_path","title":"artifacts_path class-attribute
instance-attribute
","text":"artifacts_path: Optional[Union[Path, str]] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.create_legacy_output","title":"create_legacy_output class-attribute
instance-attribute
","text":"create_legacy_output: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.document_timeout","title":"document_timeout class-attribute
instance-attribute
","text":"document_timeout: Optional[float] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute
instance-attribute
","text":"enable_remote_services: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.generate_page_images","title":"generate_page_images class-attribute
instance-attribute
","text":"generate_page_images: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute
instance-attribute
","text":"generate_picture_images: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.images_scale","title":"images_scale class-attribute
instance-attribute
","text":"images_scale: float = 1.0\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend","title":"PdfBackend","text":" Bases: str
, Enum
Enum of valid PDF backends.
Attributes:
DLPARSE_V1
\u2013 DLPARSE_V2
\u2013 DLPARSE_V4
\u2013 PYPDFIUM2
\u2013 class-attribute
instance-attribute
","text":"DLPARSE_V1 = 'dlparse_v1'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V2","title":"DLPARSE_V2 class-attribute
instance-attribute
","text":"DLPARSE_V2 = 'dlparse_v2'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V4","title":"DLPARSE_V4 class-attribute
instance-attribute
","text":"DLPARSE_V4 = 'dlparse_v4'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.PYPDFIUM2","title":"PYPDFIUM2 class-attribute
instance-attribute
","text":"PYPDFIUM2 = 'pypdfium2'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions","title":"PdfPipelineOptions","text":" Bases: PaginatedPipelineOptions
Options for the PDF pipeline.
Attributes:
accelerator_options
(AcceleratorOptions
) \u2013 allow_external_plugins
(bool
) \u2013 artifacts_path
(Optional[Union[Path, str]]
) \u2013 create_legacy_output
(bool
) \u2013 do_code_enrichment
(bool
) \u2013 do_formula_enrichment
(bool
) \u2013 do_ocr
(bool
) \u2013 do_picture_classification
(bool
) \u2013 do_picture_description
(bool
) \u2013 do_table_structure
(bool
) \u2013 document_timeout
(Optional[float]
) \u2013 enable_remote_services
(bool
) \u2013 force_backend_text
(bool
) \u2013 generate_page_images
(bool
) \u2013 generate_parsed_pages
(Literal[True]
) \u2013 generate_picture_images
(bool
) \u2013 generate_table_images
(bool
) \u2013 images_scale
(float
) \u2013 layout_options
(LayoutOptions
) \u2013 ocr_options
(OcrOptions
) \u2013 picture_description_options
(PictureDescriptionBaseOptions
) \u2013 table_structure_options
(TableStructureOptions
) \u2013 class-attribute
instance-attribute
","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute
instance-attribute
","text":"allow_external_plugins: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.artifacts_path","title":"artifacts_path class-attribute
instance-attribute
","text":"artifacts_path: Optional[Union[Path, str]] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.create_legacy_output","title":"create_legacy_output class-attribute
instance-attribute
","text":"create_legacy_output: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_code_enrichment","title":"do_code_enrichment class-attribute
instance-attribute
","text":"do_code_enrichment: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_formula_enrichment","title":"do_formula_enrichment class-attribute
instance-attribute
","text":"do_formula_enrichment: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_ocr","title":"do_ocr class-attribute
instance-attribute
","text":"do_ocr: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute
instance-attribute
","text":"do_picture_classification: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_picture_description","title":"do_picture_description class-attribute
instance-attribute
","text":"do_picture_description: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_table_structure","title":"do_table_structure class-attribute
instance-attribute
","text":"do_table_structure: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.document_timeout","title":"document_timeout class-attribute
instance-attribute
","text":"document_timeout: Optional[float] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute
instance-attribute
","text":"enable_remote_services: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.force_backend_text","title":"force_backend_text class-attribute
instance-attribute
","text":"force_backend_text: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_page_images","title":"generate_page_images class-attribute
instance-attribute
","text":"generate_page_images: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_parsed_pages","title":"generate_parsed_pages class-attribute
instance-attribute
","text":"generate_parsed_pages: Literal[True] = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute
instance-attribute
","text":"generate_picture_images: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_table_images","title":"generate_table_images class-attribute
instance-attribute
","text":"generate_table_images: bool = Field(default=False, deprecated='Field `generate_table_images` is deprecated. To obtain table images, set `PdfPipelineOptions.generate_page_images = True` before conversion and then use the `TableItem.get_image` function.')\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.images_scale","title":"images_scale class-attribute
instance-attribute
","text":"images_scale: float = 1.0\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.layout_options","title":"layout_options class-attribute
instance-attribute
","text":"layout_options: LayoutOptions = LayoutOptions()\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.ocr_options","title":"ocr_options class-attribute
instance-attribute
","text":"ocr_options: OcrOptions = EasyOcrOptions()\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.picture_description_options","title":"picture_description_options class-attribute
instance-attribute
","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.table_structure_options","title":"table_structure_options class-attribute
instance-attribute
","text":"table_structure_options: TableStructureOptions = TableStructureOptions()\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions","title":"PictureDescriptionApiOptions","text":" Bases: PictureDescriptionBaseOptions
Attributes:
batch_size
(int
) \u2013 concurrency
(int
) \u2013 headers
(Dict[str, str]
) \u2013 kind
(Literal['api']
) \u2013 params
(Dict[str, Any]
) \u2013 picture_area_threshold
(float
) \u2013 prompt
(str
) \u2013 provenance
(str
) \u2013 scale
(float
) \u2013 timeout
(float
) \u2013 url
(AnyUrl
) \u2013 class-attribute
instance-attribute
","text":"batch_size: int = 8\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.concurrency","title":"concurrency class-attribute
instance-attribute
","text":"concurrency: int = 1\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.headers","title":"headers class-attribute
instance-attribute
","text":"headers: Dict[str, str] = {}\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.kind","title":"kind class-attribute
","text":"kind: Literal['api'] = 'api'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.params","title":"params class-attribute
instance-attribute
","text":"params: Dict[str, Any] = {}\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.picture_area_threshold","title":"picture_area_threshold class-attribute
instance-attribute
","text":"picture_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.prompt","title":"prompt class-attribute
instance-attribute
","text":"prompt: str = 'Describe this image in a few sentences.'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.provenance","title":"provenance class-attribute
instance-attribute
","text":"provenance: str = ''\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.scale","title":"scale class-attribute
instance-attribute
","text":"scale: float = 2\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.timeout","title":"timeout class-attribute
instance-attribute
","text":"timeout: float = 20\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.url","title":"url class-attribute
instance-attribute
","text":"url: AnyUrl = AnyUrl('http://localhost:8000/v1/chat/completions')\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions","title":"PictureDescriptionBaseOptions","text":" Bases: BaseOptions
Attributes:
batch_size
(int
) \u2013 kind
(str
) \u2013 picture_area_threshold
(float
) \u2013 scale
(float
) \u2013 class-attribute
instance-attribute
","text":"batch_size: int = 8\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.kind","title":"kind class-attribute
","text":"kind: str\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.picture_area_threshold","title":"picture_area_threshold class-attribute
instance-attribute
","text":"picture_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.scale","title":"scale class-attribute
instance-attribute
","text":"scale: float = 2\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions","title":"PictureDescriptionVlmOptions","text":" Bases: PictureDescriptionBaseOptions
Attributes:
batch_size
(int
) \u2013 generation_config
(Dict[str, Any]
) \u2013 kind
(Literal['vlm']
) \u2013 picture_area_threshold
(float
) \u2013 prompt
(str
) \u2013 repo_cache_folder
(str
) \u2013 repo_id
(str
) \u2013 scale
(float
) \u2013 class-attribute
instance-attribute
","text":"batch_size: int = 8\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.generation_config","title":"generation_config class-attribute
instance-attribute
","text":"generation_config: Dict[str, Any] = dict(max_new_tokens=200, do_sample=False)\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.kind","title":"kind class-attribute
","text":"kind: Literal['vlm'] = 'vlm'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.picture_area_threshold","title":"picture_area_threshold class-attribute
instance-attribute
","text":"picture_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.prompt","title":"prompt class-attribute
instance-attribute
","text":"prompt: str = 'Describe this image in a few sentences.'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.repo_cache_folder","title":"repo_cache_folder property
","text":"repo_cache_folder: str\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.repo_id","title":"repo_id instance-attribute
","text":"repo_id: str\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.scale","title":"scale class-attribute
instance-attribute
","text":"scale: float = 2\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions","title":"PipelineOptions","text":" Bases: BaseModel
Base pipeline options.
Attributes:
accelerator_options
(AcceleratorOptions
) \u2013 allow_external_plugins
(bool
) \u2013 create_legacy_output
(bool
) \u2013 document_timeout
(Optional[float]
) \u2013 enable_remote_services
(bool
) \u2013 class-attribute
instance-attribute
","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute
instance-attribute
","text":"allow_external_plugins: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.create_legacy_output","title":"create_legacy_output class-attribute
instance-attribute
","text":"create_legacy_output: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.document_timeout","title":"document_timeout class-attribute
instance-attribute
","text":"document_timeout: Optional[float] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute
instance-attribute
","text":"enable_remote_services: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline","title":"ProcessingPipeline","text":" Bases: str
, Enum
Attributes:
ASR
\u2013 STANDARD
\u2013 VLM
\u2013 class-attribute
instance-attribute
","text":"ASR = 'asr'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.STANDARD","title":"STANDARD class-attribute
instance-attribute
","text":"STANDARD = 'standard'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.VLM","title":"VLM class-attribute
instance-attribute
","text":"VLM = 'vlm'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions","title":"RapidOcrOptions","text":" Bases: OcrOptions
Options for the RapidOCR engine.
Attributes:
bitmap_area_threshold
(float
) \u2013 cls_model_path
(Optional[str]
) \u2013 det_model_path
(Optional[str]
) \u2013 force_full_page_ocr
(bool
) \u2013 kind
(Literal['rapidocr']
) \u2013 lang
(List[str]
) \u2013 model_config
\u2013 print_verbose
(bool
) \u2013 rec_keys_path
(Optional[str]
) \u2013 rec_model_path
(Optional[str]
) \u2013 text_score
(float
) \u2013 use_cls
(Optional[bool]
) \u2013 use_det
(Optional[bool]
) \u2013 use_rec
(Optional[bool]
) \u2013 class-attribute
instance-attribute
","text":"bitmap_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.cls_model_path","title":"cls_model_path class-attribute
instance-attribute
","text":"cls_model_path: Optional[str] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.det_model_path","title":"det_model_path class-attribute
instance-attribute
","text":"det_model_path: Optional[str] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute
instance-attribute
","text":"force_full_page_ocr: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.kind","title":"kind class-attribute
","text":"kind: Literal['rapidocr'] = 'rapidocr'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.lang","title":"lang class-attribute
instance-attribute
","text":"lang: List[str] = ['english', 'chinese']\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.print_verbose","title":"print_verbose class-attribute
instance-attribute
","text":"print_verbose: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_keys_path","title":"rec_keys_path class-attribute
instance-attribute
","text":"rec_keys_path: Optional[str] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_model_path","title":"rec_model_path class-attribute
instance-attribute
","text":"rec_model_path: Optional[str] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.text_score","title":"text_score class-attribute
instance-attribute
","text":"text_score: float = 0.5\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_cls","title":"use_cls class-attribute
instance-attribute
","text":"use_cls: Optional[bool] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_det","title":"use_det class-attribute
instance-attribute
","text":"use_det: Optional[bool] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_rec","title":"use_rec class-attribute
instance-attribute
","text":"use_rec: Optional[bool] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode","title":"TableFormerMode","text":" Bases: str
, Enum
Modes for the TableFormer model.
Attributes:
ACCURATE
\u2013 FAST
\u2013 class-attribute
instance-attribute
","text":"ACCURATE = 'accurate'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode.FAST","title":"FAST class-attribute
instance-attribute
","text":"FAST = 'fast'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions","title":"TableStructureOptions","text":" Bases: BaseModel
Options for the table structure.
Attributes:
do_cell_matching
(bool
) \u2013 mode
(TableFormerMode
) \u2013 class-attribute
instance-attribute
","text":"do_cell_matching: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions.mode","title":"mode class-attribute
instance-attribute
","text":"mode: TableFormerMode = ACCURATE\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions","title":"TesseractCliOcrOptions","text":" Bases: OcrOptions
Options for the TesseractCli engine.
Attributes:
bitmap_area_threshold
(float
) \u2013 force_full_page_ocr
(bool
) \u2013 kind
(Literal['tesseract']
) \u2013 lang
(List[str]
) \u2013 model_config
\u2013 path
(Optional[str]
) \u2013 tesseract_cmd
(str
) \u2013 class-attribute
instance-attribute
","text":"bitmap_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute
instance-attribute
","text":"force_full_page_ocr: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.kind","title":"kind class-attribute
","text":"kind: Literal['tesseract'] = 'tesseract'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.lang","title":"lang class-attribute
instance-attribute
","text":"lang: List[str] = ['fra', 'deu', 'spa', 'eng']\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.tesseract_cmd","title":"tesseract_cmd class-attribute
instance-attribute
","text":"tesseract_cmd: str = 'tesseract'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions","title":"TesseractOcrOptions","text":" Bases: OcrOptions
Options for the Tesseract engine.
Attributes:
bitmap_area_threshold
(float
) \u2013 force_full_page_ocr
(bool
) \u2013 kind
(Literal['tesserocr']
) \u2013 lang
(List[str]
) \u2013 model_config
\u2013 path
(Optional[str]
) \u2013 class-attribute
instance-attribute
","text":"bitmap_area_threshold: float = 0.05\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute
instance-attribute
","text":"force_full_page_ocr: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.kind","title":"kind class-attribute
","text":"kind: Literal['tesserocr'] = 'tesserocr'\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.lang","title":"lang class-attribute
instance-attribute
","text":"lang: List[str] = ['fra', 'deu', 'spa', 'eng']\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.model_config","title":"model_config class-attribute
instance-attribute
","text":"model_config = ConfigDict(extra='forbid')\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.path","title":"path class-attribute
instance-attribute
","text":"path: Optional[str] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions","title":"VlmPipelineOptions","text":" Bases: PaginatedPipelineOptions
Attributes:
accelerator_options
(AcceleratorOptions
) \u2013 allow_external_plugins
(bool
) \u2013 artifacts_path
(Optional[Union[Path, str]]
) \u2013 create_legacy_output
(bool
) \u2013 document_timeout
(Optional[float]
) \u2013 enable_remote_services
(bool
) \u2013 force_backend_text
(bool
) \u2013 generate_page_images
(bool
) \u2013 generate_picture_images
(bool
) \u2013 images_scale
(float
) \u2013 vlm_options
(Union[InlineVlmOptions, ApiVlmOptions]
) \u2013 class-attribute
instance-attribute
","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute
instance-attribute
","text":"allow_external_plugins: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.artifacts_path","title":"artifacts_path class-attribute
instance-attribute
","text":"artifacts_path: Optional[Union[Path, str]] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.create_legacy_output","title":"create_legacy_output class-attribute
instance-attribute
","text":"create_legacy_output: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.document_timeout","title":"document_timeout class-attribute
instance-attribute
","text":"document_timeout: Optional[float] = None\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute
instance-attribute
","text":"enable_remote_services: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.force_backend_text","title":"force_backend_text class-attribute
instance-attribute
","text":"force_backend_text: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.generate_page_images","title":"generate_page_images class-attribute
instance-attribute
","text":"generate_page_images: bool = True\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute
instance-attribute
","text":"generate_picture_images: bool = False\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.images_scale","title":"images_scale class-attribute
instance-attribute
","text":"images_scale: float = 1.0\n
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.vlm_options","title":"vlm_options class-attribute
instance-attribute
","text":"vlm_options: Union[InlineVlmOptions, ApiVlmOptions] = SMOLDOCLING_TRANSFORMERS\n
"},{"location":"usage/","title":"Usage","text":""},{"location":"usage/#conversion","title":"Conversion","text":""},{"location":"usage/#convert-a-single-document","title":"Convert a single document","text":"To convert individual PDF documents, use convert()
, for example:
from docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\" # PDF path or URL\nconverter = DocumentConverter()\nresult = converter.convert(source)\nprint(result.document.export_to_markdown()) # output: \"### Docling Technical Report[...]\"\n
"},{"location":"usage/#cli","title":"CLI","text":"You can also use Docling directly from your command line to convert individual files \u2014be it local or by URL\u2014 or whole directories.
docling https://arxiv.org/pdf/2206.01062\n
You can also use \ud83e\udd5aSmolDocling and other VLMs via Docling CLI: docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062\n
This will use MLX acceleration on supported Apple Silicon hardware. To see all available options (export formats etc.) run docling --help
. More details in the CLI reference page.
By default, models are downloaded automatically upon first usage. If you would prefer to explicitly prefetch them for offline use (e.g. in air-gapped environments) you can do that as follows:
Step 1: Prefetch the models
Use the docling-tools models download
utility:
$ docling-tools models download\nDownloading layout model...\nDownloading tableformer model...\nDownloading picture classifier model...\nDownloading code formula model...\nDownloading easyocr models...\nModels downloaded into $HOME/.cache/docling/models.\n
Alternatively, models can be programmatically downloaded using docling.utils.model_downloader.download_models()
.
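For example, a minimal sketch of the programmatic route, which downloads the default models into the local cache:
from docling.utils.model_downloader import download_models\n\n# Prefetch the default models so that later conversions can run offline.\ndownload_models()\n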
Step 2: Use the prefetched models
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\nartifacts_path = \"/local/path/to/models\"\n\npipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n
Or using the CLI:
docling --artifacts-path=\"/local/path/to/models\" FILE\n
Or using the DOCLING_ARTIFACTS_PATH
environment variable:
export DOCLING_ARTIFACTS_PATH=\"/local/path/to/models\"\npython my_docling_script.py\n
"},{"location":"usage/#using-remote-services","title":"Using remote services","text":"The main purpose of Docling is to run local models which are not sharing any user data with remote services. Anyhow, there are valid use cases for processing part of the pipeline using remote services, for example invoking OCR engines from cloud vendors or the usage of hosted LLMs.
In Docling we decided to allow such models, but we require the user to explicitly opt in to communicating with external services.
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions(enable_remote_services=True)\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n
When the value enable_remote_services=True
is not set, the system will raise an exception OperationNotAllowed()
.
Note: This option is only related to the system sending user data to remote services. Control of pulling data (e.g. model weights) follows the logic described in Model prefetching and offline usage.
"},{"location":"usage/#list-of-remote-model-services","title":"List of remote model services","text":"The options in this list require the explicit enable_remote_services=True
when processing the documents.
PictureDescriptionApiOptions
: Using vision models via API calls. The example file custom_convert.py contains multiple ways one can adjust the conversion pipeline and features.
"},{"location":"usage/#control-pdf-table-extraction-options","title":"Control PDF table extraction options","text":"You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself. This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions(do_table_structure=True)\npipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n
Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between TableFormerMode.FAST
(faster but less accurate) and TableFormerMode.ACCURATE
(default) for better quality with difficult table structures.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode\n\npipeline_options = PdfPipelineOptions(do_table_structure=True)\npipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n
"},{"location":"usage/#impose-limits-on-the-document-size","title":"Impose limits on the document size","text":"You can limit the file size and number of pages which should be allowed to process per document:
from pathlib import Path\nfrom docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\"\nconverter = DocumentConverter()\nresult = converter.convert(source, max_num_pages=100, max_file_size=20971520)\n
"},{"location":"usage/#convert-from-binary-pdf-streams","title":"Convert from binary PDF streams","text":"You can convert PDFs from a binary stream instead of from the filesystem as follows:
from io import BytesIO\nfrom docling.datamodel.base_models import DocumentStream\nfrom docling.document_converter import DocumentConverter\n\nbuf = BytesIO(your_binary_stream)\nsource = DocumentStream(name=\"my_doc.pdf\", stream=buf)\nconverter = DocumentConverter()\nresult = converter.convert(source)\n
"},{"location":"usage/#limit-resource-usage","title":"Limit resource usage","text":"You can limit the CPU threads used by Docling by setting the environment variable OMP_NUM_THREADS
accordingly. The default is 4 CPU threads.
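As a sketch, the variable can also be set from Python, assuming it is set before Docling and its native libraries are imported:
import os\n\n# Limit Docling to 2 CPU threads; this must happen before importing docling,\n# since the setting is read when the native libraries are loaded.\nos.environ[\"OMP_NUM_THREADS\"] = \"2\"\n\nfrom docling.document_converter import DocumentConverter\n\nresult = DocumentConverter().convert(\"https://arxiv.org/pdf/2206.01062\")\n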
Note
This section discusses directly invoking a backend, i.e. using a low-level API. This should only be done when necessary. For most cases, using a DocumentConverter
(high-level API) as discussed in the sections above should suffice\u00a0\u2014\u00a0and is the recommended way.
By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of supported formats). You can restrict the DocumentConverter
to a set of allowed document formats, as shown in the Multi-format conversion example. Alternatively, you can also use the specific backend that matches your document content. For instance, you can use HTMLDocumentBackend
for HTML pages:
import urllib.request\nfrom io import BytesIO\nfrom docling.backend.html_backend import HTMLDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import InputDocument\n\nurl = \"https://en.wikipedia.org/wiki/Duck\"\ntext = urllib.request.urlopen(url).read()\nin_doc = InputDocument(\n path_or_stream=BytesIO(text),\n format=InputFormat.HTML,\n backend=HTMLDocumentBackend,\n filename=\"duck.html\",\n)\nbackend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))\ndl_doc = backend.convert()\nprint(dl_doc.export_to_markdown())\n
"},{"location":"usage/#chunking","title":"Chunking","text":"You can chunk a Docling document using a chunker, such as a HybridChunker
, as shown below (for more details check out this example):
from docling.document_converter import DocumentConverter\nfrom docling.chunking import HybridChunker\n\nconv_res = DocumentConverter().convert(\"https://arxiv.org/pdf/2206.01062\")\ndoc = conv_res.document\n\nchunker = HybridChunker(tokenizer=\"BAAI/bge-small-en-v1.5\") # set tokenizer as needed\nchunk_iter = chunker.chunk(doc)\n
An example chunk would look like this:
print(list(chunk_iter)[11])\n# {\n# \"text\": \"In this paper, we present the DocLayNet dataset. [...]\",\n# \"meta\": {\n# \"doc_items\": [{\n# \"self_ref\": \"#/texts/28\",\n# \"label\": \"text\",\n# \"prov\": [{\n# \"page_no\": 2,\n# \"bbox\": {\"l\": 53.29, \"t\": 287.14, \"r\": 295.56, \"b\": 212.37, ...},\n# }], ...,\n# }, ...],\n# \"headings\": [\"1 INTRODUCTION\"],\n# }\n# }\n
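The chunks can then be consumed directly, e.g. for embedding in a retrieval pipeline; a plain sketch based on the fields shown in the example output above:
for chunk in chunker.chunk(doc):\n    # Each chunk exposes its text and metadata (headings, doc items), as in the printed example.\n    print(chunk.meta.headings, chunk.text[:60])\n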
"},{"location":"usage/enrichments/","title":"Enrichment features","text":"Docling allows to enrich the conversion pipeline with additional steps which process specific document components, e.g. code blocks, pictures, etc. The extra steps usually require extra models executions which may increase the processing time consistently. For this reason most enrichment models are disabled by default.
The following table provides an overview of the default enrichment models available in Docling.
Feature Parameter Processed item Description
Code understanding do_code_enrichment CodeItem See docs below.
Formula understanding do_formula_enrichment TextItem with label FORMULA See docs below.
Picture classification do_picture_classification PictureItem See docs below.
Picture description do_picture_description PictureItem See docs below."},{"location":"usage/enrichments/#enrichments-details","title":"Enrichments details","text":""},{"location":"usage/enrichments/#code-understanding","title":"Code understanding","text":"The code understanding step enables advanced parsing of the code blocks found in the document. This enrichment model also sets the code_language
property of the CodeItem
.
Model specs: see the CodeFormula
model card.
Example command line:
docling --enrich-code FILE\n
Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_code_enrichment = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n
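The detected language can then be read back from the converted document; a sketch assuming the DoclingDocument.iterate_items() API and the CodeItem type from docling_core:
from docling_core.types.doc import CodeItem\n\n# Print the detected language of every code block (verify the docling_core API\n# against your installed version).\nfor item, _level in doc.iterate_items():\n    if isinstance(item, CodeItem):\n        print(item.code_language, \":\", item.text[:60])\n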
"},{"location":"usage/enrichments/#formula-understanding","title":"Formula understanding","text":"The formula understanding step will analize the equation formulas in documents and extract their LaTeX representation. The HTML export functions in the DoclingDocument will leverage the formula and visualize the result using the mathml html syntax.
Model specs: see the CodeFormula
model card.
Example command line:
docling --enrich-formula FILE\n
Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_formula_enrichment = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n
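To inspect the extracted formulas rendered as MathML, the document can then be exported to HTML; a sketch assuming the DoclingDocument save_as_html helper:
from pathlib import Path\n\n# Write an HTML export in which the enriched formulas are rendered as MathML.\ndoc.save_as_html(Path(\"formulas.html\"))\n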
"},{"location":"usage/enrichments/#picture-classification","title":"Picture classification","text":"The picture classification step classifies the PictureItem
elements in the document with the DocumentFigureClassifier
model. This model is specialized to understand the classes of pictures found in documents, e.g. different chart types, flow diagrams, logos, signatures, etc.
Model specs: see the DocumentFigureClassifier
model card.
Example command line:
docling --enrich-picture-classes FILE\n
Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.generate_picture_images = True\npipeline_options.images_scale = 2\npipeline_options.do_picture_classification = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n
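The predicted classes are attached to the pictures as annotations. The following is only a sketch for reading them back: the annotation type and field names are assumptions based on docling_core and should be verified against your installed version.
from docling_core.types.doc import PictureItem\nfrom docling_core.types.doc.document import PictureClassificationData\n\n# List the predicted class labels for each classified picture (type and field\n# names are assumptions, check docling_core in your environment).\nfor item, _level in doc.iterate_items():\n    if isinstance(item, PictureItem):\n        for annotation in item.annotations:\n            if isinstance(annotation, PictureClassificationData):\n                print([c.class_name for c in annotation.predicted_classes])\n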
"},{"location":"usage/enrichments/#picture-description","title":"Picture description","text":"The picture description step allows to annotate a picture with a vision model. This is also known as a \"captioning\" task. The Docling pipeline allows to load and run models completely locally as well as connecting to remote API which support the chat template. Below follow a few examples on how to use some common vision model and remote services.
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n
"},{"location":"usage/enrichments/#granite-vision-model","title":"Granite Vision model","text":"Model specs: see the ibm-granite/granite-vision-3.1-2b-preview
model card.
Usage in Docling:
from docling.datamodel.pipeline_options import granite_picture_description\n\npipeline_options.picture_description_options = granite_picture_description\n
"},{"location":"usage/enrichments/#smolvlm-model","title":"SmolVLM model","text":"Model specs: see the HuggingFaceTB/SmolVLM-256M-Instruct
model card.
Usage in Docling:
from docling.datamodel.pipeline_options import smolvlm_picture_description\n\npipeline_options.picture_description_options = smolvlm_picture_description\n
"},{"location":"usage/enrichments/#other-vision-models","title":"Other vision models","text":"The option class PictureDescriptionVlmOptions
 allows using any other model from the Hugging Face Hub.
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n    repo_id=\"\",  # <-- add here the Hugging Face repo_id of your favorite VLM\n    prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n)\n
"},{"location":"usage/enrichments/#remote-vision-model","title":"Remote vision model","text":"The option class PictureDescriptionApiOptions
 allows using models hosted on remote platforms, e.g. local endpoints served by vLLM, Ollama, and others, or cloud providers like IBM watsonx.ai.
Note: in most cases this option will send your data to the remote service provider.
Usage in Docling:
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions\n\n# Enable connections to remote services\npipeline_options.enable_remote_services = True  # <-- this is required!\n\n# Example using a model running locally, e.g. via vLLM\n# $ vllm serve MODEL_NAME\npipeline_options.picture_description_options = PictureDescriptionApiOptions(\n    url=\"http://localhost:8000/v1/chat/completions\",\n    params=dict(\n        model=\"MODEL NAME\",\n        seed=42,\n        max_completion_tokens=200,\n    ),\n    prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n    timeout=90,\n)\n
End-to-end code snippets for cloud providers are available in the examples section:
Besides the implementations of the models listed above, the Docling documentation also provides a few examples dedicated to implementing custom enrichment models.
Docling can parse various document formats into a unified representation (Docling Document), which it can export to different formats too \u2014 check out Architecture for more details.
Below you can find a listing of all supported input and output formats.
"},{"location":"usage/supported_formats/#supported-input-formats","title":"Supported input formats","text":"Format Description PDF DOCX, XLSX, PPTX Default formats in MS Office 2007+, based on Office Open XML Markdown AsciiDoc HTML, XHTML CSV PNG, JPEG, TIFF, BMP, WEBP Image formatsSchema-specific support:
Format Description USPTO XML XML format followed by USPTO patents JATS XML XML format followed by JATS articles Docling JSON JSON-serialized Docling Document"},{"location":"usage/supported_formats/#supported-output-formats","title":"Supported output formats","text":"Format Description HTML Both image embedding and referencing are supported Markdown JSON Lossless serialization of Docling Document Text Plain text, i.e. without Markdown markers Doctags"},{"location":"usage/vision_models/","title":"Vision models","text":"The VlmPipeline
in Docling allows you to convert documents end-to-end using a vision-language model.
Docling supports vision-language models which output:
To run Docling with local models using the VlmPipeline
:
docling --pipeline vlm FILE\n
See also the example minimal_vlm_pipeline.py.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n ),\n }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n
"},{"location":"usage/vision_models/#available-local-models","title":"Available local models","text":"By default, the vision-language models are running locally. Docling allows to choose between the Hugging Face Transformers framework and the MLX (for Apple devices with MPS acceleration) one.
The following table reports the models currently available out-of-the-box.
Model instance Model Framework Device Num pages Inference time (sec)vlm_model_specs.SMOLDOCLING_TRANSFORMERS
ds4sd/SmolDocling-256M-preview Transformers/AutoModelForVision2Seq
MPS 1 102.212 vlm_model_specs.SMOLDOCLING_MLX
ds4sd/SmolDocling-256M-preview-mlx-bf16 MLX
MPS 1 6.15453 vlm_model_specs.QWEN25_VL_3B_MLX
mlx-community/Qwen2.5-VL-3B-Instruct-bf16 MLX
MPS 1 23.4951 vlm_model_specs.PIXTRAL_12B_MLX
mlx-community/pixtral-12b-bf16 MLX
MPS 1 308.856 vlm_model_specs.GEMMA3_12B_MLX
mlx-community/gemma-3-12b-it-bf16 MLX
MPS 1 378.486 vlm_model_specs.GRANITE_VISION_TRANSFORMERS
ibm-granite/granite-vision-3.2-2b Transformers/AutoModelForVision2Seq
MPS 1 104.75 vlm_model_specs.PHI4_TRANSFORMERS
microsoft/Phi-4-multimodal-instruct Transformers/AutoModelForCasualLM
CPU 1 1175.67 vlm_model_specs.PIXTRAL_12B_TRANSFORMERS
mistral-community/pixtral-12b Transformers/AutoModelForVision2Seq
 CPU 1 1828.21 Inference time is computed on a MacBook M3 Max using the example page tests/data/pdf/2305.03393v1-pg9.pdf
. The comparison is done with the example compare_vlm_models.py.
To choose a different model, the code snippet above can be extended as follows.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel import vlm_model_specs\n\npipeline_options = VlmPipelineOptions(\n vlm_options=vlm_model_specs.SMOLDOCLING_MLX, # <-- change the model here\n)\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n
"},{"location":"usage/vision_models/#other-models","title":"Other models","text":"Other models can be configured by directly providing the Hugging Face repo_id
, the prompt and a few more options.
For example:
from docling.datamodel.accelerator_options import AcceleratorDevice\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\nfrom docling.datamodel.pipeline_options_vlm_model import (\n    InferenceFramework,\n    InlineVlmOptions,\n    ResponseFormat,\n    TransformersModelType,\n)\n\npipeline_options = VlmPipelineOptions(\n    vlm_options=InlineVlmOptions(\n        repo_id=\"ibm-granite/granite-vision-3.2-2b\",\n        prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\",\n        response_format=ResponseFormat.MARKDOWN,\n        inference_framework=InferenceFramework.TRANSFORMERS,\n        transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,\n        supported_devices=[\n            AcceleratorDevice.CPU,\n            AcceleratorDevice.CUDA,\n            AcceleratorDevice.MPS,\n        ],\n        scale=2.0,\n        temperature=0.0,\n    )\n)\n
"},{"location":"usage/vision_models/#remote-models","title":"Remote models","text":"Additionally to local models, the VlmPipeline
 allows offloading inference to a remote service hosting the models. Many remote inference services can be used; the key requirement is that they expose an OpenAI-compatible API. This includes vLLM, Ollama, and others.
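As an illustration, the following sketch configures the VlmPipeline against a locally hosted OpenAI-compatible endpoint; the URL, model name, and prompt are placeholder assumptions, and, as with the remote picture description options above, data is sent to the configured endpoint.
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\npipeline_options = VlmPipelineOptions(\n    enable_remote_services=True,  # <-- required when calling a remote endpoint\n    vlm_options=ApiVlmOptions(\n        url=\"http://localhost:11434/v1/chat/completions\",  # e.g. a local Ollama server\n        params=dict(model=\"MODEL_NAME\"),  # <-- placeholder, set the served model name\n        prompt=\"Convert this page to markdown.\",\n        timeout=90,\n        response_format=ResponseFormat.MARKDOWN,\n    ),\n)\n\nconverter = DocumentConverter(\n    format_options={\n        InputFormat.PDF: PdfFormatOption(\n            pipeline_cls=VlmPipeline,\n            pipeline_options=pipeline_options,\n        ),\n    }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n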
More details on connecting to remote inference services can be found in the following examples: