{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Documentation","text":"
Docling simplifies document processing, parsing diverse formats \u2014 including advanced PDF understanding \u2014 and providing seamless integrations with the gen AI ecosystem.
"},{"location":"#getting-started","title":"Getting started","text":"\ud83d\udc23 Ready to kick off your Docling journey? Let's dive right into it!
\u2b07\ufe0f InstallationQuickly install Docling in your environment \u25b6\ufe0f QuickstartGet a jumpstart on basic Docling usage \ud83e\udde9 ConceptsLearn Docling fundamentals and get a glimpse under the hood \ud83e\uddd1\ud83c\udffd\u200d\ud83c\udf73 ExamplesTry out recipes for various use cases, including conversion, RAG, and more \ud83e\udd16 IntegrationsCheck out integrations with popular AI tools and frameworks \ud83d\udcd6 ReferenceSee more API details"},{"location":"#features","title":"Features","text":"\ud83d\ude80 The journey has just begun! Join us and become a part of the growing Docling community.
Do you want to leverage the power of AI and get live support on Docling? Try out the Chat with Dosu functionalities provided by our friends at Dosu.
"},{"location":"#lf-ai-data","title":"LF AI & Data","text":"Docling is hosted as a project in the LF AI & Data Foundation.
"},{"location":"#ibm-open-source-ai","title":"IBM \u2764\ufe0f Open Source AI","text":"The project was started by the AI for knowledge team at IBM Research Zurich.
"},{"location":"v2/","title":"V2","text":""},{"location":"v2/#whats-new","title":"What's new","text":"Docling v2 introduces several new features:
We updated the command line syntax of Docling v2 to support many formats. Examples are shown below.
# Convert a single file to Markdown (default)\ndocling myfile.pdf\n\n# Convert a single file to Markdown and JSON, without OCR\ndocling myfile.pdf --to json --to md --no-ocr\n\n# Convert PDF files in input directory to Markdown (default)\ndocling ./input/dir --from pdf\n\n# Convert PDF and Word files in input directory to Markdown and JSON\ndocling ./input/dir --from pdf --from docx --to md --to json --output ./scratch\n\n# Convert all supported files in input directory to Markdown, but abort on first error\ndocling ./input/dir --output ./scratch --abort-on-error\n Notable changes from Docling v1:
--from and --to arguments, to define input and output formats respectively. --abort-on-error will abort any batch conversion as soon as an error is encountered. The --backend option for PDFs was removed.DocumentConverter","text":"To accommodate many input formats, we changed the way you need to set up your DocumentConverter object. You can now define a list of allowed formats on the DocumentConverter initialization, and specify custom options per-format if desired. By default, all supported formats are allowed. If you don't provide format_options, defaults will be used for all allowed_formats.
Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend. They are provided as format-specific types, such as PdfFormatOption or WordFormatOption, as seen below.
from docling.document_converter import DocumentConverter\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import (\n DocumentConverter,\n PdfFormatOption,\n WordFormatOption,\n)\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\n\n## Default initialization still works as before:\n# doc_converter = DocumentConverter()\n\n\n# previous `PipelineOptions` is now `PdfPipelineOptions`\npipeline_options = PdfPipelineOptions()\npipeline_options.do_ocr = False\npipeline_options.do_table_structure = True\n#...\n\n## Custom options are now defined per format.\ndoc_converter = (\n DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[\n InputFormat.PDF,\n InputFormat.IMAGE,\n InputFormat.DOCX,\n InputFormat.HTML,\n InputFormat.PPTX,\n ], # whitelist formats, non-matching files are ignored.\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options, # pipeline options go here.\n backend=PyPdfiumDocumentBackend # optional: pick an alternative backend\n ),\n InputFormat.DOCX: WordFormatOption(\n pipeline_cls=SimplePipeline # default for office formats and HTML\n ),\n },\n )\n)\n Note: If you work only with defaults, all remains the same as in Docling v1.
More options are shown in the following examples:
We have simplified the way you can feed input to the DocumentConverter and renamed the conversion methods for better semantics. You can now call the conversion directly with a single file, a list of input files, or DocumentStream objects, without constructing a DocumentConversionInput object first.
DocumentConverter.convert now converts a single file input (previously DocumentConverter.convert_single).DocumentConverter.convert_all now converts many files at once (previously DocumentConverter.convert)....\nfrom docling.datamodel.document import ConversionResult\n## Convert a single file (from URL or local path)\nconv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Convert several files at once:\n\ninput_files = [\n \"tests/data/html/wiki_duck.html\",\n \"tests/data/docx/word_sample.docx\",\n \"tests/data/docx/lorem_ipsum.docx\",\n \"tests/data/pptx/powerpoint_sample.pptx\",\n \"tests/data/2305.03393v1-pg9-img.png\",\n \"tests/data/pdf/2206.01062.pdf\",\n]\n\n# Directly pass list of files or streams to `convert_all`\nconv_results_iter = doc_converter.convert_all(input_files) # previously `convert`\n Through the raises_on_error argument, you can also control if the conversion should raise exceptions when first encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status. By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed). ...\nconv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False) # previously `convert`\n"},{"location":"v2/#access-document-structures","title":"Access document structures","text":"We have simplified how you can access and export the converted document data, too. Our universal document representation is now available in conversion results as a DoclingDocument object. DoclingDocument provides a neat set of APIs to construct, iterate and export content in the document, as shown below.
import pandas as pd\nfrom docling_core.types.doc import TextItem, TableItem\n\nconv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Inspect the converted document:\nconv_result.document.print_element_tree()\n\n## Iterate the elements in reading order, including hierarchy level:\nfor item, level in conv_result.document.iterate_items():\n if isinstance(item, TextItem):\n print(item.text)\n elif isinstance(item, TableItem):\n table_df: pd.DataFrame = item.export_to_dataframe(doc=conv_result.document)\n print(table_df.to_markdown())\n elif ...:\n #...\n Note: While it is deprecated, you can still work with the Docling v1 document representation, it is available as:
conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type\n"},{"location":"v2/#export-into-json-markdown-doctags","title":"Export into JSON, Markdown, Doctags","text":"Note: All render_... methods in ConversionResult have been removed in Docling v2, and are now available on DoclingDocument as:
DoclingDocument.export_to_dict, DoclingDocument.export_to_markdown, DoclingDocument.export_to_document_tokens\nconv_res: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Export to desired format:\nprint(json.dumps(conv_res.document.export_to_dict()))\nprint(conv_res.document.export_to_markdown())\nprint(conv_res.document.export_to_document_tokens())\n Note: While it is deprecated, you can still export the Docling v1 JSON format. This is available through the same methods as on the DoclingDocument type:
## Export legacy document representation to desired format, for v1 compatibility:\nprint(json.dumps(conv_res.legacy_document.export_to_dict()))\nprint(conv_res.legacy_document.export_to_markdown())\nprint(conv_res.legacy_document.export_to_document_tokens())\n"},{"location":"v2/#reload-a-doclingdocument-stored-as-json","title":"Reload a DoclingDocument stored as JSON","text":"You can save and reload a DoclingDocument to disk in JSON format using the following code:
# Save to disk:\ndoc: DoclingDocument = conv_res.document # produced from conversion result...\n\nwith Path(\"./doc.json\").open(\"w\") as fp:\n fp.write(json.dumps(doc.export_to_dict())) # use `export_to_dict` to ensure consistency\n\n# Load from disk:\nwith Path(\"./doc.json\").open(\"r\") as fp:\n doc_dict = json.loads(fp.read())\n doc = DoclingDocument.model_validate(doc_dict) # use standard pydantic API to populate doc\n"},{"location":"v2/#chunking","title":"Chunking","text":"Docling v2 defines new base classes for chunking:
BaseMeta for chunk metadata, BaseChunk containing the chunk text and metadata, and BaseChunker for chunkers, producing chunks out of a DoclingDocument. Additionally, it provides an updated HierarchicalChunker implementation, which leverages the new DoclingDocument and provides a new, richer chunk output format, including:
For an example, check out Chunking usage.
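As a minimal sketch (assuming conv_res is a ConversionResult obtained as in the sections above): from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker\n\nchunker = HierarchicalChunker() # list items are merged by default\nfor chunk in chunker.chunk(dl_doc=conv_res.document):\n    print(chunk.text) # the chunk text\n    print(chunk.meta) # richer metadata, e.g. enclosing headings and captions\n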
"},{"location":"concepts/","title":"Concepts","text":"In this space, you can peek under the hood and learn some fundamental Docling concepts!
Here are some of our picks to get you started:
\ud83d\udc48 ... and there is much more: explore all the concepts using the navigation menu on the side
Docling architecture outline"},{"location":"concepts/architecture/","title":"Architecture","text":"In a nutshell, Docling's architecture is outlined in the diagram above.
For each document format, the document converter knows which format-specific backend to employ for parsing the document and which pipeline to use for orchestrating the execution, along with any relevant options.
Tip
While the document converter holds a default mapping, this configuration is parametrizable, so e.g. for the PDF format, different backends and different pipeline options can be used \u2014 see Usage.
The conversion result contains the Docling document, Docling's fundamental document representation.
Some typical scenarios for using a Docling document include directly calling its export methods, such as for markdown, dictionary etc., or having it serialized by a serializer or chunked by a chunker.
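As an illustrative sketch of this flow (the input file name is hypothetical; the default pipeline, backend, and options are used): from docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter() # default format-to-pipeline/backend mapping\nresult = converter.convert(\"report.pdf\") # hypothetical input file\ndoc = result.document # the Docling document\nprint(doc.export_to_markdown()) # direct export; the same document can be passed to a serializer or chunker\n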
For more details on Docling's architecture, check out the Docling Technical Report.
Note
The components illustrated with dashed outline indicate base classes that can be subclassed for specialized implementations.
"},{"location":"concepts/chunking/","title":"Chunking","text":""},{"location":"concepts/chunking/#introduction","title":"Introduction","text":"Chunking approaches
Starting from a DoclingDocument, there are in principle two possible chunking approaches:
exporting the DoclingDocument to Markdown (or a similar format) and then performing user-defined chunking as a post-processing step, or using native Docling chunkers operating directly on the DoclingDocument. This page is about the latter, i.e. using native Docling chunkers. For an example of approach (1), check out e.g. this recipe looking at the Markdown export mode.
A chunker is a Docling abstraction that, given a DoclingDocument, returns a stream of chunks, each of which captures some part of the document as a string accompanied by respective metadata.
To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a chunker class hierarchy, providing a base type, BaseChunker, as well as specific subclasses.
Docling integration with gen AI frameworks like LlamaIndex is done using the BaseChunker interface, so users can easily plug in any built-in, self-defined, or third-party BaseChunker implementation.
The BaseChunker base class API defines that any chunker should provide the following:
def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]: Returning the chunks for the provided document. def contextualize(self, chunk: BaseChunk) -> str: Returning the potentially metadata-enriched serialization of the chunk, typically used to feed an embedding model (or generation model). To access HybridChunker
If you are using the docling package, you can import as follows: from docling.chunking import HybridChunker\nIf you are using the docling-core package, you must install the chunking extra if you want to use HuggingFace tokenizers, e.g. pip install 'docling-core[chunking]'\n or the chunking-openai extra if you prefer OpenAI tokenizers (tiktoken), e.g. pip install 'docling-core[chunking-openai]'\n and then you can import as follows: from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\nThe HybridChunker implementation uses a hybrid approach, applying tokenization-aware refinements on top of document-based hierarchical chunking.
More precisely:
merge_peers (by default True)\ud83d\udc49 Usage examples:
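For instance, a minimal sketch (assuming doc is a DoclingDocument obtained from a prior conversion; default tokenizer settings are used): from docling.chunking import HybridChunker\n\nchunker = HybridChunker() # default tokenizer, merge_peers=True\nfor chunk in chunker.chunk(dl_doc=doc):\n    enriched_text = chunker.contextualize(chunk=chunk) # metadata-enriched text, e.g. to feed an embedding model\n    print(enriched_text)\n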
The HierarchicalChunker implementation uses the document structure information from the DoclingDocument to create one chunk for each individual detected document element, by default only merging together list items (this can be disabled via the merge_list_items parameter). It also takes care of attaching all relevant document metadata, including headers and captions.
Confidence grades were introduced in v2.34.0 to help users understand how well a conversion performed and guide decisions about post-processing workflows. They are available in the confidence field of the ConversionResult object returned by the document converter.
Complex layouts, poor scan quality, or challenging formatting can lead to suboptimal document conversion results that may require additional attention or alternative conversion pipelines.
Confidence scores provide a quantitative assessment of document conversion quality. Each confidence report includes a numerical score (0.0 to 1.0) measuring conversion accuracy, and a quality grade (poor, fair, good, excellent) for quick interpretation.
Focus on quality grades!
Users can and should safely focus on the document-level grade fields \u2014 mean_grade and low_grade \u2014 to assess overall conversion quality. Numerical scores are used internally and are for informational purposes only; their computation and weighting may change in the future.
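For example, a sketch of checking the document-level grades after a conversion (field names as described below; the input file name is hypothetical): from docling.document_converter import DocumentConverter\n\nconv_res = DocumentConverter().convert(\"report.pdf\") # hypothetical input file\nconfidence = conv_res.confidence # confidence report of this conversion\nprint(confidence.mean_grade) # overall quality grade\nprint(confidence.low_grade) # grade highlighting the worst-performing areas\nprint(confidence.pages) # per-page confidence reports\n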
Use cases for confidence grades include:
A confidence report contains scores and grades:
POOR, FAIR, GOOD, EXCELLENT. Each confidence report includes four component scores and grades:
layout_score: Overall quality of document element recognition. ocr_score: Quality of OCR-extracted content. parse_score: 10th percentile score of digital text cells (emphasizes problem areas). table_score: Table extraction quality (not yet implemented). Two aggregate grades provide overall document quality assessment:
mean_grade: Average of the four component scores. low_grade: 5th percentile score (highlights worst-performing areas). Confidence grades are calculated at two levels:
at the document level, for the conversion as a whole, and at the page level, with per-page reports (ConfidenceReport) stored in the pages field. With Docling v2, we introduced a unified document representation format called DoclingDocument. It is defined as a pydantic datatype, which can express several features common to documents, such as:
The definition of the Pydantic types is implemented in the module docling_core.types.doc; more details can be found in the source code definitions.
It also brings a set of document construction APIs to build up a DoclingDocument from scratch.
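A minimal sketch of building a document from scratch (assuming the add_heading and add_text construction helpers of the installed docling-core version; the document name is arbitrary): from docling_core.types.doc.document import DoclingDocument\nfrom docling_core.types.doc.labels import DocItemLabel\n\ndoc = DoclingDocument(name=\"scratch_example\") # arbitrary document name\ndoc.add_heading(text=\"Lorem ipsum\") # assumed construction helper\ndoc.add_text(label=DocItemLabel.TEXT, text=\"Dolor sit amet.\") # assumed construction helper\nprint(doc.export_to_markdown())\n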
To illustrate the features of the DoclingDocument format, in the subsections below we consider the DoclingDocument converted from tests/data/word_sample.docx and we present some side-by-side comparisons, where the left side shows snippets from the converted document serialized as YAML and the right one shows the corresponding parts of the original MS Word document.
A DoclingDocument exposes top-level fields for the document content, organized in two categories. The first category is the content items, which are stored in these fields:
texts: All items that have a text representation (paragraph, section heading, equation, ...). Base class is TextItem. tables: All tables, type TableItem. Can carry structure annotations. pictures: All pictures, type PictureItem. Can carry structure annotations. key_value_items: All key-value items. All of the above fields are lists and store items inheriting from the DocItem type. They can express different data structures depending on their type, and reference parents and children through JSON pointers.
The second category is content structure, which is encapsulated in:
body: The root node of a tree structure for the main document body. furniture: The root node of a tree structure for all items that don't belong in the body (headers, footers, ...). groups: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter). All of the above fields store only NodeItem instances, which reference children and parents through JSON pointers.
The reading order of the document is encapsulated through the body tree and the order of children in each item in the tree.
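For instance, a sketch of walking the direct children of the body root (assuming the RefItem.resolve helper of docling-core; doc is a DoclingDocument): # doc: a DoclingDocument from a prior conversion or load\nfor ref in doc.body.children: # children are stored as JSON pointer references, in reading order\n    item = ref.resolve(doc=doc) # follow the pointer, e.g. \"#/texts/1\"\n    print(ref.cref, type(item).__name__)\n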
The example below shows how all items on the first page are nested below the title item (#/texts/1).
The next example shows how all items under the heading \"Let's swim\" (#/texts/5) are nested as children. The children of \"Let's swim\" are both text items and groups, which contain the list elements. The group items are stored in the top-level groups field.
Docling can be extended with third-party plugins, which broaden the choice of options available in several steps of the pipeline.
Plugins are loaded via the pluggy system, which allows third-party developers to register new capabilities using the setuptools entrypoint.
The actual entrypoint definition might vary, depending on the packaging system you are using. Here are a few examples:
pyproject.toml | poetry v1 pyproject.toml | setup.cfg | setup.py [project.entry-points.\"docling\"]\nyour_plugin_name = \"your_package.module\"\n [tool.poetry.plugins.\"docling\"]\nyour_plugin_name = \"your_package.module\"\n [options.entry_points]\ndocling =\n your_plugin_name = your_package.module\n from setuptools import setup\n\nsetup(\n # ...,\n entry_points = {\n 'docling': [\n 'your_plugin_name = your_package.module'\n ]\n }\n)\n your_plugin_name is the name you choose for your plugin. This must be unique across the broader Docling ecosystem. your_package.module is the reference to the module in your package which is responsible for the plugin registration. The OCR factory makes it possible to provide additional OCR engines to Docling users.
The content of your_package.module registers the OCR engines with code similar to:
# Factory registration\ndef ocr_engines():\n return {\n \"ocr_engines\": [\n YourOcrModel,\n ]\n }\n where YourOcrModel must implement the BaseOcrModel and provide an options class derived from OcrOptions.
If you are looking for an example, the default Docling plugins are a good starting point.
"},{"location":"concepts/plugins/#third-party-plugins","title":"Third-party plugins","text":"When the plugin is not provided by the main docling package but by a third-party package this have to be enabled explicitly via the allow_external_plugins option.
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions()\npipeline_options.allow_external_plugins = True # <-- enable external plugins\npipeline_options.ocr_options = YourOptions # <-- your options here\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options\n )\n }\n)\n"},{"location":"concepts/plugins/#using-the-docling-cli","title":"Using the docling CLI","text":"Similarly, when using the docling CLI, users have to enable external plugins before selecting the new one.
# Show the external plugins\ndocling --show-external-plugins\n\n# Run docling with the new plugin\ndocling --allow-external-plugins --ocr-engine=NAME\n"},{"location":"concepts/serialization/","title":"Serialization","text":""},{"location":"concepts/serialization/#introduction","title":"Introduction","text":"A document serializer (AKA simply serializer) is a Docling abstraction that is initialized with a given DoclingDocument and returns a textual representation for that document.
Besides the document serializer, Docling defines similar abstractions for several document subcomponents, for example: text serializer, table serializer, picture serializer, list serializer, inline serializer, and more.
Last but not least, a serializer provider is a wrapper that abstracts the document serialization strategy from the document instance.
"},{"location":"concepts/serialization/#base-classes","title":"Base classes","text":"To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a serialization class hierarchy, providing:
BaseDocSerializer, as well as BaseTextSerializer, BaseTableSerializer etc., and BaseSerializerProvider, and MarkdownDocSerializer. You can review all methods required to define the above base classes here.
From a client perspective, the most relevant is BaseDocSerializer.serialize(), which returns the textual representation,\u00a0as well as relevant metadata on which document components contributed to that serialization.
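For instance, a minimal sketch using the predefined Markdown serializer (doc being a DoclingDocument; see also the export method shorthands below): from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\n\nserializer = MarkdownDocSerializer(doc=doc) # bind the serializer to a document instance\nser_result = serializer.serialize() # returns a SerializationResult\nprint(ser_result.text) # the textual representation\n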
DoclingDocument export methods","text":"Docling provides predefined serializers for Markdown, HTML, and DocTags.
The respective DoclingDocument export methods (e.g. export_to_markdown()) are provided as user shorthands \u2014 internally directly instantiating and delegating to respective serializers.
For an example showcasing how to use serializers, see here.
"},{"location":"examples/","title":"Examples","text":"In this space, you can explore numerous Docling application recipes & end-to-end workflows!
Here are some of our picks to get you started:
\ud83d\udc48 ... and there is much more: explore all the examples using the navigation menu on the side
Visual grounding Picture annotations"},{"location":"examples/advanced_chunking_and_serialization/","title":"Advanced chunking & serialization","text":"In this notebook we show how to customize the serialization strategies that come into play during chunking.
We will work with a document that contains some picture annotations:
In\u00a0[1]: Copied!from docling_core.types.doc.document import DoclingDocument\n\nSOURCE = \"./data/2408.09869v3_enriched.json\"\n\ndoc = DoclingDocument.load_from_json(SOURCE)\nfrom docling_core.types.doc.document import DoclingDocument SOURCE = \"./data/2408.09869v3_enriched.json\" doc = DoclingDocument.load_from_json(SOURCE)
Below we define the chunker (for more details check out Hybrid Chunking):
In\u00a0[2]: Copied!from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\nfrom docling_core.transforms.chunker.tokenizer.base import BaseTokenizer\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom transformers import AutoTokenizer\n\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n\ntokenizer: BaseTokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n)\nchunker = HybridChunker(tokenizer=tokenizer)\nfrom docling_core.transforms.chunker.hybrid_chunker import HybridChunker from docling_core.transforms.chunker.tokenizer.base import BaseTokenizer from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" tokenizer: BaseTokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), ) chunker = HybridChunker(tokenizer=tokenizer) In\u00a0[3]: Copied!
print(f\"{tokenizer.get_max_tokens()=}\")\n print(f\"{tokenizer.get_max_tokens()=}\") tokenizer.get_max_tokens()=512\n
Defining some helper methods:
In\u00a0[4]: Copied!from typing import Iterable, Optional\n\nfrom docling_core.transforms.chunker.base import BaseChunk\nfrom docling_core.transforms.chunker.hierarchical_chunker import DocChunk\nfrom docling_core.types.doc.labels import DocItemLabel\nfrom rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console(\n width=200, # for getting Markdown tables rendered nicely\n)\n\n\ndef find_n_th_chunk_with_label(\n iter: Iterable[BaseChunk], n: int, label: DocItemLabel\n) -> Optional[DocChunk]:\n num_found = -1\n for i, chunk in enumerate(iter):\n doc_chunk = DocChunk.model_validate(chunk)\n for it in doc_chunk.meta.doc_items:\n if it.label == label:\n num_found += 1\n if num_found == n:\n return i, chunk\n return None, None\n\n\ndef print_chunk(chunks, chunk_pos):\n chunk = chunks[chunk_pos]\n ctx_text = chunker.contextualize(chunk=chunk)\n num_tokens = tokenizer.count_tokens(text=ctx_text)\n doc_items_refs = [it.self_ref for it in chunk.meta.doc_items]\n title = f\"{chunk_pos=} {num_tokens=} {doc_items_refs=}\"\n console.print(Panel(ctx_text, title=title))\n from typing import Iterable, Optional from docling_core.transforms.chunker.base import BaseChunk from docling_core.transforms.chunker.hierarchical_chunker import DocChunk from docling_core.types.doc.labels import DocItemLabel from rich.console import Console from rich.panel import Panel console = Console( width=200, # for getting Markdown tables rendered nicely ) def find_n_th_chunk_with_label( iter: Iterable[BaseChunk], n: int, label: DocItemLabel ) -> Optional[DocChunk]: num_found = -1 for i, chunk in enumerate(iter): doc_chunk = DocChunk.model_validate(chunk) for it in doc_chunk.meta.doc_items: if it.label == label: num_found += 1 if num_found == n: return i, chunk return None, None def print_chunk(chunks, chunk_pos): chunk = chunks[chunk_pos] ctx_text = chunker.contextualize(chunk=chunk) num_tokens = tokenizer.count_tokens(text=ctx_text) doc_items_refs = [it.self_ref for it in chunk.meta.doc_items] title = f\"{chunk_pos=} {num_tokens=} {doc_items_refs=}\" console.print(Panel(ctx_text, title=title)) Below we inspect the first chunk containing a table \u2014 using the default serialization strategy:
In\u00a0[5]: Copied!chunker = HybridChunker(tokenizer=tokenizer)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nchunker = HybridChunker(tokenizer=tokenizer) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE) print_chunk( chunks=chunks, chunk_pos=i, )
Token indices sequence length is longer than the specified maximum sequence length for this model (652 > 512). Running this sequence through the model will result in indexing errors\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=13 num_tokens=426 doc_items_refs=['#/texts/72', '#/tables/0'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 4 Performance \u2502\n\u2502 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u2502\n\u2502 (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. \u2502\n\u2502 \u2502\n\u2502 Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max, \u2502\n\u2502 pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 \u2502\n\u2502 cores) Intel(R) Xeon E5-2690, native backend.TTS = 375 s 244 s. (16 cores) Intel(R) Xeon E5-2690, native backend.Pages/s = 0.60 0.92. (16 cores) Intel(R) Xeon E5-2690, native backend.Mem = 6.16 \u2502\n\u2502 GB. (16 cores) Intel(R) Xeon E5-2690, pypdfium backend.TTS = 239 s 143 s. (16 cores) Intel(R) Xeon E5-2690, pypdfium backend.Pages/s = 0.94 1.57. (16 cores) Intel(R) Xeon E5-2690, pypdfium \u2502\n\u2502 backend.Mem = 2.42 GB \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nINFO: As you see above, using the
HybridChunker can sometimes lead to a warning from the transformers library; however, this is a \"false alarm\" \u2014 for details check here. We can configure a different serialization strategy. In the example below, we specify a different table serializer that serializes tables to Markdown instead of the triplet notation used by default:
In\u00a0[6]: Copied!from docling_core.transforms.chunker.hierarchical_chunker import (\n ChunkingDocSerializer,\n ChunkingSerializerProvider,\n)\nfrom docling_core.transforms.serializer.markdown import MarkdownTableSerializer\n\n\nclass MDTableSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n table_serializer=MarkdownTableSerializer(), # configuring a different table serializer\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=MDTableSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nfrom docling_core.transforms.chunker.hierarchical_chunker import ( ChunkingDocSerializer, ChunkingSerializerProvider, ) from docling_core.transforms.serializer.markdown import MarkdownTableSerializer class MDTableSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, table_serializer=MarkdownTableSerializer(), # configuring a different table serializer ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=MDTableSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=13 num_tokens=431 doc_items_refs=['#/texts/72', '#/tables/0'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 4 Performance \u2502\n\u2502 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u2502\n\u2502 (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. \u2502\n\u2502 \u2502\n\u2502 | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | \u2502\n\u2502 |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| \u2502\n\u2502 | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | \u2502\n\u2502 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | \u2502\n\u2502 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Below we inspect the first chunk containing a picture.
Even when using the default strategy, we can modify the relevant parameters, e.g. which placeholder is used for pictures:
In\u00a0[7]: Copied!from docling_core.transforms.serializer.markdown import MarkdownParams\n\n\nclass ImgPlaceholderSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n params=MarkdownParams(\n image_placeholder=\"<!-- image -->\",\n ),\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=ImgPlaceholderSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nfrom docling_core.transforms.serializer.markdown import MarkdownParams class ImgPlaceholderSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, params=MarkdownParams( image_placeholder=\"\", ), ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=ImgPlaceholderSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=0 num_tokens=117 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 <!-- image --> \u2502\n\u2502 Version 1.0 \u2502\n\u2502 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta \u2502\n\u2502 Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar \u2502\n\u2502 AI4K Group, IBM Research R\u00a8 uschlikon, Switzerland \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Below we define and use our custom picture serialization strategy which leverages picture annotations:
In\u00a0[8]: Copied!from typing import Any\n\nfrom docling_core.transforms.serializer.base import (\n BaseDocSerializer,\n SerializationResult,\n)\nfrom docling_core.transforms.serializer.common import create_ser_result\nfrom docling_core.transforms.serializer.markdown import MarkdownPictureSerializer\nfrom docling_core.types.doc.document import (\n PictureClassificationData,\n PictureDescriptionData,\n PictureItem,\n PictureMoleculeData,\n)\nfrom typing_extensions import override\n\n\nclass AnnotationPictureSerializer(MarkdownPictureSerializer):\n @override\n def serialize(\n self,\n *,\n item: PictureItem,\n doc_serializer: BaseDocSerializer,\n doc: DoclingDocument,\n **kwargs: Any,\n ) -> SerializationResult:\n text_parts: list[str] = []\n for annotation in item.annotations:\n if isinstance(annotation, PictureClassificationData):\n predicted_class = (\n annotation.predicted_classes[0].class_name\n if annotation.predicted_classes\n else None\n )\n if predicted_class is not None:\n text_parts.append(f\"Picture type: {predicted_class}\")\n elif isinstance(annotation, PictureMoleculeData):\n text_parts.append(f\"SMILES: {annotation.smi}\")\n elif isinstance(annotation, PictureDescriptionData):\n text_parts.append(f\"Picture description: {annotation.text}\")\n\n text_res = \"\\n\".join(text_parts)\n text_res = doc_serializer.post_process(text=text_res)\n return create_ser_result(text=text_res, span_source=item)\n from typing import Any from docling_core.transforms.serializer.base import ( BaseDocSerializer, SerializationResult, ) from docling_core.transforms.serializer.common import create_ser_result from docling_core.transforms.serializer.markdown import MarkdownPictureSerializer from docling_core.types.doc.document import ( PictureClassificationData, PictureDescriptionData, PictureItem, PictureMoleculeData, ) from typing_extensions import override class AnnotationPictureSerializer(MarkdownPictureSerializer): @override def serialize( self, *, item: PictureItem, doc_serializer: BaseDocSerializer, doc: DoclingDocument, **kwargs: Any, ) -> SerializationResult: text_parts: list[str] = [] for annotation in item.annotations: if isinstance(annotation, PictureClassificationData): predicted_class = ( annotation.predicted_classes[0].class_name if annotation.predicted_classes else None ) if predicted_class is not None: text_parts.append(f\"Picture type: {predicted_class}\") elif isinstance(annotation, PictureMoleculeData): text_parts.append(f\"SMILES: {annotation.smi}\") elif isinstance(annotation, PictureDescriptionData): text_parts.append(f\"Picture description: {annotation.text}\") text_res = \"\\n\".join(text_parts) text_res = doc_serializer.post_process(text=text_res) return create_ser_result(text=text_res, span_source=item) In\u00a0[9]: Copied! 
class ImgAnnotationSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc: DoclingDocument):\n return ChunkingDocSerializer(\n doc=doc,\n picture_serializer=AnnotationPictureSerializer(), # configuring a different picture serializer\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=ImgAnnotationSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nclass ImgAnnotationSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc: DoclingDocument): return ChunkingDocSerializer( doc=doc, picture_serializer=AnnotationPictureSerializer(), # configuring a different picture serializer ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=ImgAnnotationSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=0 num_tokens=128 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 Picture description: In this image we can see a cartoon image of a duck holding a paper. \u2502\n\u2502 Version 1.0 \u2502\n\u2502 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta \u2502\n\u2502 Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar \u2502\n\u2502 AI4K Group, IBM Research R\u00a8 uschlikon, Switzerland \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/advanced_chunking_and_serialization/#advanced-chunking-serialization","title":"Advanced chunking & serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#overview","title":"Overview\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#setup","title":"Setup\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#table-serialization","title":"Table serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-the-default-strategy","title":"Using the default strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#configuring-a-different-strategy","title":"Configuring a different strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#picture-serialization","title":"Picture serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-the-default-strategy","title":"Using the default strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-a-custom-strategy","title":"Using a custom strategy\u00b6","text":""},{"location":"examples/asr_pipeline_performance_comparison/","title":"Asr pipeline performance comparison","text":"In\u00a0[\u00a0]: Copied!
\"\"\"\nPerformance comparison between CPU and MLX Whisper on Apple Silicon.\n\nThis script compares the performance of:\n1. Native Whisper (forced to CPU)\n2. MLX Whisper (Apple Silicon optimized)\n\nBoth use the same model size for fair comparison.\n\"\"\"\n\"\"\" Performance comparison between CPU and MLX Whisper on Apple Silicon. This script compares the performance of: 1. Native Whisper (forced to CPU) 2. MLX Whisper (Apple Silicon optimized) Both use the same model size for fair comparison. \"\"\" In\u00a0[\u00a0]: Copied!
import argparse\nimport sys\nimport time\nfrom pathlib import Path\nimport argparse import sys import time from pathlib import Path In\u00a0[\u00a0]: Copied!
# Add the repository root to the path so we can import docling\nsys.path.insert(0, str(Path(__file__).parent.parent.parent))\n# Add the repository root to the path so we can import docling sys.path.insert(0, str(Path(__file__).parent.parent.parent)) In\u00a0[\u00a0]: Copied!
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import AsrPipelineOptions\nfrom docling.datamodel.pipeline_options_asr_model import (\n InferenceAsrFramework,\n InlineAsrMlxWhisperOptions,\n InlineAsrNativeWhisperOptions,\n)\nfrom docling.document_converter import AudioFormatOption, DocumentConverter\nfrom docling.pipeline.asr_pipeline import AsrPipeline\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import AsrPipelineOptions from docling.datamodel.pipeline_options_asr_model import ( InferenceAsrFramework, InlineAsrMlxWhisperOptions, InlineAsrNativeWhisperOptions, ) from docling.document_converter import AudioFormatOption, DocumentConverter from docling.pipeline.asr_pipeline import AsrPipeline In\u00a0[\u00a0]: Copied!
def create_cpu_whisper_options(model_size: str = \"turbo\"):\n \"\"\"Create native Whisper options forced to CPU.\"\"\"\n return InlineAsrNativeWhisperOptions(\n repo_id=model_size,\n inference_framework=InferenceAsrFramework.WHISPER,\n verbose=True,\n timestamps=True,\n word_timestamps=True,\n temperature=0.0,\n max_new_tokens=256,\n max_time_chunk=30.0,\n )\ndef create_cpu_whisper_options(model_size: str = \"turbo\"): \"\"\"Create native Whisper options forced to CPU.\"\"\" return InlineAsrNativeWhisperOptions( repo_id=model_size, inference_framework=InferenceAsrFramework.WHISPER, verbose=True, timestamps=True, word_timestamps=True, temperature=0.0, max_new_tokens=256, max_time_chunk=30.0, ) In\u00a0[\u00a0]: Copied!
def create_mlx_whisper_options(model_size: str = \"turbo\"):\n \"\"\"Create MLX Whisper options for Apple Silicon.\"\"\"\n model_map = {\n \"tiny\": \"mlx-community/whisper-tiny-mlx\",\n \"small\": \"mlx-community/whisper-small-mlx\",\n \"base\": \"mlx-community/whisper-base-mlx\",\n \"medium\": \"mlx-community/whisper-medium-mlx-8bit\",\n \"large\": \"mlx-community/whisper-large-mlx-8bit\",\n \"turbo\": \"mlx-community/whisper-turbo\",\n }\n\n return InlineAsrMlxWhisperOptions(\n repo_id=model_map[model_size],\n inference_framework=InferenceAsrFramework.MLX,\n language=\"en\",\n task=\"transcribe\",\n word_timestamps=True,\n no_speech_threshold=0.6,\n logprob_threshold=-1.0,\n compression_ratio_threshold=2.4,\n )\n def create_mlx_whisper_options(model_size: str = \"turbo\"): \"\"\"Create MLX Whisper options for Apple Silicon.\"\"\" model_map = { \"tiny\": \"mlx-community/whisper-tiny-mlx\", \"small\": \"mlx-community/whisper-small-mlx\", \"base\": \"mlx-community/whisper-base-mlx\", \"medium\": \"mlx-community/whisper-medium-mlx-8bit\", \"large\": \"mlx-community/whisper-large-mlx-8bit\", \"turbo\": \"mlx-community/whisper-turbo\", } return InlineAsrMlxWhisperOptions( repo_id=model_map[model_size], inference_framework=InferenceAsrFramework.MLX, language=\"en\", task=\"transcribe\", word_timestamps=True, no_speech_threshold=0.6, logprob_threshold=-1.0, compression_ratio_threshold=2.4, ) In\u00a0[\u00a0]: Copied! def run_transcription_test(\n audio_file: Path, asr_options, device: AcceleratorDevice, test_name: str\n):\n \"\"\"Run a single transcription test and return timing results.\"\"\"\n print(f\"\\n{'=' * 60}\")\n print(f\"Running {test_name}\")\n print(f\"Device: {device}\")\n print(f\"Model: {asr_options.repo_id}\")\n print(f\"Framework: {asr_options.inference_framework}\")\n print(f\"{'=' * 60}\")\n\n # Create pipeline options\n pipeline_options = AsrPipelineOptions(\n accelerator_options=AcceleratorOptions(device=device),\n asr_options=asr_options,\n )\n\n # Create document converter\n converter = DocumentConverter(\n format_options={\n InputFormat.AUDIO: AudioFormatOption(\n pipeline_cls=AsrPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n\n # Run transcription with timing\n start_time = time.time()\n try:\n result = converter.convert(audio_file)\n end_time = time.time()\n\n duration = end_time - start_time\n\n if result.status.value == \"success\":\n # Extract text for verification\n text_content = []\n for item in result.document.texts:\n text_content.append(item.text)\n\n print(f\"\u2705 Success! Duration: {duration:.2f} seconds\")\n print(f\"Transcribed text: {''.join(text_content)[:100]}...\")\n return duration, True\n else:\n print(f\"\u274c Failed! 
Status: {result.status}\")\n return duration, False\n\n except Exception as e:\n end_time = time.time()\n duration = end_time - start_time\n print(f\"\u274c Error: {e}\")\n return duration, False\n def run_transcription_test( audio_file: Path, asr_options, device: AcceleratorDevice, test_name: str ): \"\"\"Run a single transcription test and return timing results.\"\"\" print(f\"\\n{'=' * 60}\") print(f\"Running {test_name}\") print(f\"Device: {device}\") print(f\"Model: {asr_options.repo_id}\") print(f\"Framework: {asr_options.inference_framework}\") print(f\"{'=' * 60}\") # Create pipeline options pipeline_options = AsrPipelineOptions( accelerator_options=AcceleratorOptions(device=device), asr_options=asr_options, ) # Create document converter converter = DocumentConverter( format_options={ InputFormat.AUDIO: AudioFormatOption( pipeline_cls=AsrPipeline, pipeline_options=pipeline_options, ) } ) # Run transcription with timing start_time = time.time() try: result = converter.convert(audio_file) end_time = time.time() duration = end_time - start_time if result.status.value == \"success\": # Extract text for verification text_content = [] for item in result.document.texts: text_content.append(item.text) print(f\"\u2705 Success! Duration: {duration:.2f} seconds\") print(f\"Transcribed text: {''.join(text_content)[:100]}...\") return duration, True else: print(f\"\u274c Failed! Status: {result.status}\") return duration, False except Exception as e: end_time = time.time() duration = end_time - start_time print(f\"\u274c Error: {e}\") return duration, False In\u00a0[\u00a0]: Copied! def parse_args():\n \"\"\"Parse command line arguments.\"\"\"\n parser = argparse.ArgumentParser(\n description=\"Performance comparison between CPU and MLX Whisper on Apple Silicon\",\n formatter_class=argparse.RawDescriptionHelpFormatter,\n epilog=\"\"\"\nExamples:\n\n# Use default test audio file\npython asr_pipeline_performance_comparison.py\n\n# Use your own audio file\npython asr_pipeline_performance_comparison.py --audio /path/to/your/audio.mp3\n\n# Use a different audio file from the tests directory\npython asr_pipeline_performance_comparison.py --audio tests/data/audio/another_sample.wav\n \"\"\",\n )\n\n parser.add_argument(\n \"--audio\",\n type=str,\n help=\"Path to audio file for testing (default: tests/data/audio/sample_10s.mp3)\",\n )\n\n return parser.parse_args()\ndef parse_args(): \"\"\"Parse command line arguments.\"\"\" parser = argparse.ArgumentParser( description=\"Performance comparison between CPU and MLX Whisper on Apple Silicon\", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=\"\"\" Examples: # Use default test audio file python asr_pipeline_performance_comparison.py # Use your own audio file python asr_pipeline_performance_comparison.py --audio /path/to/your/audio.mp3 # Use a different audio file from the tests directory python asr_pipeline_performance_comparison.py --audio tests/data/audio/another_sample.wav \"\"\", ) parser.add_argument( \"--audio\", type=str, help=\"Path to audio file for testing (default: tests/data/audio/sample_10s.mp3)\", ) return parser.parse_args() In\u00a0[\u00a0]: Copied!
def main():\n \"\"\"Run performance comparison between CPU and MLX Whisper.\"\"\"\n args = parse_args()\n\n # Check if we're on Apple Silicon\n try:\n import torch\n\n has_mps = torch.backends.mps.is_built() and torch.backends.mps.is_available()\n except ImportError:\n has_mps = False\n\n try:\n import mlx_whisper\n\n has_mlx_whisper = True\n except ImportError:\n has_mlx_whisper = False\n\n print(\"ASR Pipeline Performance Comparison\")\n print(\"=\" * 50)\n print(f\"Apple Silicon (MPS) available: {has_mps}\")\n print(f\"MLX Whisper available: {has_mlx_whisper}\")\n\n if not has_mps:\n print(\"\u26a0\ufe0f Apple Silicon (MPS) not available - running CPU-only comparison\")\n print(\" For MLX Whisper performance benefits, run on Apple Silicon devices\")\n print(\" MLX Whisper is optimized for Apple Silicon devices.\")\n\n if not has_mlx_whisper:\n print(\"\u26a0\ufe0f MLX Whisper not installed - running CPU-only comparison\")\n print(\" Install with: pip install mlx-whisper\")\n print(\" Or: uv sync --extra asr\")\n print(\" For MLX Whisper performance benefits, install the dependency\")\n\n # Determine audio file path\n if args.audio:\n audio_file = Path(args.audio)\n if not audio_file.is_absolute():\n # If relative path, make it relative to the script's directory\n audio_file = Path(__file__).parent.parent.parent / audio_file\n else:\n # Use default test audio file\n audio_file = (\n Path(__file__).parent.parent.parent\n / \"tests\"\n / \"data\"\n / \"audio\"\n / \"sample_10s.mp3\"\n )\n\n if not audio_file.exists():\n print(f\"\u274c Audio file not found: {audio_file}\")\n print(\" Please check the path and try again.\")\n sys.exit(1)\n\n print(f\"Using test audio: {audio_file}\")\n print(f\"File size: {audio_file.stat().st_size / 1024:.1f} KB\")\n\n # Test different model sizes\n model_sizes = [\"tiny\", \"base\", \"turbo\"]\n results = {}\n\n for model_size in model_sizes:\n print(f\"\\n{'#' * 80}\")\n print(f\"Testing model size: {model_size}\")\n print(f\"{'#' * 80}\")\n\n model_results = {}\n\n # Test 1: Native Whisper (forced to CPU)\n cpu_options = create_cpu_whisper_options(model_size)\n cpu_duration, cpu_success = run_transcription_test(\n audio_file,\n cpu_options,\n AcceleratorDevice.CPU,\n f\"Native Whisper {model_size} (CPU)\",\n )\n model_results[\"cpu\"] = {\"duration\": cpu_duration, \"success\": cpu_success}\n\n # Test 2: MLX Whisper (Apple Silicon optimized) - only if available\n if has_mps and has_mlx_whisper:\n mlx_options = create_mlx_whisper_options(model_size)\n mlx_duration, mlx_success = run_transcription_test(\n audio_file,\n mlx_options,\n AcceleratorDevice.MPS,\n f\"MLX Whisper {model_size} (MPS)\",\n )\n model_results[\"mlx\"] = {\"duration\": mlx_duration, \"success\": mlx_success}\n else:\n print(f\"\\n{'=' * 60}\")\n print(f\"Skipping MLX Whisper {model_size} (MPS) - not available\")\n print(f\"{'=' * 60}\")\n model_results[\"mlx\"] = {\"duration\": 0.0, \"success\": False}\n\n results[model_size] = model_results\n\n # Print summary\n print(f\"\\n{'#' * 80}\")\n print(\"PERFORMANCE COMPARISON SUMMARY\")\n print(f\"{'#' * 80}\")\n print(\n f\"{'Model':<10} {'CPU (sec)':<12} {'MLX (sec)':<12} {'Speedup':<12} {'Status':<10}\"\n )\n print(\"-\" * 80)\n\n for model_size, model_results in results.items():\n cpu_duration = model_results[\"cpu\"][\"duration\"]\n mlx_duration = model_results[\"mlx\"][\"duration\"]\n cpu_success = model_results[\"cpu\"][\"success\"]\n mlx_success = model_results[\"mlx\"][\"success\"]\n\n if cpu_success and mlx_success:\n speedup = 
cpu_duration / mlx_duration\n status = \"\u2705 Both OK\"\n elif cpu_success:\n speedup = float(\"inf\")\n status = \"\u274c MLX Failed\"\n elif mlx_success:\n speedup = 0\n status = \"\u274c CPU Failed\"\n else:\n speedup = 0\n status = \"\u274c Both Failed\"\n\n print(\n f\"{model_size:<10} {cpu_duration:<12.2f} {mlx_duration:<12.2f} {speedup:<12.2f}x {status:<10}\"\n )\n\n # Calculate overall improvement\n successful_tests = [\n (r[\"cpu\"][\"duration\"], r[\"mlx\"][\"duration\"])\n for r in results.values()\n if r[\"cpu\"][\"success\"] and r[\"mlx\"][\"success\"]\n ]\n\n if successful_tests:\n avg_cpu = sum(cpu for cpu, mlx in successful_tests) / len(successful_tests)\n avg_mlx = sum(mlx for cpu, mlx in successful_tests) / len(successful_tests)\n avg_speedup = avg_cpu / avg_mlx\n\n print(\"-\" * 80)\n print(\n f\"{'AVERAGE':<10} {avg_cpu:<12.2f} {avg_mlx:<12.2f} {avg_speedup:<12.2f}x {'Overall':<10}\"\n )\n\n print(f\"\\n\ud83c\udfaf MLX Whisper provides {avg_speedup:.1f}x average speedup over CPU!\")\n else:\n if has_mps and has_mlx_whisper:\n print(\"\\n\u274c No successful comparisons available.\")\n else:\n print(\"\\n\u26a0\ufe0f MLX Whisper not available - only CPU results shown.\")\n print(\n \" Install MLX Whisper and run on Apple Silicon for performance comparison.\"\n )\n def main(): \"\"\"Run performance comparison between CPU and MLX Whisper.\"\"\" args = parse_args() # Check if we're on Apple Silicon try: import torch has_mps = torch.backends.mps.is_built() and torch.backends.mps.is_available() except ImportError: has_mps = False try: import mlx_whisper has_mlx_whisper = True except ImportError: has_mlx_whisper = False print(\"ASR Pipeline Performance Comparison\") print(\"=\" * 50) print(f\"Apple Silicon (MPS) available: {has_mps}\") print(f\"MLX Whisper available: {has_mlx_whisper}\") if not has_mps: print(\"\u26a0\ufe0f Apple Silicon (MPS) not available - running CPU-only comparison\") print(\" For MLX Whisper performance benefits, run on Apple Silicon devices\") print(\" MLX Whisper is optimized for Apple Silicon devices.\") if not has_mlx_whisper: print(\"\u26a0\ufe0f MLX Whisper not installed - running CPU-only comparison\") print(\" Install with: pip install mlx-whisper\") print(\" Or: uv sync --extra asr\") print(\" For MLX Whisper performance benefits, install the dependency\") # Determine audio file path if args.audio: audio_file = Path(args.audio) if not audio_file.is_absolute(): # If relative path, make it relative to the script's directory audio_file = Path(__file__).parent.parent.parent / audio_file else: # Use default test audio file audio_file = ( Path(__file__).parent.parent.parent / \"tests\" / \"data\" / \"audio\" / \"sample_10s.mp3\" ) if not audio_file.exists(): print(f\"\u274c Audio file not found: {audio_file}\") print(\" Please check the path and try again.\") sys.exit(1) print(f\"Using test audio: {audio_file}\") print(f\"File size: {audio_file.stat().st_size / 1024:.1f} KB\") # Test different model sizes model_sizes = [\"tiny\", \"base\", \"turbo\"] results = {} for model_size in model_sizes: print(f\"\\n{'#' * 80}\") print(f\"Testing model size: {model_size}\") print(f\"{'#' * 80}\") model_results = {} # Test 1: Native Whisper (forced to CPU) cpu_options = create_cpu_whisper_options(model_size) cpu_duration, cpu_success = run_transcription_test( audio_file, cpu_options, AcceleratorDevice.CPU, f\"Native Whisper {model_size} (CPU)\", ) model_results[\"cpu\"] = {\"duration\": cpu_duration, \"success\": cpu_success} # Test 2: MLX Whisper (Apple 
Silicon optimized) - only if available if has_mps and has_mlx_whisper: mlx_options = create_mlx_whisper_options(model_size) mlx_duration, mlx_success = run_transcription_test( audio_file, mlx_options, AcceleratorDevice.MPS, f\"MLX Whisper {model_size} (MPS)\", ) model_results[\"mlx\"] = {\"duration\": mlx_duration, \"success\": mlx_success} else: print(f\"\\n{'=' * 60}\") print(f\"Skipping MLX Whisper {model_size} (MPS) - not available\") print(f\"{'=' * 60}\") model_results[\"mlx\"] = {\"duration\": 0.0, \"success\": False} results[model_size] = model_results # Print summary print(f\"\\n{'#' * 80}\") print(\"PERFORMANCE COMPARISON SUMMARY\") print(f\"{'#' * 80}\") print( f\"{'Model':<10} {'CPU (sec)':<12} {'MLX (sec)':<12} {'Speedup':<12} {'Status':<10}\" ) print(\"-\" * 80) for model_size, model_results in results.items(): cpu_duration = model_results[\"cpu\"][\"duration\"] mlx_duration = model_results[\"mlx\"][\"duration\"] cpu_success = model_results[\"cpu\"][\"success\"] mlx_success = model_results[\"mlx\"][\"success\"] if cpu_success and mlx_success: speedup = cpu_duration / mlx_duration status = \"\u2705 Both OK\" elif cpu_success: speedup = float(\"inf\") status = \"\u274c MLX Failed\" elif mlx_success: speedup = 0 status = \"\u274c CPU Failed\" else: speedup = 0 status = \"\u274c Both Failed\" print( f\"{model_size:<10} {cpu_duration:<12.2f} {mlx_duration:<12.2f} {speedup:<12.2f}x {status:<10}\" ) # Calculate overall improvement successful_tests = [ (r[\"cpu\"][\"duration\"], r[\"mlx\"][\"duration\"]) for r in results.values() if r[\"cpu\"][\"success\"] and r[\"mlx\"][\"success\"] ] if successful_tests: avg_cpu = sum(cpu for cpu, mlx in successful_tests) / len(successful_tests) avg_mlx = sum(mlx for cpu, mlx in successful_tests) / len(successful_tests) avg_speedup = avg_cpu / avg_mlx print(\"-\" * 80) print( f\"{'AVERAGE':<10} {avg_cpu:<12.2f} {avg_mlx:<12.2f} {avg_speedup:<12.2f}x {'Overall':<10}\" ) print(f\"\\n\ud83c\udfaf MLX Whisper provides {avg_speedup:.1f}x average speedup over CPU!\") else: if has_mps and has_mlx_whisper: print(\"\\n\u274c No successful comparisons available.\") else: print(\"\\n\u26a0\ufe0f MLX Whisper not available - only CPU results shown.\") print( \" Install MLX Whisper and run on Apple Silicon for performance comparison.\" ) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/backend_csv/","title":"Conversion of CSV files","text":"In\u00a0[59]: Copied!
from pathlib import Path\n\nfrom docling.document_converter import DocumentConverter\n\n# Convert CSV to Docling document\nconverter = DocumentConverter()\nresult = converter.convert(Path(\"../../tests/data/csv/csv-comma.csv\"))\noutput = result.document.export_to_markdown()\nfrom pathlib import Path from docling.document_converter import DocumentConverter # Convert CSV to Docling document converter = DocumentConverter() result = converter.convert(Path(\"../../tests/data/csv/csv-comma.csv\")) output = result.document.export_to_markdown()
This code generates the following output:
Index Customer Id First Name Last Name Company City Country Phone 1 Phone 2 Email Subscription Date Website 1 DD37Cf93aecA6Dc Sheryl Baxter Rasmussen Group East Leonard Chile 229.077.5154 397.884.0519x718 zunigavanessa@smith.info 2020-08-24 http://www.stephenson.com/ 2 1Ef7b82A4CAAD10 Preston Lozano, Dr Vega-Gentry East Jimmychester Djibouti 5153435776 686-620-1820x944 vmata@colon.com 2021-04-23 http://www.hobbs.com/ 3 6F94879bDAfE5a6 Roy Berry Murillo-Perry Isabelborough Antigua and Barbuda +1-539-402-0259 (496)978-3969x58947 beckycarr@hogan.com 2020-03-25 http://www.lawrence.com/ 4 5Cef8BFA16c5e3c Linda Olsen Dominguez, Mcmillan and Donovan Bensonview Dominican Republic 001-808-617-6467x12895 +1-813-324-8756 stanleyblackwell@benson.org 2020-06-02 http://www.good-lyons.com/ 5 053d585Ab6b3159 Joanna Bender Martin, Lang and Andrade West Priscilla Slovakia (Slovak Republic) 001-234-203-0635x76146 001-199-446-3860x3486 colinalvarado@miles.net 2021-04-17 https://goodwin-ingram.com/"},{"location":"examples/backend_csv/#conversion-of-csv-files","title":"Conversion of CSV files\u00b6","text":"This example shows how to convert CSV files to a structured Docling Document.
Supported delimiters include: , ; | [tab]. This is an example of using Docling for converting structured data (XML) into a unified document representation format, DoclingDocument, and leveraging its rich structured content for RAG applications.
Data used in this example consist of patents from the United States Patent and Trademark Office (USPTO) and medical articles from PubMed Central\u00ae (PMC).
In this notebook, we accomplish the following:
For more details on document chunking with Docling, refer to the Chunking documentation. For RAG with Docling and LlamaIndex, also check the example RAG with LlamaIndex.
In\u00a0[1]: Copied!from docling.document_converter import DocumentConverter\n\n# a sample PMC article:\nsource = \"../../tests/data/jats/elife-56337.nxml\"\nconverter = DocumentConverter()\nresult = converter.convert(source)\nprint(result.status)\nfrom docling.document_converter import DocumentConverter # a sample PMC article: source = \"../../tests/data/jats/elife-56337.nxml\" converter = DocumentConverter() result = converter.convert(source) print(result.status)
ConversionStatus.SUCCESS\n
Once the document is converted, it can be exported to any format supported by Docling. For instance, to Markdown (only the first lines are shown here):
In\u00a0[2]: Copied!md_doc = result.document.export_to_markdown()\n\ndelim = \"\\n\"\nprint(delim.join(md_doc.split(delim)[:8]))\nmd_doc = result.document.export_to_markdown() delim = \"\\n\" print(delim.join(md_doc.split(delim)[:8]))
# KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage\n\nGernot Wolf, Alberto de Iaco, Ming-An Sun, Melania Bruno, Matthew Tinkham, Don Hoang, Apratim Mitra, Sherry Ralls, Didier Trono, Todd S Macfarlan\n\nThe Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health, Bethesda, United States; School of Life Sciences, \u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Lausanne, Switzerland\n\n## Abstract\n\n
If the XML file is not supported, a ConversionError will be raised.
from io import BytesIO\n\nfrom docling.datamodel.base_models import DocumentStream\nfrom docling.exceptions import ConversionError\n\nxml_content = (\n b'<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE docling_test SYSTEM '\n b'\"test.dtd\"><docling>Random content</docling>'\n)\nstream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content))\ntry:\n result = converter.convert(stream)\nexcept ConversionError as ce:\n print(ce)\nfrom io import BytesIO from docling.datamodel.base_models import DocumentStream from docling.exceptions import ConversionError xml_content = ( b' Random content' ) stream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content)) try: result = converter.convert(stream) except ConversionError as ce: print(ce)
Input document docling_test.xml does not match any allowed format.\n
File format not allowed: docling_test.xml\n
You can always refer to the Usage documentation page for a list of supported formats.
Requirements can be installed as shown below. The --no-warn-conflicts argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage.
%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv
Note: you may need to restart the kernel to use updated packages.\n
This notebook uses HuggingFace's Inference API. For an increased LLM quota, a token can be provided via the environment variable HF_TOKEN.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
In\u00a0[5]: Copied!import os\nfrom warnings import filterwarnings\n\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\nfilterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\nimport os from warnings import filterwarnings from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")
We can now define the main parameters:
In\u00a0[6]: Copied!from pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\nEMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\"\nEMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)\nTEMP_DIR = Path(mkdtemp())\nMILVUS_URI = str(TEMP_DIR / \"docling.db\")\nGEN_MODEL = HuggingFaceInferenceAPI(\n token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n)\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\nfrom pathlib import Path from tempfile import mkdtemp from llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI EMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\" EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID) TEMP_DIR = Path(mkdtemp()) MILVUS_URI = str(TEMP_DIR / \"docling.db\") GEN_MODEL = HuggingFaceInferenceAPI( token=_get_env_from_colab_or_os(\"HF_TOKEN\"), model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", ) embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\")) # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"
In this notebook we will use XML data from collections supported by Docling:
.tar.gz files. Each file contains the full article data in XML format, among other supplementary files like images or spreadsheets. The raw files will be downloaded from the source and saved in a temporary directory.
In\u00a0[7]: Copied!import tarfile\nfrom io import BytesIO\n\nimport requests\n\n# PMC article PMC11703268\nurl: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\"\n\nprint(f\"Downloading {url}...\")\nbuf = BytesIO(requests.get(url).content)\nprint(\"Extracting and storing the XML file containing the article text...\")\nwith tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file:\n for tarinfo in tar_file:\n if tarinfo.isreg():\n file_path = Path(tarinfo.name)\n if file_path.suffix == \".nxml\":\n with open(TEMP_DIR / file_path.name, \"wb\") as file_obj:\n file_obj.write(tar_file.extractfile(tarinfo).read())\n print(f\"Stored XML file {file_path.name}\")\n import tarfile from io import BytesIO import requests # PMC article PMC11703268 url: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\" print(f\"Downloading {url}...\") buf = BytesIO(requests.get(url).content) print(\"Extracting and storing the XML file containing the article text...\") with tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file: for tarinfo in tar_file: if tarinfo.isreg(): file_path = Path(tarinfo.name) if file_path.suffix == \".nxml\": with open(TEMP_DIR / file_path.name, \"wb\") as file_obj: file_obj.write(tar_file.extractfile(tarinfo).read()) print(f\"Stored XML file {file_path.name}\") Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz...\nExtracting and storing the XML file containing the article text...\nStored XML file nihpp-2024.12.26.630351v1.nxml\nIn\u00a0[8]: Copied!
import zipfile\n\n# Patent grants from December 17-23, 2024\nurl: str = (\n \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\"\n)\nXML_SPLITTER: str = '<?xml version=\"1.0\"'\ndoc_num: int = 0\n\nprint(f\"Downloading {url}...\")\nbuf = BytesIO(requests.get(url).content)\nprint(\"Parsing zip file, splitting into XML sections, and exporting to files...\")\nwith zipfile.ZipFile(buf) as zf:\n res = zf.testzip()\n if res:\n print(\"Error validating zip file\")\n else:\n with zf.open(zf.namelist()[0]) as xf:\n is_patent = False\n patent_buffer = BytesIO()\n for xf_line in xf:\n decoded_line = xf_line.decode(errors=\"ignore\").rstrip()\n xml_index = decoded_line.find(XML_SPLITTER)\n if xml_index != -1:\n if (\n xml_index > 0\n ): # cases like </sequence-cwu><?xml version=\"1.0\"...\n patent_buffer.write(xf_line[:xml_index])\n patent_buffer.write(b\"\\r\\n\")\n xf_line = xf_line[xml_index:]\n if patent_buffer.getbuffer().nbytes > 0 and is_patent:\n doc_num += 1\n patent_id = f\"ipg241217-{doc_num}\"\n with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj:\n file_obj.write(patent_buffer.getbuffer())\n is_patent = False\n patent_buffer = BytesIO()\n elif decoded_line.startswith(\"<!DOCTYPE\"):\n is_patent = True\n patent_buffer.write(xf_line)\n import zipfile # Patent grants from December 17-23, 2024 url: str = ( \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\" ) XML_SPLITTER: str = ' 0 ): # cases like 0 and is_patent: doc_num += 1 patent_id = f\"ipg241217-{doc_num}\" with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj: file_obj.write(patent_buffer.getbuffer()) is_patent = False patent_buffer = BytesIO() elif decoded_line.startswith(\" Downloading https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip...\nParsing zip file, splitting into XML sections, and exporting to files...\nIn\u00a0[9]: Copied!
print(f\"Fetched and exported {doc_num} documents.\")\n print(f\"Fetched and exported {doc_num} documents.\") Fetched and exported 4014 documents.\n
The DoclingDocument format of the converted patents has a rich hierarchical structure, inherited from the original XML document and preserved by the Docling custom backend. In this notebook, we will leverage:
SimpleDirectoryReader pattern to iterate over the exported XML files created in section Fetch the data.DoclingReader and DoclingNodeParser, to ingest the patent chunks into a Milvus vector store.HierarchicalChunker implementation, which applies a document-based hierarchical chunking, to leverage the patent structures like sections and paragraphs within sections.Refer to other possible implementations and usage patterns in the Chunking documentation and the RAG with LlamaIndex notebook.
In\u00a0[13]: Copied!from llama_index.core import SimpleDirectoryReader\nfrom llama_index.readers.docling import DoclingReader\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\ndir_reader = SimpleDirectoryReader(\n input_dir=TEMP_DIR,\n exclude=[\"docling.db\", \"*.nxml\"],\n file_extractor={\".xml\": reader},\n filename_as_id=True,\n num_files_limit=100,\n)\n from llama_index.core import SimpleDirectoryReader from llama_index.readers.docling import DoclingReader reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) dir_reader = SimpleDirectoryReader( input_dir=TEMP_DIR, exclude=[\"docling.db\", \"*.nxml\"], file_extractor={\".xml\": reader}, filename_as_id=True, num_files_limit=100, ) In\u00a0[14]: Copied! from llama_index.node_parser.docling import DoclingNodeParser\n\nnode_parser = DoclingNodeParser()\nfrom llama_index.node_parser.docling import DoclingNodeParser node_parser = DoclingNodeParser() In\u00a0[\u00a0]: Copied!
from llama_index.core import StorageContext, VectorStoreIndex\nfrom llama_index.vector_stores.milvus import MilvusVectorStore\n\nvector_store = MilvusVectorStore(\n uri=MILVUS_URI,\n dim=embed_dim,\n overwrite=True,\n)\n\nindex = VectorStoreIndex.from_documents(\n documents=dir_reader.load_data(show_progress=True),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n show_progress=True,\n)\nfrom llama_index.core import StorageContext, VectorStoreIndex from llama_index.vector_stores.milvus import MilvusVectorStore vector_store = MilvusVectorStore( uri=MILVUS_URI, dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=dir_reader.load_data(show_progress=True), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, show_progress=True, )
Finally, add the PMC article to the vector store directly from the reader.
In\u00a0[14]: Copied!index.from_documents(\n documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nindex.from_documents( documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) Out[14]:
<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x373a7f7d0>
The retriever can be used to identify highly relevant documents:
In\u00a0[15]: Copied!retriever = index.as_retriever(similarity_top_k=3)\nresults = retriever.retrieve(\"What patents are related to fitness devices?\")\n\nfor item in results:\n print(item)\nretriever = index.as_retriever(similarity_top_k=3) results = retriever.retrieve(\"What patents are related to fitness devices?\") for item in results: print(item)
Node ID: 5afd36c0-a739-4a88-a51c-6d0f75358db5\nText: The portable fitness monitoring device 102 may be a device such\nas, for example, a mobile phone, a personal digital assistant, a music\nfile player (e.g. and MP3 player), an intelligent article for wearing\n(e.g. a fitness monitoring garment, wrist band, or watch), a dongle\n(e.g. a small hardware device that protects software) that includes a\nfitn...\nScore: 0.772\n\nNode ID: f294b5fd-9089-43cb-8c4e-d1095a634ff1\nText: US Patent Application US 20120071306 entitled \u201cPortable\nMultipurpose Whole Body Exercise Device\u201d discloses a portable\nmultipurpose whole body exercise device which can be used for general\nfitness, Pilates-type, core strengthening, therapeutic, and\nrehabilitative exercises as well as stretching and physical therapy\nand which includes storable acc...\nScore: 0.749\n\nNode ID: 8251c7ef-1165-42e1-8c91-c99c8a711bf7\nText: Program products, methods, and systems for providing fitness\nmonitoring services of the present invention can include any software\napplication executed by one or more computing devices. A computing\ndevice can be any type of computing device having one or more\nprocessors. For example, a computing device can be a workstation,\nmobile device (e.g., ...\nScore: 0.744\n\n
With the query engine, we can run question answering with the RAG pattern over the set of indexed documents.
First, we can prompt the LLM directly:
In\u00a0[16]: Copied!from llama_index.core.base.llms.types import ChatMessage, MessageRole\nfrom rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console()\nquery = \"Do mosquitoes in high altitude expand viruses over large distances?\"\n\nusr_msg = ChatMessage(role=MessageRole.USER, content=query)\nresponse = GEN_MODEL.chat(messages=[usr_msg])\n\nconsole.print(Panel(query, title=\"Prompt\", border_style=\"bold red\"))\nconsole.print(\n Panel(\n response.message.content.strip(),\n title=\"Generated Content\",\n border_style=\"bold green\",\n )\n)\nfrom llama_index.core.base.llms.types import ChatMessage, MessageRole from rich.console import Console from rich.panel import Panel console = Console() query = \"Do mosquitoes in high altitude expand viruses over large distances?\" usr_msg = ChatMessage(role=MessageRole.USER, content=query) response = GEN_MODEL.chat(messages=[usr_msg]) console.print(Panel(query, title=\"Prompt\", border_style=\"bold red\")) console.print( Panel( response.message.content.strip(), title=\"Generated Content\", border_style=\"bold green\", ) )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Do mosquitoes in high altitude expand viruses over large distances? \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not \u2502\n\u2502 primarily dependent on altitude. Mosquitoes are vectors for various diseases, such as malaria, dengue fever, \u2502\n\u2502 and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, \u2502\n\u2502 and environmental conditions that support their survival and reproduction. \u2502\n\u2502 \u2502\n\u2502 At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder \u2502\n\u2502 temperatures, lower humidity, and stronger winds, which can limit their population size and distribution. \u2502\n\u2502 However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases \u2502\n\u2502 in these areas. \u2502\n\u2502 \u2502\n\u2502 It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is \u2502\n\u2502 not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance \u2502\n\u2502 transmission of viruses is more often associated with human travel and transportation, which can rapidly spread \u2502\n\u2502 infected mosquitoes or humans to new areas, leading to the spread of disease. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Now, we can compare the response when the model is prompted with the indexed PMC article as supporting context:
In\u00a0[17]: Copied!from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n\nfilters = MetadataFilters(\n filters=[\n ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"),\n ]\n)\n\nquery_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3)\nresult = query_engine.query(query)\n\nconsole.print(\n Panel(\n result.response.strip(),\n title=\"Generated Content with RAG\",\n border_style=\"bold green\",\n )\n)\nfrom llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters filters = MetadataFilters( filters=[ ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"), ] ) query_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3) result = query_engine.query(query) console.print( Panel( result.response.strip(), title=\"Generated Content with RAG\", border_style=\"bold green\", ) )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content with RAG \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female \u2502\n\u2502 mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with \u2502\n\u2502 arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with \u2502\n\u2502 flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, \u2502\n\u2502 including three arboviruses that affect humans (dengue, West Nile, and M\u2019Poko viruses). The study provides \u2502\n\u2502 compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/backend_xml_rag/#conversion-of-custom-xml","title":"Conversion of custom XML\u00b6","text":""},{"location":"examples/backend_xml_rag/#overview","title":"Overview\u00b6","text":""},{"location":"examples/backend_xml_rag/#simple-conversion","title":"Simple conversion\u00b6","text":"
XML is a file format that defines and stores data in a way that is both human-readable and machine-readable. Because each XML schema defines its own structure, Docling requires custom backend processors to interpret the XML definitions and convert them into DoclingDocument objects.
Some public data collections in XML format are already supported by Docling (USPTO patents and PMC articles). In these cases, the document conversion is straightforward and the same as with any other supported format, such as PDF or HTML. The execution example in Simple Conversion is the recommended usage of Docling for a single file:
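For instance, a minimal sketch for a single USPTO patent file (the file path below is hypothetical; any supported XML collection can be passed the same way):

from docling.document_converter import DocumentConverter

# Hypothetical path to a single USPTO patent grant extracted as XML;
# the converter auto-detects the format and picks the matching backend.
source = "path/to/ipg-patent.xml"

converter = DocumentConverter()
result = converter.convert(source)
print(result.status)
print(result.document.export_to_markdown()[:500])  # preview the first characters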
"},{"location":"examples/backend_xml_rag/#end-to-end-application","title":"End-to-end application\u00b6","text":"This section describes a step-by-step application for processing XML files from supported public collections and use them for question-answering.
"},{"location":"examples/backend_xml_rag/#setup","title":"Setup\u00b6","text":""},{"location":"examples/backend_xml_rag/#fetch-the-data","title":"Fetch the data\u00b6","text":""},{"location":"examples/backend_xml_rag/#pmc-articles","title":"PMC articles\u00b6","text":"The OA file is a manifest file of all the PMC articles, including the URL path to download the source files. In this notebook we will use as example the article Pathogens spread by high-altitude windborne mosquitoes, which is available in the archive file PMC11703268.tar.gz.
"},{"location":"examples/backend_xml_rag/#uspto-patents","title":"USPTO patents\u00b6","text":"Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, split its content in sections, and dumps each section as an XML file. For simplicity, this pipeline is shown here in a sequential manner, but it could be parallelized.
"},{"location":"examples/backend_xml_rag/#parse-chunk-and-index","title":"Parse, chunk, and index\u00b6","text":""},{"location":"examples/backend_xml_rag/#set-the-docling-reader-and-the-directory-reader","title":"Set the Docling reader and the directory reader\u00b6","text":"Note that DoclingReader uses Docling's DocumentConverter by default and therefore it will recognize the format of the XML files and leverage the PatentUsptoDocumentBackend automatically.
For demonstration purposes, we limit the scope of the analysis to the first 100 patents.
"},{"location":"examples/backend_xml_rag/#set-the-node-parser","title":"Set the node parser\u00b6","text":"Note that the HierarchicalChunker is the default chunking implementation of the DoclingNodeParser.
Batch convert multiple PDF files and export results in several formats.
What this example does
scratch/ in multiple formats (JSON, HTML, Markdown, text, doctags, YAML).Prerequisites
docling from your Python environment.Input documents
tests/data/pdf/ in the repo.input_doc_paths below to point to PDFs on your machine.Output formats (controlled by flags)
USE_V2 = True enables the current Docling document exports (recommended).USE_LEGACY = False keeps legacy Deep Search exports disabled. You can set it to True if you need legacy formats for compatibility tests.Notes
pipeline_options.generate_page_images = True to include page images in HTML.import json\nimport logging\nimport time\nfrom collections.abc import Iterable\nfrom pathlib import Path\n\nimport yaml\nfrom docling_core.types.doc import ImageRefMode\n\nfrom docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\n# Export toggles:\n# - USE_V2 controls modern Docling document exports.\n# - USE_LEGACY enables legacy Deep Search exports for comparison or migration.\nUSE_V2 = True\nUSE_LEGACY = False\n\n\ndef export_documents(\n conv_results: Iterable[ConversionResult],\n output_dir: Path,\n):\n output_dir.mkdir(parents=True, exist_ok=True)\n\n success_count = 0\n failure_count = 0\n partial_success_count = 0\n\n for conv_res in conv_results:\n if conv_res.status == ConversionStatus.SUCCESS:\n success_count += 1\n doc_filename = conv_res.input.file.stem\n\n if USE_V2:\n # Recommended modern Docling exports. These helpers mirror the\n # lower-level \"export_to_*\" methods used below, but handle\n # common details like image handling.\n conv_res.document.save_as_json(\n output_dir / f\"{doc_filename}.json\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n conv_res.document.save_as_html(\n output_dir / f\"{doc_filename}.html\",\n image_mode=ImageRefMode.EMBEDDED,\n )\n conv_res.document.save_as_doctags(\n output_dir / f\"{doc_filename}.doctags.txt\"\n )\n conv_res.document.save_as_markdown(\n output_dir / f\"{doc_filename}.md\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n conv_res.document.save_as_markdown(\n output_dir / f\"{doc_filename}.txt\",\n image_mode=ImageRefMode.PLACEHOLDER,\n strict_text=True,\n )\n\n # Export Docling document format to YAML:\n with (output_dir / f\"{doc_filename}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(conv_res.document.export_to_dict()))\n\n # Export Docling document format to doctags:\n with (output_dir / f\"{doc_filename}.doctags.txt\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_doctags())\n\n # Export Docling document format to markdown:\n with (output_dir / f\"{doc_filename}.md\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_markdown())\n\n # Export Docling document format to text:\n with (output_dir / f\"{doc_filename}.txt\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_markdown(strict_text=True))\n\n if USE_LEGACY:\n # Export Deep Search document JSON format:\n with (output_dir / f\"{doc_filename}.legacy.json\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(json.dumps(conv_res.legacy_document.export_to_dict()))\n\n # Export Text format:\n with (output_dir / f\"{doc_filename}.legacy.txt\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(\n conv_res.legacy_document.export_to_markdown(strict_text=True)\n )\n\n # Export Markdown format:\n with (output_dir / f\"{doc_filename}.legacy.md\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(conv_res.legacy_document.export_to_markdown())\n\n # Export Document Tags format:\n with (output_dir / f\"{doc_filename}.legacy.doctags.txt\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(conv_res.legacy_document.export_to_document_tokens())\n\n elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS:\n _log.info(\n 
f\"Document {conv_res.input.file} was partially converted with the following errors:\"\n )\n for item in conv_res.errors:\n _log.info(f\"\\t{item.error_message}\")\n partial_success_count += 1\n else:\n _log.info(f\"Document {conv_res.input.file} failed to convert.\")\n failure_count += 1\n\n _log.info(\n f\"Processed {success_count + partial_success_count + failure_count} docs, \"\n f\"of which {failure_count} failed \"\n f\"and {partial_success_count} were partially converted.\"\n )\n return success_count, partial_success_count, failure_count\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n # Location of sample PDFs used by this example. If your checkout does not\n # include test data, change `data_folder` or point `input_doc_paths` to\n # your own files.\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_paths = [\n data_folder / \"pdf/2206.01062.pdf\",\n data_folder / \"pdf/2203.01017v2.pdf\",\n data_folder / \"pdf/2305.03393v1.pdf\",\n data_folder / \"pdf/redp5110_sampled.pdf\",\n ]\n\n # buf = BytesIO((data_folder / \"pdf/2206.01062.pdf\").open(\"rb\").read())\n # docs = [DocumentStream(name=\"my_doc.pdf\", stream=buf)]\n # input = DocumentConversionInput.from_streams(docs)\n\n # # Turn on inline debug visualizations:\n # settings.debug.visualize_layout = True\n # settings.debug.visualize_ocr = True\n # settings.debug.visualize_tables = True\n # settings.debug.visualize_cells = True\n\n # Configure the PDF pipeline. Enabling page image generation improves HTML\n # previews (embedded images) but adds processing time.\n pipeline_options = PdfPipelineOptions()\n pipeline_options.generate_page_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options, backend=DoclingParseV4DocumentBackend\n )\n }\n )\n\n start_time = time.time()\n\n # Convert all inputs. Set `raises_on_error=False` to keep processing other\n # files even if one fails; errors are summarized after the run.\n conv_results = doc_converter.convert_all(\n input_doc_paths,\n raises_on_error=False, # to let conversion run through all and examine results at the end\n )\n # Write outputs to ./scratch and log a summary.\n _success_count, _partial_success_count, failure_count = export_documents(\n conv_results, output_dir=Path(\"scratch\")\n )\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document conversion complete in {end_time:.2f} seconds.\")\n\n if failure_count > 0:\n raise RuntimeError(\n f\"The example failed converting {failure_count} on {len(input_doc_paths)}.\"\n )\n\n\nif __name__ == \"__main__\":\n main()\n import json import logging import time from collections.abc import Iterable from pathlib import Path import yaml from docling_core.types.doc import ImageRefMode from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) # Export toggles: # - USE_V2 controls modern Docling document exports. # - USE_LEGACY enables legacy Deep Search exports for comparison or migration. 
USE_V2 = True USE_LEGACY = False def export_documents( conv_results: Iterable[ConversionResult], output_dir: Path, ): output_dir.mkdir(parents=True, exist_ok=True) success_count = 0 failure_count = 0 partial_success_count = 0 for conv_res in conv_results: if conv_res.status == ConversionStatus.SUCCESS: success_count += 1 doc_filename = conv_res.input.file.stem if USE_V2: # Recommended modern Docling exports. These helpers mirror the # lower-level \"export_to_*\" methods used below, but handle # common details like image handling. conv_res.document.save_as_json( output_dir / f\"{doc_filename}.json\", image_mode=ImageRefMode.PLACEHOLDER, ) conv_res.document.save_as_html( output_dir / f\"{doc_filename}.html\", image_mode=ImageRefMode.EMBEDDED, ) conv_res.document.save_as_doctags( output_dir / f\"{doc_filename}.doctags.txt\" ) conv_res.document.save_as_markdown( output_dir / f\"{doc_filename}.md\", image_mode=ImageRefMode.PLACEHOLDER, ) conv_res.document.save_as_markdown( output_dir / f\"{doc_filename}.txt\", image_mode=ImageRefMode.PLACEHOLDER, strict_text=True, ) # Export Docling document format to YAML: with (output_dir / f\"{doc_filename}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(conv_res.document.export_to_dict())) # Export Docling document format to doctags: with (output_dir / f\"{doc_filename}.doctags.txt\").open(\"w\") as fp: fp.write(conv_res.document.export_to_doctags()) # Export Docling document format to markdown: with (output_dir / f\"{doc_filename}.md\").open(\"w\") as fp: fp.write(conv_res.document.export_to_markdown()) # Export Docling document format to text: with (output_dir / f\"{doc_filename}.txt\").open(\"w\") as fp: fp.write(conv_res.document.export_to_markdown(strict_text=True)) if USE_LEGACY: # Export Deep Search document JSON format: with (output_dir / f\"{doc_filename}.legacy.json\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(json.dumps(conv_res.legacy_document.export_to_dict())) # Export Text format: with (output_dir / f\"{doc_filename}.legacy.txt\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write( conv_res.legacy_document.export_to_markdown(strict_text=True) ) # Export Markdown format: with (output_dir / f\"{doc_filename}.legacy.md\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(conv_res.legacy_document.export_to_markdown()) # Export Document Tags format: with (output_dir / f\"{doc_filename}.legacy.doctags.txt\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(conv_res.legacy_document.export_to_document_tokens()) elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS: _log.info( f\"Document {conv_res.input.file} was partially converted with the following errors:\" ) for item in conv_res.errors: _log.info(f\"\\t{item.error_message}\") partial_success_count += 1 else: _log.info(f\"Document {conv_res.input.file} failed to convert.\") failure_count += 1 _log.info( f\"Processed {success_count + partial_success_count + failure_count} docs, \" f\"of which {failure_count} failed \" f\"and {partial_success_count} were partially converted.\" ) return success_count, partial_success_count, failure_count def main(): logging.basicConfig(level=logging.INFO) # Location of sample PDFs used by this example. If your checkout does not # include test data, change `data_folder` or point `input_doc_paths` to # your own files. 
data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_paths = [ data_folder / \"pdf/2206.01062.pdf\", data_folder / \"pdf/2203.01017v2.pdf\", data_folder / \"pdf/2305.03393v1.pdf\", data_folder / \"pdf/redp5110_sampled.pdf\", ] # buf = BytesIO((data_folder / \"pdf/2206.01062.pdf\").open(\"rb\").read()) # docs = [DocumentStream(name=\"my_doc.pdf\", stream=buf)] # input = DocumentConversionInput.from_streams(docs) # # Turn on inline debug visualizations: # settings.debug.visualize_layout = True # settings.debug.visualize_ocr = True # settings.debug.visualize_tables = True # settings.debug.visualize_cells = True # Configure the PDF pipeline. Enabling page image generation improves HTML # previews (embedded images) but adds processing time. pipeline_options = PdfPipelineOptions() pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, backend=DoclingParseV4DocumentBackend ) } ) start_time = time.time() # Convert all inputs. Set `raises_on_error=False` to keep processing other # files even if one fails; errors are summarized after the run. conv_results = doc_converter.convert_all( input_doc_paths, raises_on_error=False, # to let conversion run through all and examine results at the end ) # Write outputs to ./scratch and log a summary. _success_count, _partial_success_count, failure_count = export_documents( conv_results, output_dir=Path(\"scratch\") ) end_time = time.time() - start_time _log.info(f\"Document conversion complete in {end_time:.2f} seconds.\") if failure_count > 0: raise RuntimeError( f\"The example failed converting {failure_count} on {len(input_doc_paths)}.\" ) if __name__ == \"__main__\": main()"},{"location":"examples/compare_vlm_models/","title":"VLM comparison","text":"Compare different VLM models by running the VLM pipeline and timing outputs.
What this example does
scratch/.Requirements
tabulate for pretty printing (pip install tabulate).Prerequisites
How to run
python docs/examples/compare_vlm_models.py.scratch/ with filenames including the model and framework.Notes
import json\nimport sys\nimport time\nfrom pathlib import Path\n\nfrom docling_core.types.doc import DocItemLabel, ImageRefMode\nfrom docling_core.types.doc.document import DEFAULT_EXPORT_LABELS\nfrom tabulate import tabulate\n\nfrom docling.datamodel import vlm_model_specs\nfrom docling.datamodel.accelerator_options import AcceleratorDevice\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import (\n InferenceFramework,\n InlineVlmOptions,\n ResponseFormat,\n TransformersModelType,\n TransformersPromptStyle,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\n\ndef convert(sources: list[Path], converter: DocumentConverter):\n # Note: this helper assumes a single-item `sources` list. It returns after\n # processing the first source to keep runtime/output focused.\n model_id = pipeline_options.vlm_options.repo_id.replace(\"/\", \"_\")\n framework = pipeline_options.vlm_options.inference_framework\n for source in sources:\n print(\"================================================\")\n print(\"Processing...\")\n print(f\"Source: {source}\")\n print(\"---\")\n print(f\"Model: {model_id}\")\n print(f\"Framework: {framework}\")\n print(\"================================================\")\n print(\"\")\n\n res = converter.convert(source)\n\n print(\"\")\n\n fname = f\"{res.input.file.stem}-{model_id}-{framework}\"\n\n inference_time = 0.0\n for i, page in enumerate(res.pages):\n inference_time += page.predictions.vlm_response.generation_time\n print(\"\")\n print(\n f\" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:\"\n )\n print(page.predictions.vlm_response.text)\n print(\" ---------- \")\n\n print(\"===== Final output of the converted document =======\")\n\n # Manual export for illustration. Below, `save_as_json()` writes the same\n # JSON again; kept intentionally to show both approaches.\n with (out_path / f\"{fname}.json\").open(\"w\") as fp:\n fp.write(json.dumps(res.document.export_to_dict()))\n\n res.document.save_as_json(\n out_path / f\"{fname}.json\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n print(f\" => produced {out_path / fname}.json\")\n\n res.document.save_as_markdown(\n out_path / f\"{fname}.md\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n print(f\" => produced {out_path / fname}.md\")\n\n res.document.save_as_html(\n out_path / f\"{fname}.html\",\n image_mode=ImageRefMode.EMBEDDED,\n labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE],\n split_page_view=True,\n )\n print(f\" => produced {out_path / fname}.html\")\n\n pg_num = res.document.num_pages()\n print(\"\")\n print(\n f\"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}\"\n )\n print(\"====================================================\")\n\n return [\n source,\n model_id,\n str(framework),\n pg_num,\n inference_time,\n ]\n\n\nif __name__ == \"__main__\":\n sources = [\n \"tests/data/pdf/2305.03393v1-pg9.pdf\",\n ]\n\n out_path = Path(\"scratch\")\n out_path.mkdir(parents=True, exist_ok=True)\n\n ## Definiton of more inline models\n llava_qwen = InlineVlmOptions(\n repo_id=\"llava-hf/llava-interleave-qwen-0.5b-hf\",\n # prompt=\"Read text in the image.\",\n prompt=\"Convert this page to markdown. 
Do not miss any text and only output the bare markdown!\",\n # prompt=\"Parse the reading order of this document.\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,\n supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU],\n scale=2.0,\n temperature=0.0,\n )\n\n # Note that this is not the expected way of using the Dolphin model, but it shows the usage of a raw prompt.\n dolphin_oneshot = InlineVlmOptions(\n repo_id=\"ByteDance/Dolphin\",\n prompt=\"<s>Read text in the image. <Answer/>\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,\n transformers_prompt_style=TransformersPromptStyle.RAW,\n supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU],\n scale=2.0,\n temperature=0.0,\n )\n\n ## Use VlmPipeline\n pipeline_options = VlmPipelineOptions()\n pipeline_options.generate_page_images = True\n\n ## On GPU systems, enable flash_attention_2 with CUDA:\n # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA\n # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True\n\n vlm_models = [\n ## DocTags / SmolDocling models\n vlm_model_specs.SMOLDOCLING_MLX,\n vlm_model_specs.SMOLDOCLING_TRANSFORMERS,\n ## Markdown models (using MLX framework)\n vlm_model_specs.QWEN25_VL_3B_MLX,\n vlm_model_specs.PIXTRAL_12B_MLX,\n vlm_model_specs.GEMMA3_12B_MLX,\n ## Markdown models (using Transformers framework)\n vlm_model_specs.GRANITE_VISION_TRANSFORMERS,\n vlm_model_specs.PHI4_TRANSFORMERS,\n vlm_model_specs.PIXTRAL_12B_TRANSFORMERS,\n ## More inline models\n dolphin_oneshot,\n llava_qwen,\n ]\n\n # Remove MLX models if not on Mac\n if sys.platform != \"darwin\":\n vlm_models = [\n m for m in vlm_models if m.inference_framework != InferenceFramework.MLX\n ]\n\n rows = []\n for vlm_options in vlm_models:\n pipeline_options.vlm_options = vlm_options\n\n ## Set up pipeline for PDF or image inputs\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n InputFormat.IMAGE: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n },\n )\n\n row = convert(sources=sources, converter=converter)\n rows.append(row)\n\n print(\n tabulate(\n rows, headers=[\"source\", \"model_id\", \"framework\", \"num_pages\", \"time\"]\n )\n )\n\n print(\"see if memory gets released ...\")\n time.sleep(10)\n import json import sys import time from pathlib import Path from docling_core.types.doc import DocItemLabel, ImageRefMode from docling_core.types.doc.document import DEFAULT_EXPORT_LABELS from tabulate import tabulate from docling.datamodel import vlm_model_specs from docling.datamodel.accelerator_options import AcceleratorDevice from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ( InferenceFramework, InlineVlmOptions, ResponseFormat, TransformersModelType, TransformersPromptStyle, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline def convert(sources: list[Path], converter: DocumentConverter): # Note: this helper assumes a single-item `sources` list. 
It returns after # processing the first source to keep runtime/output focused. model_id = pipeline_options.vlm_options.repo_id.replace(\"/\", \"_\") framework = pipeline_options.vlm_options.inference_framework for source in sources: print(\"================================================\") print(\"Processing...\") print(f\"Source: {source}\") print(\"---\") print(f\"Model: {model_id}\") print(f\"Framework: {framework}\") print(\"================================================\") print(\"\") res = converter.convert(source) print(\"\") fname = f\"{res.input.file.stem}-{model_id}-{framework}\" inference_time = 0.0 for i, page in enumerate(res.pages): inference_time += page.predictions.vlm_response.generation_time print(\"\") print( f\" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:\" ) print(page.predictions.vlm_response.text) print(\" ---------- \") print(\"===== Final output of the converted document =======\") # Manual export for illustration. Below, `save_as_json()` writes the same # JSON again; kept intentionally to show both approaches. with (out_path / f\"{fname}.json\").open(\"w\") as fp: fp.write(json.dumps(res.document.export_to_dict())) res.document.save_as_json( out_path / f\"{fname}.json\", image_mode=ImageRefMode.PLACEHOLDER, ) print(f\" => produced {out_path / fname}.json\") res.document.save_as_markdown( out_path / f\"{fname}.md\", image_mode=ImageRefMode.PLACEHOLDER, ) print(f\" => produced {out_path / fname}.md\") res.document.save_as_html( out_path / f\"{fname}.html\", image_mode=ImageRefMode.EMBEDDED, labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE], split_page_view=True, ) print(f\" => produced {out_path / fname}.html\") pg_num = res.document.num_pages() print(\"\") print( f\"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}\" ) print(\"====================================================\") return [ source, model_id, str(framework), pg_num, inference_time, ] if __name__ == \"__main__\": sources = [ \"tests/data/pdf/2305.03393v1-pg9.pdf\", ] out_path = Path(\"scratch\") out_path.mkdir(parents=True, exist_ok=True) ## Definiton of more inline models llava_qwen = InlineVlmOptions( repo_id=\"llava-hf/llava-interleave-qwen-0.5b-hf\", # prompt=\"Read text in the image.\", prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\", # prompt=\"Parse the reading order of this document.\", response_format=ResponseFormat.MARKDOWN, inference_framework=InferenceFramework.TRANSFORMERS, transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT, supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU], scale=2.0, temperature=0.0, ) # Note that this is not the expected way of using the Dolphin model, but it shows the usage of a raw prompt. dolphin_oneshot = InlineVlmOptions( repo_id=\"ByteDance/Dolphin\", prompt=\"Read text in the image. 
\", response_format=ResponseFormat.MARKDOWN, inference_framework=InferenceFramework.TRANSFORMERS, transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT, transformers_prompt_style=TransformersPromptStyle.RAW, supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU], scale=2.0, temperature=0.0, ) ## Use VlmPipeline pipeline_options = VlmPipelineOptions() pipeline_options.generate_page_images = True ## On GPU systems, enable flash_attention_2 with CUDA: # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True vlm_models = [ ## DocTags / SmolDocling models vlm_model_specs.SMOLDOCLING_MLX, vlm_model_specs.SMOLDOCLING_TRANSFORMERS, ## Markdown models (using MLX framework) vlm_model_specs.QWEN25_VL_3B_MLX, vlm_model_specs.PIXTRAL_12B_MLX, vlm_model_specs.GEMMA3_12B_MLX, ## Markdown models (using Transformers framework) vlm_model_specs.GRANITE_VISION_TRANSFORMERS, vlm_model_specs.PHI4_TRANSFORMERS, vlm_model_specs.PIXTRAL_12B_TRANSFORMERS, ## More inline models dolphin_oneshot, llava_qwen, ] # Remove MLX models if not on Mac if sys.platform != \"darwin\": vlm_models = [ m for m in vlm_models if m.inference_framework != InferenceFramework.MLX ] rows = [] for vlm_options in vlm_models: pipeline_options.vlm_options = vlm_options ## Set up pipeline for PDF or image inputs converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), InputFormat.IMAGE: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), }, ) row = convert(sources=sources, converter=converter) rows.append(row) print( tabulate( rows, headers=[\"source\", \"model_id\", \"framework\", \"num_pages\", \"time\"] ) ) print(\"see if memory gets released ...\") time.sleep(10)"},{"location":"examples/custom_convert/","title":"Custom conversion","text":"Customize PDF conversion by toggling OCR/backends and pipeline options.
What this example does
scratch/.Prerequisites
docling from your Python environment.How to run
python docs/examples/custom_convert.py.scratch/ next to where you run the script.Choosing a configuration
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackendfrom docling.datamodel.pipeline_options import TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptionsInput document
tests/data/pdf/ in the repo.input_doc_path to a local PDF.Notes
pipeline_options.ocr_options.lang (e.g., [\"en\"], [\"es\"], [\"en\", \"de\"]).AcceleratorOptions to select CPU/GPU or threads.scratch/.import json\nimport logging\nimport time\nfrom pathlib import Path\n\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n ###########################################################################\n\n # The sections below demo combinations of PdfPipelineOptions and backends.\n # Tip: Uncomment exactly one section at a time to compare outputs.\n\n # PyPdfium without EasyOCR\n # --------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = False\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = False\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(\n # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend\n # )\n # }\n # )\n\n # PyPdfium with EasyOCR\n # -----------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(\n # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend\n # )\n # }\n # )\n\n # Docling Parse without EasyOCR\n # -------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = False\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with EasyOCR (default)\n # -------------------------------\n # Enables OCR and table structure with EasyOCR, using automatic device\n # selection via AcceleratorOptions. 
Adjust languages as needed.\n pipeline_options = PdfPipelineOptions()\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n pipeline_options.ocr_options.lang = [\"es\"]\n pipeline_options.accelerator_options = AcceleratorOptions(\n num_threads=4, device=AcceleratorDevice.AUTO\n )\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n # Docling Parse with EasyOCR (CPU only)\n # -------------------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.ocr_options.use_gpu = False # <-- set this.\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with Tesseract\n # ----------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = TesseractOcrOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with Tesseract CLI\n # --------------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = TesseractCliOcrOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with ocrmac (macOS only)\n # --------------------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = OcrMacOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n ###########################################################################\n\n start_time = time.time()\n conv_result = doc_converter.convert(input_doc_path)\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted in {end_time:.2f} seconds.\")\n\n ## Export results\n output_dir = Path(\"scratch\")\n output_dir.mkdir(parents=True, exist_ok=True)\n doc_filename = conv_result.input.file.stem\n\n # Export Docling document JSON format:\n with (output_dir / f\"{doc_filename}.json\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(json.dumps(conv_result.document.export_to_dict()))\n\n # Export Text format (plain text via Markdown export):\n with (output_dir / f\"{doc_filename}.txt\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_markdown(strict_text=True))\n\n # Export Markdown format:\n with (output_dir / f\"{doc_filename}.md\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_markdown())\n\n # Export Document Tags format:\n with (output_dir / f\"{doc_filename}.doctags\").open(\"w\", encoding=\"utf-8\") as fp:\n 
fp.write(conv_result.document.export_to_doctags())\n\n\nif __name__ == \"__main__\":\n main()\n import json import logging import time from pathlib import Path from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" ########################################################################### # The sections below demo combinations of PdfPipelineOptions and backends. # Tip: Uncomment exactly one section at a time to compare outputs. # PyPdfium without EasyOCR # -------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = False # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = False # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption( # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend # ) # } # ) # PyPdfium with EasyOCR # ----------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption( # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend # ) # } # ) # Docling Parse without EasyOCR # ------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = False # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with EasyOCR (default) # ------------------------------- # Enables OCR and table structure with EasyOCR, using automatic device # selection via AcceleratorOptions. Adjust languages as needed. pipeline_options = PdfPipelineOptions() pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True pipeline_options.ocr_options.lang = [\"es\"] pipeline_options.accelerator_options = AcceleratorOptions( num_threads=4, device=AcceleratorDevice.AUTO ) doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) # Docling Parse with EasyOCR (CPU only) # ------------------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.ocr_options.use_gpu = False # <-- set this. 
# pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with Tesseract # ---------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = TesseractOcrOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with Tesseract CLI # -------------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = TesseractCliOcrOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with ocrmac (macOS only) # -------------------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = OcrMacOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) ########################################################################### start_time = time.time() conv_result = doc_converter.convert(input_doc_path) end_time = time.time() - start_time _log.info(f\"Document converted in {end_time:.2f} seconds.\") ## Export results output_dir = Path(\"scratch\") output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_result.input.file.stem # Export Docling document JSON format: with (output_dir / f\"{doc_filename}.json\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(json.dumps(conv_result.document.export_to_dict())) # Export Text format (plain text via Markdown export): with (output_dir / f\"{doc_filename}.txt\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_markdown(strict_text=True)) # Export Markdown format: with (output_dir / f\"{doc_filename}.md\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_markdown()) # Export Document Tags format: with (output_dir / f\"{doc_filename}.doctags\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_doctags()) if __name__ == \"__main__\": main()"},{"location":"examples/demo_layout_vlm/","title":"Demo layout vlm","text":"In\u00a0[\u00a0]: Copied! \"\"\"Demo script for the new ThreadedLayoutVlmPipeline.\n\nThis script demonstrates the usage of the experimental ThreadedLayoutVlmPipeline pipeline\nthat combines layout model preprocessing with VLM processing in a threaded manner.\n\"\"\"\n\"\"\"Demo script for the new ThreadedLayoutVlmPipeline. This script demonstrates the usage of the experimental ThreadedLayoutVlmPipeline pipeline that combines layout model preprocessing with VLM processing in a threaded manner. \"\"\" In\u00a0[\u00a0]: Copied!
import argparse\nimport logging\nimport traceback\nfrom pathlib import Path\nimport argparse import logging import traceback from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.datamodel.vlm_model_specs import GRANITEDOCLING_TRANSFORMERS\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.experimental.datamodel.threaded_layout_vlm_pipeline_options import (\n ThreadedLayoutVlmPipelineOptions,\n)\nfrom docling.experimental.pipeline.threaded_layout_vlm_pipeline import (\n ThreadedLayoutVlmPipeline,\n)\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.datamodel.vlm_model_specs import GRANITEDOCLING_TRANSFORMERS from docling.document_converter import DocumentConverter, PdfFormatOption from docling.experimental.datamodel.threaded_layout_vlm_pipeline_options import ( ThreadedLayoutVlmPipelineOptions, ) from docling.experimental.pipeline.threaded_layout_vlm_pipeline import ( ThreadedLayoutVlmPipeline, ) In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def _parse_args():\n parser = argparse.ArgumentParser(\n description=\"Demo script for the experimental ThreadedLayoutVlmPipeline\"\n )\n parser.add_argument(\n \"--input-file\",\n type=str,\n default=\"tests/data/pdf/code_and_formula.pdf\",\n help=\"Path to a PDF file\",\n )\n parser.add_argument(\n \"--output-dir\",\n type=str,\n default=\"scratch/demo_layout_vlm/\",\n help=\"Output directory for converted files\",\n )\n return parser.parse_args()\ndef _parse_args(): parser = argparse.ArgumentParser( description=\"Demo script for the experimental ThreadedLayoutVlmPipeline\" ) parser.add_argument( \"--input-file\", type=str, default=\"tests/data/pdf/code_and_formula.pdf\", help=\"Path to a PDF file\", ) parser.add_argument( \"--output-dir\", type=str, default=\"scratch/demo_layout_vlm/\", help=\"Output directory for converted files\", ) return parser.parse_args()
Can be used to read multiple PDF files under a folder: from io import BytesIO\n\nfrom docling.datamodel.base_models import DocumentStream\n\n\ndef _get_docs(input_doc_path):\n    \"\"\"Yield DocumentStream objects from a list of input document paths.\"\"\"\n    for path in input_doc_path:\n        buf = BytesIO(path.read_bytes())\n        stream = DocumentStream(name=path.name, stream=buf)\n        yield stream\n
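A possible usage sketch: the streams yielded by _get_docs can be fed to DocumentConverter.convert_all. The input folder path and the converter variable below are placeholders standing in for your own setup, not values taken from this script. from pathlib import Path\n\n# Hypothetical input folder with several PDFs; adjust to your data.\npdf_paths = sorted(Path(\"tests/data/pdf\").glob(\"*.pdf\"))\n\n# `converter` is assumed to be a DocumentConverter configured as elsewhere in this script.\nfor result in converter.convert_all(_get_docs(pdf_paths), raises_on_error=False):\n    print(result.input.file.name, result.status)\n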
In\u00a0[\u00a0]: Copied!def openai_compatible_vlm_options(\n model: str,\n prompt: str,\n format: ResponseFormat,\n hostname_and_port,\n temperature: float = 0.7,\n max_tokens: int = 4096,\n api_key: str = \"\",\n skip_special_tokens=False,\n):\n headers = {}\n if api_key:\n headers[\"Authorization\"] = f\"Bearer {api_key}\"\n\n options = ApiVlmOptions(\n url=f\"http://{hostname_and_port}/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000\n params=dict(\n model=model,\n max_tokens=max_tokens,\n skip_special_tokens=skip_special_tokens, # needed for VLLM\n ),\n headers=headers,\n prompt=prompt,\n timeout=90,\n scale=2.0,\n temperature=temperature,\n response_format=format,\n )\n\n return options\n def openai_compatible_vlm_options( model: str, prompt: str, format: ResponseFormat, hostname_and_port, temperature: float = 0.7, max_tokens: int = 4096, api_key: str = \"\", skip_special_tokens=False, ): headers = {} if api_key: headers[\"Authorization\"] = f\"Bearer {api_key}\" options = ApiVlmOptions( url=f\"http://{hostname_and_port}/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000 params=dict( model=model, max_tokens=max_tokens, skip_special_tokens=skip_special_tokens, # needed for VLLM ), headers=headers, prompt=prompt, timeout=90, scale=2.0, temperature=temperature, response_format=format, ) return options In\u00a0[\u00a0]: Copied! def demo_threaded_layout_vlm_pipeline(\n input_doc_path: Path, out_dir_layout_aware: Path, use_api_vlm: bool\n):\n \"\"\"Demonstrate the threaded layout+VLM pipeline.\"\"\"\n\n vlm_options = GRANITEDOCLING_TRANSFORMERS.model_copy()\n\n if use_api_vlm:\n vlm_options = openai_compatible_vlm_options(\n model=\"granite-docling-258m-mlx\", # For VLLM use \"ibm-granite/granite-docling-258M\"\n hostname_and_port=\"localhost:1234\", # LM studio defaults to port 1234, VLLM to 8000\n prompt=\"Convert this page to docling.\",\n format=ResponseFormat.DOCTAGS,\n api_key=\"\",\n )\n vlm_options.track_input_prompt = True\n\n # Configure pipeline options\n print(\"Configuring pipeline options...\")\n pipeline_options_layout_aware = ThreadedLayoutVlmPipelineOptions(\n # VLM configuration - defaults to GRANITEDOCLING_TRANSFORMERS\n vlm_options=vlm_options,\n # Layout configuration - defaults to DOCLING_LAYOUT_HERON\n # Batch sizes for parallel processing\n layout_batch_size=2,\n vlm_batch_size=1,\n # Queue configuration\n queue_max_size=10,\n # Image processing\n images_scale=vlm_options.scale,\n generate_page_images=True,\n enable_remote_services=use_api_vlm,\n )\n\n # Create converter with the new pipeline\n print(\"Initializing DocumentConverter (this may take a while - loading models)...\")\n doc_converter_layout_enhanced = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ThreadedLayoutVlmPipeline,\n pipeline_options=pipeline_options_layout_aware,\n )\n }\n )\n\n result_layout_aware = doc_converter_layout_enhanced.convert(\n source=input_doc_path, raises_on_error=False\n )\n\n if result_layout_aware.status == ConversionStatus.FAILURE:\n _log.error(f\"Conversion failed: {result_layout_aware.status}\")\n\n doc_filename = result_layout_aware.input.file.stem\n result_layout_aware.document.save_as_json(\n out_dir_layout_aware / f\"{doc_filename}.json\"\n )\n\n result_layout_aware.document.save_as_html(\n out_dir_layout_aware / f\"{doc_filename}.html\", split_page_view=True\n )\n for page in result_layout_aware.pages:\n _log.info(\"Page %s of VLM response:\", page.page_no)\n if 
page.predictions.vlm_response:\n _log.info(page.predictions.vlm_response)\n def demo_threaded_layout_vlm_pipeline( input_doc_path: Path, out_dir_layout_aware: Path, use_api_vlm: bool ): \"\"\"Demonstrate the threaded layout+VLM pipeline.\"\"\" vlm_options = GRANITEDOCLING_TRANSFORMERS.model_copy() if use_api_vlm: vlm_options = openai_compatible_vlm_options( model=\"granite-docling-258m-mlx\", # For VLLM use \"ibm-granite/granite-docling-258M\" hostname_and_port=\"localhost:1234\", # LM studio defaults to port 1234, VLLM to 8000 prompt=\"Convert this page to docling.\", format=ResponseFormat.DOCTAGS, api_key=\"\", ) vlm_options.track_input_prompt = True # Configure pipeline options print(\"Configuring pipeline options...\") pipeline_options_layout_aware = ThreadedLayoutVlmPipelineOptions( # VLM configuration - defaults to GRANITEDOCLING_TRANSFORMERS vlm_options=vlm_options, # Layout configuration - defaults to DOCLING_LAYOUT_HERON # Batch sizes for parallel processing layout_batch_size=2, vlm_batch_size=1, # Queue configuration queue_max_size=10, # Image processing images_scale=vlm_options.scale, generate_page_images=True, enable_remote_services=use_api_vlm, ) # Create converter with the new pipeline print(\"Initializing DocumentConverter (this may take a while - loading models)...\") doc_converter_layout_enhanced = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ThreadedLayoutVlmPipeline, pipeline_options=pipeline_options_layout_aware, ) } ) result_layout_aware = doc_converter_layout_enhanced.convert( source=input_doc_path, raises_on_error=False ) if result_layout_aware.status == ConversionStatus.FAILURE: _log.error(f\"Conversion failed: {result_layout_aware.status}\") doc_filename = result_layout_aware.input.file.stem result_layout_aware.document.save_as_json( out_dir_layout_aware / f\"{doc_filename}.json\" ) result_layout_aware.document.save_as_html( out_dir_layout_aware / f\"{doc_filename}.html\", split_page_view=True ) for page in result_layout_aware.pages: _log.info(\"Page %s of VLM response:\", page.page_no) if page.predictions.vlm_response: _log.info(page.predictions.vlm_response) In\u00a0[\u00a0]: Copied! 
if __name__ == \"__main__\":\n logging.basicConfig(level=logging.INFO)\n try:\n args = _parse_args()\n _log.info(\n f\"Parsed arguments: input={args.input_file}, output={args.output_dir}\"\n )\n\n input_path = Path(args.input_file)\n\n if not input_path.exists():\n raise FileNotFoundError(f\"Input file does not exist: {input_path}\")\n\n if input_path.suffix.lower() != \".pdf\":\n raise ValueError(f\"Input file must be a PDF: {input_path}\")\n\n out_dir_layout_aware = Path(args.output_dir) / \"layout_aware/\"\n out_dir_layout_aware.mkdir(parents=True, exist_ok=True)\n\n use_api_vlm = False # Set to False to use inline VLM model\n\n demo_threaded_layout_vlm_pipeline(input_path, out_dir_layout_aware, use_api_vlm)\n except Exception:\n traceback.print_exc()\n raise\n if __name__ == \"__main__\": logging.basicConfig(level=logging.INFO) try: args = _parse_args() _log.info( f\"Parsed arguments: input={args.input_file}, output={args.output_dir}\" ) input_path = Path(args.input_file) if not input_path.exists(): raise FileNotFoundError(f\"Input file does not exist: {input_path}\") if input_path.suffix.lower() != \".pdf\": raise ValueError(f\"Input file must be a PDF: {input_path}\") out_dir_layout_aware = Path(args.output_dir) / \"layout_aware/\" out_dir_layout_aware.mkdir(parents=True, exist_ok=True) use_api_vlm = False # Set to False to use inline VLM model demo_threaded_layout_vlm_pipeline(input_path, out_dir_layout_aware, use_api_vlm) except Exception: traceback.print_exc() raise"},{"location":"examples/develop_formula_understanding/","title":"Formula enrichment","text":"Developing an enrichment model example (formula understanding: scaffold only).
What this example does
Important
How to run
python docs/examples/develop_formula_understanding.py.Notes
do_formula_understanding=True to enable the example enrichment stage.StandardPdfPipeline and keeps the backend when enrichment is enabled.import logging\nfrom collections.abc import Iterable\nfrom pathlib import Path\n\nfrom docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem\n\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.base_model import BaseItemAndImageEnrichmentModel\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n\n\nclass ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions):\n do_formula_understanding: bool = True\n\n\n# A new enrichment model using both the document element and its image as input\nclass ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel):\n images_scale = 2.6\n\n def __init__(self, enabled: bool):\n self.enabled = enabled\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return (\n self.enabled\n and isinstance(element, TextItem)\n and element.label == DocItemLabel.FORMULA\n )\n\n def __call__(\n self,\n doc: DoclingDocument,\n element_batch: Iterable[ItemAndImageEnrichmentElement],\n ) -> Iterable[NodeItem]:\n if not self.enabled:\n return\n\n for enrich_element in element_batch:\n # Opens a window for each cropped formula image; comment this out when\n # running headless or processing many items to avoid blocking spam.\n enrich_element.image.show()\n\n yield enrich_element.item\n\n\n# How the pipeline can be extended.\nclass ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):\n def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions\n\n self.enrichment_pipe = [\n ExampleFormulaUnderstandingEnrichmentModel(\n enabled=self.pipeline_options.do_formula_understanding\n )\n ]\n\n if self.pipeline_options.do_formula_understanding:\n self.keep_backend = True\n\n @classmethod\n def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions:\n return ExampleFormulaUnderstandingPipelineOptions()\n\n\n# Example main. 
In the final version, we simply have to set do_formula_understanding to true.\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2203.01017v2.pdf\"\n\n pipeline_options = ExampleFormulaUnderstandingPipelineOptions()\n pipeline_options.do_formula_understanding = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ExampleFormulaUnderstandingPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n doc_converter.convert(input_doc_path)\n\n\nif __name__ == \"__main__\":\n main()\n import logging from collections.abc import Iterable from pathlib import Path from docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.base_model import BaseItemAndImageEnrichmentModel from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline class ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions): do_formula_understanding: bool = True # A new enrichment model using both the document element and its image as input class ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel): images_scale = 2.6 def __init__(self, enabled: bool): self.enabled = enabled def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return ( self.enabled and isinstance(element, TextItem) and element.label == DocItemLabel.FORMULA ) def __call__( self, doc: DoclingDocument, element_batch: Iterable[ItemAndImageEnrichmentElement], ) -> Iterable[NodeItem]: if not self.enabled: return for enrich_element in element_batch: # Opens a window for each cropped formula image; comment this out when # running headless or processing many items to avoid blocking spam. enrich_element.image.show() yield enrich_element.item # How the pipeline can be extended. class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline): def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions self.enrichment_pipe = [ ExampleFormulaUnderstandingEnrichmentModel( enabled=self.pipeline_options.do_formula_understanding ) ] if self.pipeline_options.do_formula_understanding: self.keep_backend = True @classmethod def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions: return ExampleFormulaUnderstandingPipelineOptions() # Example main. In the final version, we simply have to set do_formula_understanding to true. def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2203.01017v2.pdf\" pipeline_options = ExampleFormulaUnderstandingPipelineOptions() pipeline_options.do_formula_understanding = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ExampleFormulaUnderstandingPipeline, pipeline_options=pipeline_options, ) } ) doc_converter.convert(input_doc_path) if __name__ == \"__main__\": main()"},{"location":"examples/develop_picture_enrichment/","title":"Figure enrichment","text":"Developing a picture enrichment model (classifier scaffold only).
What this example does
Important
How to run
python docs/examples/develop_picture_enrichment.py.Notes
images_scale to improve crops.StandardPdfPipeline with a custom enrichment stage.import logging\nfrom collections.abc import Iterable\nfrom pathlib import Path\nfrom typing import Any\n\nfrom docling_core.types.doc import (\n DoclingDocument,\n NodeItem,\n PictureClassificationClass,\n PictureClassificationData,\n PictureItem,\n)\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.base_model import BaseEnrichmentModel\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n\n\nclass ExamplePictureClassifierPipelineOptions(PdfPipelineOptions):\n do_picture_classifer: bool = True\n\n\nclass ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel):\n def __init__(self, enabled: bool):\n self.enabled = enabled\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return self.enabled and isinstance(element, PictureItem)\n\n def __call__(\n self, doc: DoclingDocument, element_batch: Iterable[NodeItem]\n ) -> Iterable[Any]:\n if not self.enabled:\n return\n\n for element in element_batch:\n assert isinstance(element, PictureItem)\n\n # uncomment this to interactively visualize the image\n # element.get_image(doc).show() # may block; avoid in headless runs\n\n element.annotations.append(\n PictureClassificationData(\n provenance=\"example_classifier-0.0.1\",\n predicted_classes=[\n PictureClassificationClass(class_name=\"dummy\", confidence=0.42)\n ],\n )\n )\n\n yield element\n\n\nclass ExamplePictureClassifierPipeline(StandardPdfPipeline):\n def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: ExamplePictureClassifierPipeline\n\n self.enrichment_pipe = [\n ExamplePictureClassifierEnrichmentModel(\n enabled=pipeline_options.do_picture_classifer\n )\n ]\n\n @classmethod\n def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions:\n return ExamplePictureClassifierPipelineOptions()\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = ExamplePictureClassifierPipelineOptions()\n pipeline_options.images_scale = 2.0\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ExamplePictureClassifierPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n\n for element, _level in result.document.iterate_items():\n if isinstance(element, PictureItem):\n print(\n f\"The model populated the `data` portion of picture {element.self_ref}:\\n{element.annotations}\"\n )\n\n\nif __name__ == \"__main__\":\n main()\n import logging from collections.abc import Iterable from pathlib import Path from typing import Any from docling_core.types.doc import ( DoclingDocument, NodeItem, PictureClassificationClass, PictureClassificationData, PictureItem, ) from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.base_model import BaseEnrichmentModel from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline class 
ExamplePictureClassifierPipelineOptions(PdfPipelineOptions): do_picture_classifer: bool = True class ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel): def __init__(self, enabled: bool): self.enabled = enabled def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return self.enabled and isinstance(element, PictureItem) def __call__( self, doc: DoclingDocument, element_batch: Iterable[NodeItem] ) -> Iterable[Any]: if not self.enabled: return for element in element_batch: assert isinstance(element, PictureItem) # uncomment this to interactively visualize the image # element.get_image(doc).show() # may block; avoid in headless runs element.annotations.append( PictureClassificationData( provenance=\"example_classifier-0.0.1\", predicted_classes=[ PictureClassificationClass(class_name=\"dummy\", confidence=0.42) ], ) ) yield element class ExamplePictureClassifierPipeline(StandardPdfPipeline): def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: ExamplePictureClassifierPipeline self.enrichment_pipe = [ ExamplePictureClassifierEnrichmentModel( enabled=pipeline_options.do_picture_classifer ) ] @classmethod def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions: return ExamplePictureClassifierPipelineOptions() def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = ExamplePictureClassifierPipelineOptions() pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ExamplePictureClassifierPipeline, pipeline_options=pipeline_options, ) } ) result = doc_converter.convert(input_doc_path) for element, _level in result.document.iterate_items(): if isinstance(element, PictureItem): print( f\"The model populated the `data` portion of picture {element.self_ref}:\\n{element.annotations}\" ) if __name__ == \"__main__\": main()"},{"location":"examples/dpk-ingest-chunk-tokenize/","title":"Chunking & tokenization with Data Prep Kit","text":"In\u00a0[\u00a0]: Copied! %%capture\n%pip install \"data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]\"\n%pip install pandas\n%pip install \"numpy<2.0\"\nfrom dotenv import load_dotenv\n\nload_dotenv(\".env\", override=True)\n%%capture %pip install \"data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]\" %pip install pandas %pip install \"numpy<2.0\" from dotenv import load_dotenv load_dotenv(\".env\", override=True)
We will define and use a utility function for downloading the articles and saving them to the local disk:
load_corpus: Uses an HTTP request with the Wikimedia API token to connect to a Wikimedia endpoint and retrieve the HTML articles that will be used as a seed for our LLM application. The articles are then saved to a local cache folder for further processing.
In\u00a0[\u00a0]: Copied!def load_corpus(articles: list, folder: str) -> int:\n import os\n import re\n\n import requests\n\n headers = {\"Authorization\": f\"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}\"}\n count = 0\n for article in articles:\n try:\n endpoint = f\"https://api.enterprise.wikimedia.com/v2/articles/{article}\"\n response = requests.get(endpoint, headers=headers)\n response.raise_for_status()\n doc = response.json()\n for article in doc:\n filename = re.sub(r\"[^a-zA-Z0-9_]\", \"_\", article[\"name\"])\n with open(f\"{folder}/{filename}.html\", \"w\") as f:\n f.write(article[\"article_body\"][\"html\"])\n count = count + 1\n except Exception as e:\n print(f\"Failed to retrieve content: {e}\")\n return count\n def load_corpus(articles: list, folder: str) -> int: import os import re import requests headers = {\"Authorization\": f\"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}\"} count = 0 for article in articles: try: endpoint = f\"https://api.enterprise.wikimedia.com/v2/articles/{article}\" response = requests.get(endpoint, headers=headers) response.raise_for_status() doc = response.json() for article in doc: filename = re.sub(r\"[^a-zA-Z0-9_]\", \"_\", article[\"name\"]) with open(f\"{folder}/{filename}.html\", \"w\") as f: f.write(article[\"article_body\"][\"html\"]) count = count + 1 except Exception as e: print(f\"Failed to retrieve content: {e}\") return count In\u00a0[\u00a0]: Copied! import os\nimport tempfile\n\ndatafolder = tempfile.mkdtemp(dir=os.getcwd())\narticles = [\"Science,_technology,_engineering,_and_mathematics\"]\nassert load_corpus(articles, datafolder) > 0, \"Faild to download any documents\"\nimport os import tempfile datafolder = tempfile.mkdtemp(dir=os.getcwd()) articles = [\"Science,_technology,_engineering,_and_mathematics\"] assert load_corpus(articles, datafolder) > 0, \"Faild to download any documents\" In\u00a0[\u00a0]: Copied!
%%capture\nfrom dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types\n\nresult = Docling2Parquet(\n input_folder=datafolder,\n output_folder=f\"{datafolder}/docling2parquet\",\n data_files_to_use=[\".html\"],\n docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN, # markdown\n).transform()\n %%capture from dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types result = Docling2Parquet( input_folder=datafolder, output_folder=f\"{datafolder}/docling2parquet\", data_files_to_use=[\".html\"], docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN, # markdown ).transform() In\u00a0[\u00a0]: Copied! %%capture\nfrom dpk_doc_chunk import DocChunk\n\nresult = DocChunk(\n input_folder=f\"{datafolder}/docling2parquet\",\n output_folder=f\"{datafolder}/doc_chunk\",\n doc_chunk_chunking_type=\"li_markdown\",\n doc_chunk_chunk_size_tokens=128, # default 128\n doc_chunk_chunk_overlap_tokens=30, # default 30\n).transform()\n %%capture from dpk_doc_chunk import DocChunk result = DocChunk( input_folder=f\"{datafolder}/docling2parquet\", output_folder=f\"{datafolder}/doc_chunk\", doc_chunk_chunking_type=\"li_markdown\", doc_chunk_chunk_size_tokens=128, # default 128 doc_chunk_chunk_overlap_tokens=30, # default 30 ).transform() In\u00a0[\u00a0]: Copied! %%capture\nfrom dpk_tokenization import Tokenization\n\nTokenization(\n input_folder=f\"{datafolder}/doc_chunk\",\n output_folder=f\"{datafolder}/tkn\",\n tkn_tokenizer=\"hf-internal-testing/llama-tokenizer\",\n tkn_chunk_size=20_000,\n).transform()\n %%capture from dpk_tokenization import Tokenization Tokenization( input_folder=f\"{datafolder}/doc_chunk\", output_folder=f\"{datafolder}/tkn\", tkn_tokenizer=\"hf-internal-testing/llama-tokenizer\", tkn_chunk_size=20_000, ).transform() In\u00a0[\u00a0]: Copied! from pathlib import Path\n\nimport pandas as pd\n\nparquet_files = list(Path(f\"{datafolder}/tkn/\").glob(\"*.parquet\"))\npd.concat(pd.read_parquet(file) for file in parquet_files)\n from pathlib import Path import pandas as pd parquet_files = list(Path(f\"{datafolder}/tkn/\").glob(\"*.parquet\")) pd.concat(pd.read_parquet(file) for file in parquet_files) Out[\u00a0]: tokens document_id document_length token_count 0 [1, 444, 11814, 262, 3002] f1f5b56a78829ab2165b3bbeb94b1167e4c5583c437f1d... 14 5 1 [1, 835, 5298, 13, 13, 797, 278, 4688, 29871, ... 402e82a9e81cc3d2494fac36bebf8bf1a2662800e5a00c... 2100 655 2 [1, 835, 5901, 21833, 13, 13, 29899, 321, 1254... 4fb389d0f0e999c2496f137b4a7c0671e79c09cf9477e9... 2833 968 3 [1, 444, 26304, 4978, 13, 13, 14136, 1967, 666... 3709997548d84224361a6835760b5ae48a1637e78d54a0... 1496 483 4 [1, 444, 2648, 4234] 1e1a58ad5664d963bc207dc791825258c33337c2559f6a... 13 4 5 [1, 835, 8314, 13, 13, 1576, 9870, 315, 1038, ... 83a63864e5ddfdd41ef0f813fb7aa3c95e04c029c32ab3... 1340 442 6 [1, 835, 7400, 13, 13, 6028, 1114, 27871, 2987... 5e29fb4e4cf37ed4c49994620e4a00da9693bc061e82c1... 1800 548 7 [1, 835, 7551, 13, 13, 25411, 3762, 8950, 6020... 3fc34013d93391a7504e84069190479fbc85ba7e7072cb... 1784 511 8 [1, 835, 4092, 13, 13, 13393, 884, 29901, 518,... e8b28e20e3fc3da40b6b368e30f9c953f5218370ec2f7a... 774 229 9 [1, 3191, 18312, 13, 13, 1576, 365, 29965, 152... 94b54fbda274536622f70442b18126f554610e8915b235... 1076 263 10 [1, 3191, 3444, 13, 13, 1576, 1024, 310, 317, ... fef9b66567944df131851834e2fdfb42b5c668e4b08031... 238 60 11 [1, 835, 12798, 12026, 13, 13, 1254, 12665, 97... eeb74ae3490539aa07f25987b6b2666dc907b39147e810... 
366 97 12 [1, 835, 7513, 13, 13, 19302, 284, 2879, 515, ... cc2ccd2e9f4d0a8224716109f7a6e7b30f33ff1f8c7adf... 1395 402 13 [1, 835, 20537, 423, 13, 13, 797, 20537, 423, ... baf13788a018da24d86b630a9032eaeee54913bbbdd0d4... 511 137 14 [1, 835, 21215, 13, 13, 1254, 12665, 17800, 52... a5b3973ab3a98d10f4ae07a004d70c6cdcfacb41fda8d7... 1949 536 15 [1, 835, 26260, 13, 13, 797, 278, 518, 4819, 2... dfa35b16704a4dd549701a7821b6aa856f2dd5e5b69daf... 1042 291 16 [1, 835, 660, 14873, 13, 13, 797, 518, 29984, ... a0809b265e4a011407d38cd06c7b3ce5932683a2f9c6af... 852 282 17 [1, 835, 25960, 13, 13, 1254, 12665, 338, 760,... 85e8f3b2af3268d49e60451d3ac87b3bd281a70cf6c4b7... 1165 285 18 [1, 835, 498, 26517, 13, 13, 797, 29871, 29906... 15c924efdbf0135de91a095237cbe831275bab67ee1371... 1612 397 19 [1, 835, 26459, 13, 13, 29911, 29641, 728, 317... b473b50753dd07f08da05bbf776c57747ab85ba79cb081... 435 145 20 [1, 835, 3303, 3900, 13, 13, 797, 278, 3303, 3... 841cefc910bd5d1920187b23554ee67e0e65563373e6de... 1212 344 21 [1, 3191, 3086, 9327, 10606, 13, 13, 14804, 25... 63924939eab38ad6636495f1c5c13760014efe42b330a6... 1592 416 22 [1, 3191, 1954, 29885, 16783, 8898, 13, 13, 24... 44288e766c343592a44f3da59ad3b57a9f26096ac13412... 1653 465 23 [1, 3191, 13151, 13, 13, 13393, 884, 29901, 51... 40a0f6e213901d92f1a158c3e2a55ad2558eb1deaa973f... 4418 1285 24 [1, 3191, 6981, 1455, 17261, 297, 317, 4330, 2... 5cc92a05d39ee56e9c65cdb00f55bc9dcbe8bc1647a442... 1289 375 25 [1, 3191, 402, 1581, 330, 2547, 297, 317, 4330... 37c88bed7898d9a7406b5b0e4b1ccfaca65a732dff0c03... 821 280 26 [1, 3191, 4124, 2042, 2877, 297, 317, 4330, 29... f144b97af462b2ab8aba5cb6d9cba0cf5f383cc710aba0... 1093 297 27 [1, 3191, 3082, 24620, 277, 20193, 512, 4812, ... 16525e2054a7bb7543308ad4e6642bf60e66dc475a0e0a... 2203 538 28 [1, 3191, 317, 4330, 29924, 13151, 3189, 284, ... ebb319391e1bda81edd5ec214887150044c15cfc04f42f... 514 149 29 [1, 3191, 2522, 449, 292, 13, 13, 797, 29871, ... 882582d1f6202a4e495f67952d3a27929177745b1f575e... 850 261 30 [1, 3191, 10317, 310, 5282, 1947, 11104, 13, 1... 311aa5c91354b6bf575682be701981ccc6569eb35fd726... 1561 416 31 [1, 3191, 24206, 13, 13, 1254, 12665, 23992, 2... abaa73aba997ea267d9b556679c5d680810ee5baa231fa... 384 139 32 [1, 3191, 18991, 362, 13, 13, 1576, 518, 29048... 00f85d6dffd914d89eb44dbb4caa3a1c6b2af47f5c4c96... 878 247 33 [1, 3191, 17163, 29879, 13, 13, 797, 3979, 298... f8d901fca6dcac6c266cf2799da814c5f5b5644c3b9476... 2321 682 34 [1, 3191, 3599, 296, 6728, 13, 13, 7504, 3278,... 8347c4988e3acde4723696fbf63a0f2c13d61e92c8fbac... 2960 841 35 [1, 3191, 28488, 322, 11104, 304, 1371, 2693, ... c3d0c80c861ffcd422f60b78d693bb953b69dfc3c3d55f... 222 81 36 [1, 835, 18444, 13, 13, 797, 18444, 29892, 676... 9c41677100393c4e5e3bc4bc36caee5561cb5c93546aaf... 1143 288 37 [1, 444, 10152, 13, 13, 6330, 7456, 29901, 518... 83f0f668bac5736d5f23f750f86ebbe173c0a56e3c51b8... 2777 833 38 [1, 444, 365, 7210, 29911, 29984, 29974, 13, 1... 24bbfff971979686cd41132b491060bdaaf357bd3bc7cf... 2579 847 39 [1, 444, 15976, 293, 1608, 13, 13, 1576, 8569,... 1b8c147d642e4d53152e1be73223ed58e0788700d82c73... 4700 1299 40 [1, 444, 2823, 884, 13, 13, 29899, 518, 29907,... ac3fb4073323718ea3e32e006ed67c298af9801c4a03dd... 1310 443 41 [1, 444, 28318, 13, 13, 29896, 29889, 518, 298... 2dad03b0e2b81c47012f94be0ab730e9c8341f0311c59e... 59373 26470 42 [1, 444, 8725, 5183, 13, 13, 29899, 4699, 1522... 07dabd1b5cfa6f8c70f97eb33c3a19189a866eae1203c7... 2648 1075 43 [1, 444, 3985, 2988, 13, 13, 29899, 8213, 4475... 
ef8cc66ae18d7238680d07372859c5be061d57b955cf7d... 5025 705 In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/dpk-ingest-chunk-tokenize/#chunking-tokenization-with-data-prep-kit","title":"Chunking & tokenization with Data Prep Kit\u00b6","text":"
This notebook demonstrates how to build a sequence of DPK transforms for ingesting HTML documents using the Docling2Parquet transform and chunking them using the Doc_Chunk transform. Both transforms are based on the Docling library.
In this example, we will use the Wikimedia API to retrieve the HTML articles that will be used as a seed for our LLM application. Once the articles are loaded into a local cache, we will construct and invoke the sequence of transforms to ingest the content and produce the embeddings for the chunked content.
"},{"location":"examples/dpk-ingest-chunk-tokenize/#why-dpk-pipelines","title":"\ud83d\udd0d Why DPK Pipelines\u00b6","text":"DPK transform pipelines are intended to simplify how any number of transforms can be executed in a sequence to ingest, annotate, filter and create embedding used for LLM post-training and RAG applications.
"},{"location":"examples/dpk-ingest-chunk-tokenize/#key-transforms-in-this-recipe","title":"\ud83e\uddf0 Key Transforms in This Recipe\u00b6","text":"We will use the following transforms from DPK:
Docling2Parquet: Ingest one or more HTML documents and turn them into a parquet file.Doc_Chunk: Create chunks from one or more documents.Tokenization: Create embeddings for document chunks.1- This notebook uses the Wikimedia API for retrieving the initial HTML documents and the llama-tokenizer from Hugging Face.
2- In order to use the notebook, users must provide a .env file with a valid access token to be used for accessing the Wikimedia endpoint (instructions can be found here) and a Hugging Face token for loading the model (instructions can be found here). The .env file will look something like this:
WIKI_ACCESS_TOKEN='eyxxx'\nHF_READ_ACCESS_TOKEN='hf_xxx'\n 3- Install the DPK library into your environment (a minimal sanity check for these prerequisites is sketched below).
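Before running the transforms, it can help to verify these prerequisites with a quick check. The sketch below assumes the variable names shown in the .env example above. import os\n\nfrom dotenv import load_dotenv\n\n# Load the .env file described above, overriding any values already set in the environment.\nload_dotenv(\".env\", override=True)\n\n# Fail early if either token is missing, before any transform is invoked.\nfor var in (\"WIKI_ACCESS_TOKEN\", \"HF_READ_ACCESS_TOKEN\"):\n    assert os.getenv(var), f\"Missing {var} in .env\"\n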
"},{"location":"examples/dpk-ingest-chunk-tokenize/#setup-the-experiment","title":"\ud83d\udd17 Setup the experiment\u00b6","text":"DPK requires that we define a source/input folder where the transform sequence will be ingesting the document and a destination/output folder where the embedding will be stored. We will also initialize the list of articles we want to use in our application
"},{"location":"examples/dpk-ingest-chunk-tokenize/#injest","title":"\ud83d\udd17 Injest\u00b6","text":"Invoke Docling2Parquet tansform that will parse the HTML document and create a Markdown
"},{"location":"examples/dpk-ingest-chunk-tokenize/#chunk","title":"\ud83d\udd17 Chunk\u00b6","text":"Invoke DocChunk tansform to break the HTML document into chunks
"},{"location":"examples/dpk-ingest-chunk-tokenize/#tokenization","title":"\ud83d\udd17 Tokenization\u00b6","text":"Invoke Tokenization transform to create embedding of various chunks
"},{"location":"examples/dpk-ingest-chunk-tokenize/#summary","title":"\u2705 Summary\u00b6","text":"This notebook demonstrated how to run a DPK pipeline using IBM's Data Prep Kit and the Docling library. Each transform create one or more parquet files that users can explore to better understand what each stage of the pipeline produces. The see the output of the final stage, we will use Pandas to read the final parquet file and display its content
"},{"location":"examples/enrich_doclingdocument/","title":"Enrich a DoclingDocument","text":"Enrich an existing DoclingDocument JSON with a custom model (post-conversion).
What this example does
Prerequisites
How to run
python docs/examples/enrich_doclingdocument.py.input_doc_path and input_pdf_path if your data is elsewhere.Notes
BATCH_SIZE controls how many elements are passed to the model at once.prepare_element() crops context around elements based on the model's expansion.### Load modules\n\nfrom pathlib import Path\nfrom typing import Iterable, Optional\n\nfrom docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem\nfrom rich.pretty import pprint\n\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\nfrom docling.datamodel.accelerator_options import AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.document import InputDocument\nfrom docling.models.base_model import BaseItemAndImageEnrichmentModel\nfrom docling.models.document_picture_classifier import (\n DocumentPictureClassifier,\n DocumentPictureClassifierOptions,\n)\nfrom docling.utils.utils import chunkify\n\n### Define batch size used for processing\n\nBATCH_SIZE = 4\n# Trade-off: larger batches improve throughput but increase memory usage.\n\n### From DocItem to the model inputs\n# The following function is responsible for taking an item and applying the required pre-processing for the model.\n# In this case we generate a cropped image from the document backend.\n\n\ndef prepare_element(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n element: NodeItem,\n) -> Optional[ItemAndImageEnrichmentElement]:\n if not model.is_processable(doc=doc, element=element):\n return None\n\n assert isinstance(element, DocItem)\n element_prov = element.prov[0]\n\n bbox = element_prov.bbox\n width = bbox.r - bbox.l\n height = bbox.t - bbox.b\n\n expanded_bbox = BoundingBox(\n l=bbox.l - width * model.expansion_factor,\n t=bbox.t + height * model.expansion_factor,\n r=bbox.r + width * model.expansion_factor,\n b=bbox.b - height * model.expansion_factor,\n coord_origin=bbox.coord_origin,\n )\n\n page_ix = element_prov.page_no - 1\n page_backend = backend.load_page(page_no=page_ix)\n cropped_image = page_backend.get_page_image(\n scale=model.images_scale, cropbox=expanded_bbox\n )\n return ItemAndImageEnrichmentElement(item=element, image=cropped_image)\n\n\n### Iterate through the document\n# This block defines the `enrich_document()` which is responsible for iterating through the document\n# and batch the selected document items for running through the model.\n\n\ndef enrich_document(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n) -> DoclingDocument:\n def _prepare_elements(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n ) -> Iterable[NodeItem]:\n for doc_element, _level in doc.iterate_items():\n prepared_element = prepare_element(\n doc=doc, backend=backend, model=model, element=doc_element\n )\n if prepared_element is not None:\n yield prepared_element\n\n for element_batch in chunkify(\n _prepare_elements(doc, backend, model),\n BATCH_SIZE,\n ):\n for element in model(doc=doc, element_batch=element_batch): # Must exhaust!\n pass\n\n return doc\n\n\n### Open and process\n# The `main()` function which initializes the document and model objects for calling `enrich_document()`.\n\n\ndef main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_pdf_path = data_folder / \"pdf/2206.01062.pdf\"\n\n input_doc_path = data_folder / \"groundtruth/docling_v2/2206.01062.json\"\n\n doc = DoclingDocument.load_from_json(input_doc_path)\n\n in_pdf_doc = InputDocument(\n 
input_pdf_path,\n format=InputFormat.PDF,\n backend=PyPdfiumDocumentBackend,\n filename=input_pdf_path.name,\n )\n backend = in_pdf_doc._backend\n\n model = DocumentPictureClassifier(\n enabled=True,\n artifacts_path=None,\n options=DocumentPictureClassifierOptions(),\n accelerator_options=AcceleratorOptions(),\n )\n\n doc = enrich_document(doc=doc, backend=backend, model=model)\n\n for pic in doc.pictures[:5]:\n print(pic.self_ref)\n pprint(pic.annotations)\n\n\nif __name__ == \"__main__\":\n main()\n### Load modules from pathlib import Path from typing import Iterable, Optional from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem from rich.pretty import pprint from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend from docling.datamodel.accelerator_options import AcceleratorOptions from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement from docling.datamodel.document import InputDocument from docling.models.base_model import BaseItemAndImageEnrichmentModel from docling.models.document_picture_classifier import ( DocumentPictureClassifier, DocumentPictureClassifierOptions, ) from docling.utils.utils import chunkify ### Define batch size used for processing BATCH_SIZE = 4 # Trade-off: larger batches improve throughput but increase memory usage. ### From DocItem to the model inputs # The following function is responsible for taking an item and applying the required pre-processing for the model. # In this case we generate a cropped image from the document backend. def prepare_element( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, element: NodeItem, ) -> Optional[ItemAndImageEnrichmentElement]: if not model.is_processable(doc=doc, element=element): return None assert isinstance(element, DocItem) element_prov = element.prov[0] bbox = element_prov.bbox width = bbox.r - bbox.l height = bbox.t - bbox.b expanded_bbox = BoundingBox( l=bbox.l - width * model.expansion_factor, t=bbox.t + height * model.expansion_factor, r=bbox.r + width * model.expansion_factor, b=bbox.b - height * model.expansion_factor, coord_origin=bbox.coord_origin, ) page_ix = element_prov.page_no - 1 page_backend = backend.load_page(page_no=page_ix) cropped_image = page_backend.get_page_image( scale=model.images_scale, cropbox=expanded_bbox ) return ItemAndImageEnrichmentElement(item=element, image=cropped_image) ### Iterate through the document # This block defines the `enrich_document()` which is responsible for iterating through the document # and batch the selected document items for running through the model. def enrich_document( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, ) -> DoclingDocument: def _prepare_elements( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, ) -> Iterable[NodeItem]: for doc_element, _level in doc.iterate_items(): prepared_element = prepare_element( doc=doc, backend=backend, model=model, element=doc_element ) if prepared_element is not None: yield prepared_element for element_batch in chunkify( _prepare_elements(doc, backend, model), BATCH_SIZE, ): for element in model(doc=doc, element_batch=element_batch): # Must exhaust! pass return doc ### Open and process # The `main()` function which initializes the document and model objects for calling `enrich_document()`. 
def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_pdf_path = data_folder / \"pdf/2206.01062.pdf\" input_doc_path = data_folder / \"groundtruth/docling_v2/2206.01062.json\" doc = DoclingDocument.load_from_json(input_doc_path) in_pdf_doc = InputDocument( input_pdf_path, format=InputFormat.PDF, backend=PyPdfiumDocumentBackend, filename=input_pdf_path.name, ) backend = in_pdf_doc._backend model = DocumentPictureClassifier( enabled=True, artifacts_path=None, options=DocumentPictureClassifierOptions(), accelerator_options=AcceleratorOptions(), ) doc = enrich_document(doc=doc, backend=backend, model=model) for pic in doc.pictures[:5]: print(pic.self_ref) pprint(pic.annotations) if __name__ == \"__main__\": main()"},{"location":"examples/enrich_simple_pipeline/","title":"Enrich simple pipeline","text":"In\u00a0[\u00a0]: Copied!
import logging\nfrom pathlib import Path\nimport logging from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import ConvertPipelineOptions\nfrom docling.document_converter import (\n DocumentConverter,\n HTMLFormatOption,\n WordFormatOption,\n)\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ConvertPipelineOptions from docling.document_converter import ( DocumentConverter, HTMLFormatOption, WordFormatOption, ) In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def main():\n input_path = Path(\"tests/data/docx/word_sample.docx\")\n\n pipeline_options = ConvertPipelineOptions()\n pipeline_options.do_picture_classification = True\n pipeline_options.do_picture_description = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.DOCX: WordFormatOption(pipeline_options=pipeline_options),\n InputFormat.HTML: HTMLFormatOption(pipeline_options=pipeline_options),\n },\n )\n\n res = doc_converter.convert(input_path)\n\n print(res.document.export_to_markdown())\n def main(): input_path = Path(\"tests/data/docx/word_sample.docx\") pipeline_options = ConvertPipelineOptions() pipeline_options.do_picture_classification = True pipeline_options.do_picture_description = True doc_converter = DocumentConverter( format_options={ InputFormat.DOCX: WordFormatOption(pipeline_options=pipeline_options), InputFormat.HTML: HTMLFormatOption(pipeline_options=pipeline_options), }, ) res = doc_converter.convert(input_path) print(res.document.export_to_markdown()) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/export_figures/","title":"Figure export","text":"
Export page, figure, and table images from a PDF and save rich outputs.
What this example does
scratch/.Prerequisites
pip install pillow) if not already available via Docling's deps.docling from your Python environment.How to run
python docs/examples/export_figures.py.scratch/.Key options
IMAGE_RESOLUTION_SCALE: increase to render higher-resolution images (e.g., 2.0).PdfPipelineOptions.generate_page_images/generate_picture_images: preserve images for export.ImageRefMode: choose EMBEDDED or REFERENCED when saving Markdown/HTML.Input document
tests/data/pdf/2206.01062.pdf. Change input_doc_path as needed.import logging\nimport time\nfrom pathlib import Path\n\nfrom docling_core.types.doc import ImageRefMode, PictureItem, TableItem\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\nIMAGE_RESOLUTION_SCALE = 2.0\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n # Keep page/element images so they can be exported. The `images_scale` controls\n # the rendered image resolution (scale=1 ~ 72 DPI). The `generate_*` toggles\n # decide which elements are enriched with images.\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n doc_filename = conv_res.input.file.stem\n\n # Save page images\n for page_no, page in conv_res.document.pages.items():\n page_no = page.page_no\n page_image_filename = output_dir / f\"{doc_filename}-{page_no}.png\"\n with page_image_filename.open(\"wb\") as fp:\n page.image.pil_image.save(fp, format=\"PNG\")\n\n # Save images of figures and tables\n table_counter = 0\n picture_counter = 0\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TableItem):\n table_counter += 1\n element_image_filename = (\n output_dir / f\"{doc_filename}-table-{table_counter}.png\"\n )\n with element_image_filename.open(\"wb\") as fp:\n element.get_image(conv_res.document).save(fp, \"PNG\")\n\n if isinstance(element, PictureItem):\n picture_counter += 1\n element_image_filename = (\n output_dir / f\"{doc_filename}-picture-{picture_counter}.png\"\n )\n with element_image_filename.open(\"wb\") as fp:\n element.get_image(conv_res.document).save(fp, \"PNG\")\n\n # Save markdown with embedded pictures\n md_filename = output_dir / f\"{doc_filename}-with-images.md\"\n conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n # Save markdown with externally referenced pictures\n md_filename = output_dir / f\"{doc_filename}-with-image-refs.md\"\n conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)\n\n # Save HTML with externally referenced pictures\n html_filename = output_dir / f\"{doc_filename}-with-image-refs.html\"\n conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED)\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted and figures exported in {end_time:.2f} seconds.\")\n\n\nif __name__ == \"__main__\":\n main()\n import logging import time from pathlib import Path from docling_core.types.doc import ImageRefMode, PictureItem, TableItem from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) IMAGE_RESOLUTION_SCALE = 2.0 def main(): logging.basicConfig(level=logging.INFO) 
data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # Keep page/element images so they can be exported. The `images_scale` controls # the rendered image resolution (scale=1 ~ 72 DPI). The `generate_*` toggles # decide which elements are enriched with images. pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_res.input.file.stem # Save page images for page_no, page in conv_res.document.pages.items(): page_no = page.page_no page_image_filename = output_dir / f\"{doc_filename}-{page_no}.png\" with page_image_filename.open(\"wb\") as fp: page.image.pil_image.save(fp, format=\"PNG\") # Save images of figures and tables table_counter = 0 picture_counter = 0 for element, _level in conv_res.document.iterate_items(): if isinstance(element, TableItem): table_counter += 1 element_image_filename = ( output_dir / f\"{doc_filename}-table-{table_counter}.png\" ) with element_image_filename.open(\"wb\") as fp: element.get_image(conv_res.document).save(fp, \"PNG\") if isinstance(element, PictureItem): picture_counter += 1 element_image_filename = ( output_dir / f\"{doc_filename}-picture-{picture_counter}.png\" ) with element_image_filename.open(\"wb\") as fp: element.get_image(conv_res.document).save(fp, \"PNG\") # Save markdown with embedded pictures md_filename = output_dir / f\"{doc_filename}-with-images.md\" conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) # Save markdown with externally referenced pictures md_filename = output_dir / f\"{doc_filename}-with-image-refs.md\" conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED) # Save HTML with externally referenced pictures html_filename = output_dir / f\"{doc_filename}-with-image-refs.html\" conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED) end_time = time.time() - start_time _log.info(f\"Document converted and figures exported in {end_time:.2f} seconds.\") if __name__ == \"__main__\": main()"},{"location":"examples/export_multimodal/","title":"Multimodal export","text":"Export multimodal page data (image bytes, text, segments) to a Parquet file.
What this example does
.parquet in scratch/.Prerequisites
pandas. Optional: datasets and Pillow for the commented demo.How to run
python docs/examples/export_multimodal.py.scratch/.Key options
IMAGE_RESOLUTION_SCALE: page rendering scale (1 ~ 72 DPI).PdfPipelineOptions.generate_page_images: keep page images for export.Requirements
pyarrow or fastparquet (pip install pyarrow is the most common choice).Input document
tests/data/pdf/2206.01062.pdf. Change input_doc_path as needed.Notes
import datetime\nimport logging\nimport time\nfrom pathlib import Path\n\nimport pandas as pd\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.utils.export import generate_multimodal_pages\nfrom docling.utils.utils import create_hash\n\n_log = logging.getLogger(__name__)\n\nIMAGE_RESOLUTION_SCALE = 2.0\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n # Keep page images so they can be exported to the multimodal rows.\n # Use PdfPipelineOptions.images_scale to control the render scale (1 ~ 72 DPI).\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n\n rows = []\n for (\n content_text,\n content_md,\n content_dt,\n page_cells,\n page_segments,\n page,\n ) in generate_multimodal_pages(conv_res):\n dpi = page._default_image_scale * 72\n\n rows.append(\n {\n \"document\": conv_res.input.file.name,\n \"hash\": conv_res.input.document_hash,\n \"page_hash\": create_hash(\n conv_res.input.document_hash + \":\" + str(page.page_no - 1)\n ),\n \"image\": {\n \"width\": page.image.width,\n \"height\": page.image.height,\n \"bytes\": page.image.tobytes(),\n },\n \"cells\": page_cells,\n \"contents\": content_text,\n \"contents_md\": content_md,\n \"contents_dt\": content_dt,\n \"segments\": page_segments,\n \"extra\": {\n \"page_num\": page.page_no + 1,\n \"width_in_points\": page.size.width,\n \"height_in_points\": page.size.height,\n \"dpi\": dpi,\n },\n }\n )\n\n # Generate one parquet from all documents\n df_result = pd.json_normalize(rows)\n now = datetime.datetime.now()\n output_filename = output_dir / f\"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet\"\n df_result.to_parquet(output_filename)\n\n end_time = time.time() - start_time\n\n _log.info(\n f\"Document converted and multimodal pages generated in {end_time:.2f} seconds.\"\n )\n\n # This block demonstrates how the file can be opened with the HF datasets library\n # from datasets import Dataset\n # from PIL import Image\n # multimodal_df = pd.read_parquet(output_filename)\n\n # # Convert pandas DataFrame to Hugging Face Dataset and load bytes into image\n # dataset = Dataset.from_pandas(multimodal_df)\n # def transforms(examples):\n # examples[\"image\"] = Image.frombytes('RGB', (examples[\"image.width\"], examples[\"image.height\"]), examples[\"image.bytes\"], 'raw')\n # return examples\n # dataset = dataset.map(transforms)\n\n\nif __name__ == \"__main__\":\n main()\n import datetime import logging import time from pathlib import Path import pandas as pd from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.utils.export import generate_multimodal_pages from docling.utils.utils import create_hash _log = logging.getLogger(__name__) IMAGE_RESOLUTION_SCALE = 2.0 def main(): logging.basicConfig(level=logging.INFO) 
data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # Keep page images so they can be exported to the multimodal rows. # Use PdfPipelineOptions.images_scale to control the render scale (1 ~ 72 DPI). pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) rows = [] for ( content_text, content_md, content_dt, page_cells, page_segments, page, ) in generate_multimodal_pages(conv_res): dpi = page._default_image_scale * 72 rows.append( { \"document\": conv_res.input.file.name, \"hash\": conv_res.input.document_hash, \"page_hash\": create_hash( conv_res.input.document_hash + \":\" + str(page.page_no - 1) ), \"image\": { \"width\": page.image.width, \"height\": page.image.height, \"bytes\": page.image.tobytes(), }, \"cells\": page_cells, \"contents\": content_text, \"contents_md\": content_md, \"contents_dt\": content_dt, \"segments\": page_segments, \"extra\": { \"page_num\": page.page_no + 1, \"width_in_points\": page.size.width, \"height_in_points\": page.size.height, \"dpi\": dpi, }, } ) # Generate one parquet from all documents df_result = pd.json_normalize(rows) now = datetime.datetime.now() output_filename = output_dir / f\"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet\" df_result.to_parquet(output_filename) end_time = time.time() - start_time _log.info( f\"Document converted and multimodal pages generated in {end_time:.2f} seconds.\" ) # This block demonstrates how the file can be opened with the HF datasets library # from datasets import Dataset # from PIL import Image # multimodal_df = pd.read_parquet(output_filename) # # Convert pandas DataFrame to Hugging Face Dataset and load bytes into image # dataset = Dataset.from_pandas(multimodal_df) # def transforms(examples): # examples[\"image\"] = Image.frombytes('RGB', (examples[\"image.width\"], examples[\"image.height\"]), examples[\"image.bytes\"], 'raw') # return examples # dataset = dataset.map(transforms) if __name__ == \"__main__\": main()"},{"location":"examples/export_tables/","title":"Table export","text":"Extract tables from a PDF and export them as CSV and HTML.
What this example does
scratch/.Prerequisites
pandas.How to run
python docs/examples/export_tables.py.scratch/.Input document
tests/data/pdf/2206.01062.pdf. Change input_doc_path as needed.Notes
table.export_to_dataframe() returns a pandas DataFrame for convenient export/processing.DataFrame.to_markdown() may require the optional tabulate package (pip install tabulate). If unavailable, skip the print or use to_csv().import logging\nimport time\nfrom pathlib import Path\n\nimport pandas as pd\n\nfrom docling.document_converter import DocumentConverter\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n doc_converter = DocumentConverter()\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n\n doc_filename = conv_res.input.file.stem\n\n # Export tables\n for table_ix, table in enumerate(conv_res.document.tables):\n table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res.document)\n print(f\"## Table {table_ix}\")\n print(table_df.to_markdown())\n\n # Save the table as CSV\n element_csv_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.csv\"\n _log.info(f\"Saving CSV table to {element_csv_filename}\")\n table_df.to_csv(element_csv_filename)\n\n # Save the table as HTML\n element_html_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.html\"\n _log.info(f\"Saving HTML table to {element_html_filename}\")\n with element_html_filename.open(\"w\") as fp:\n fp.write(table.export_to_html(doc=conv_res.document))\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted and tables exported in {end_time:.2f} seconds.\")\n\n\nif __name__ == \"__main__\":\n main()\n import logging import time from pathlib import Path import pandas as pd from docling.document_converter import DocumentConverter _log = logging.getLogger(__name__) def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") doc_converter = DocumentConverter() start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_res.input.file.stem # Export tables for table_ix, table in enumerate(conv_res.document.tables): table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res.document) print(f\"## Table {table_ix}\") print(table_df.to_markdown()) # Save the table as CSV element_csv_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.csv\" _log.info(f\"Saving CSV table to {element_csv_filename}\") table_df.to_csv(element_csv_filename) # Save the table as HTML element_html_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.html\" _log.info(f\"Saving HTML table to {element_html_filename}\") with element_html_filename.open(\"w\") as fp: fp.write(table.export_to_html(doc=conv_res.document)) end_time = time.time() - start_time _log.info(f\"Document converted and tables exported in {end_time:.2f} seconds.\") if __name__ == \"__main__\": main()"},{"location":"examples/extraction/","title":"Information extraction","text":"\ud83d\udc49 NOTE: The extraction API is currently in beta and may change without prior notice.
Docling provides the capability of extracting information, i.e. structured data, from unstructured documents.
The user can provide the desired data schema (the template) either as a string, a dictionary, or a Pydantic model, and Docling will return the extracted data in a standardized output, organized by page.
Check out the subsections below for different usage scenarios.
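For orientation, here is a minimal sketch of the flow described above, using the same DocumentExtractor API that the subsections below walk through step by step (the input path is a placeholder):
from docling.datamodel.base_models import InputFormat\nfrom docling.document_extractor import DocumentExtractor\n\n# Minimal sketch: extract structured data using a dict template.\nextractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])\nresult = extractor.extract(\n    source=\"path/to/document.pdf\",  # placeholder: any supported image or PDF\n    template={\"bill_no\": \"string\", \"total\": \"float\"},  # desired schema as a dict\n)\nfor page in result.pages:  # extraction results are organized by page\n    print(page.extracted_data)\n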
In\u00a0[\u00a0]: Copied!%pip install -q docling[vlm] # Install the Docling package with VLM support\n%pip install -q docling[vlm] # Install the Docling package with VLM support In\u00a0[1]: Copied!
from IPython import display\nfrom pydantic import BaseModel, Field\nfrom rich import print\nfrom IPython import display from pydantic import BaseModel, Field from rich import print
In this notebook, we will work with an example input image \u2014 let's quickly inspect it:
In\u00a0[2]: Copied!file_path = (\n \"https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg\"\n)\ndisplay.HTML(f\"<img src='{file_path}' height='1000'>\")\n file_path = ( \"https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg\" ) display.HTML(f\"\") Out[2]: Let's first define our extractor:
In\u00a0[3]: Copied!from docling.datamodel.base_models import InputFormat\nfrom docling.document_extractor import DocumentExtractor\n\nextractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])\nfrom docling.datamodel.base_models import InputFormat from docling.document_extractor import DocumentExtractor extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])
In the following, we look at different ways to define the data template.
In\u00a0[4]: Copied!result = extractor.extract(\n source=file_path,\n template='{\"bill_no\": \"string\", \"total\": \"float\"}',\n)\nprint(result.pages)\n result = extractor.extract( source=file_path, template='{\"bill_no\": \"string\", \"total\": \"float\"}', ) print(result.pages) /Users/pva/work/github.com/DS4SD/docling/docling/document_extractor.py:143: UserWarning: The extract API is currently experimental and may change without prior notice.\nOnly PDF and image formats are supported.\n return next(all_res)\nYou have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.\nThe following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={'bill_no': '3139', 'total': 3949.75},\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75}',\n errors=[]\n )\n]\n In\u00a0[5]: Copied! result = extractor.extract(\n source=file_path,\n template={\n \"bill_no\": \"string\",\n \"total\": \"float\",\n },\n)\nprint(result.pages)\n result = extractor.extract( source=file_path, template={ \"bill_no\": \"string\", \"total\": \"float\", }, ) print(result.pages) The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={'bill_no': '3139', 'total': 3949.75},\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75}',\n errors=[]\n )\n]\n First we define the Pydantic model we want to use
In\u00a0[6]: Copied!from typing import Optional\n\n\nclass Invoice(BaseModel):\n bill_no: str = Field(\n examples=[\"A123\", \"5414\"]\n ) # provide some examples, but no default value\n total: float = Field(\n default=10, examples=[20]\n ) # provide some examples and a default value\n tax_id: Optional[str] = Field(default=None, examples=[\"1234567890\"])\nfrom typing import Optional class Invoice(BaseModel): bill_no: str = Field( examples=[\"A123\", \"5414\"] ) # provide some examples, but no default value total: float = Field( default=10, examples=[20] ) # provide some examples and a default value tax_id: Optional[str] = Field(default=None, examples=[\"1234567890\"])
The class itself can then be used directly as the template:
In\u00a0[7]: Copied!result = extractor.extract(\n source=file_path,\n template=Invoice,\n)\nprint(result.pages)\nresult = extractor.extract( source=file_path, template=Invoice, ) print(result.pages)
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={'bill_no': '3139', 'total': 3949.75, 'tax_id': None},\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": null}',\n errors=[]\n )\n]\n Alternatively, a Pydantic model instance can be passed as a template, which allows the default values to be overridden.
This can be very useful when we already have context that is more relevant than the default values predefined in the model definition.
E.g. in the example below:
bill_no and total are actually set from the values extracted from the data, while no tax_id could be extracted, so the updated default we provided was applied. result = extractor.extract(\n source=file_path,\n template=Invoice(\n bill_no=\"41\",\n total=100,\n tax_id=\"42\",\n ),\n)\nprint(result.pages)\nresult = extractor.extract( source=file_path, template=Invoice( bill_no=\"41\", total=100, tax_id=\"42\", ), ) print(result.pages)
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={'bill_no': '3139', 'total': 3949.75, 'tax_id': '42'},\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": \"42\"}',\n errors=[]\n )\n]\n Besides a flat template, we can in principle use any Pydantic model, which enables reuse and makes it possible to capture hierarchical structure:
In\u00a0[9]: Copied!class Contact(BaseModel):\n name: Optional[str] = Field(default=None, examples=[\"Smith\"])\n address: str = Field(default=\"123 Main St\", examples=[\"456 Elm St\"])\n postal_code: str = Field(default=\"12345\", examples=[\"67890\"])\n city: str = Field(default=\"Anytown\", examples=[\"Othertown\"])\n country: Optional[str] = Field(default=None, examples=[\"Canada\"])\n\n\nclass ExtendedInvoice(BaseModel):\n bill_no: str = Field(\n examples=[\"A123\", \"5414\"]\n ) # provide some examples, but not the actual value of the test sample\n total: float = Field(\n default=10, examples=[20]\n ) # provide a default value and some examples\n garden_work_hours: int = Field(default=1, examples=[2])\n sender: Contact = Field(default=Contact(), examples=[Contact()])\n receiver: Contact = Field(default=Contact(), examples=[Contact()])\nclass Contact(BaseModel): name: Optional[str] = Field(default=None, examples=[\"Smith\"]) address: str = Field(default=\"123 Main St\", examples=[\"456 Elm St\"]) postal_code: str = Field(default=\"12345\", examples=[\"67890\"]) city: str = Field(default=\"Anytown\", examples=[\"Othertown\"]) country: Optional[str] = Field(default=None, examples=[\"Canada\"]) class ExtendedInvoice(BaseModel): bill_no: str = Field( examples=[\"A123\", \"5414\"] ) # provide some examples, but not the actual value of the test sample total: float = Field( default=10, examples=[20] ) # provide a default value and some examples garden_work_hours: int = Field(default=1, examples=[2]) sender: Contact = Field(default=Contact(), examples=[Contact()]) receiver: Contact = Field(default=Contact(), examples=[Contact()]) In\u00a0[10]: Copied!
result = extractor.extract(\n source=file_path,\n template=ExtendedInvoice,\n)\nprint(result.pages)\nresult = extractor.extract( source=file_path, template=ExtendedInvoice, ) print(result.pages)
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={\n 'bill_no': '3139',\n 'total': 3949.75,\n 'garden_work_hours': 28,\n 'sender': {\n 'name': 'Robert Schneider',\n 'address': 'Rue du Lac 1268',\n 'postal_code': '2501',\n 'city': 'Biel',\n 'country': 'Switzerland'\n },\n 'receiver': {\n 'name': 'Pia Rutschmann',\n 'address': 'Marktgasse 28',\n 'postal_code': '9400',\n 'city': 'Rorschach',\n 'country': 'Switzerland'\n }\n },\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75, \"garden_work_hours\": 28, \"sender\": {\"name\": \"Robert \nSchneider\", \"address\": \"Rue du Lac 1268\", \"postal_code\": \"2501\", \"city\": \"Biel\", \"country\": \"Switzerland\"}, \n\"receiver\": {\"name\": \"Pia Rutschmann\", \"address\": \"Marktgasse 28\", \"postal_code\": \"9400\", \"city\": \"Rorschach\", \n\"country\": \"Switzerland\"}}',\n errors=[]\n )\n]\n The generated response data can be easily validated and loaded via Pydantic:
In\u00a0[11]: Copied!invoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data)\nprint(invoice)\ninvoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data) print(invoice)
ExtendedInvoice(\n bill_no='3139',\n total=3949.75,\n garden_work_hours=28,\n sender=Contact(\n name='Robert Schneider',\n address='Rue du Lac 1268',\n postal_code='2501',\n city='Biel',\n country='Switzerland'\n ),\n receiver=Contact(\n name='Pia Rutschmann',\n address='Marktgasse 28',\n postal_code='9400',\n city='Rorschach',\n country='Switzerland'\n )\n)\n
This way, we can get from completely unstructured data to a very structured and developer-friendly representation:
In\u00a0[12]: Copied!print(\n f\"Invoice #{invoice.bill_no} was sent by {invoice.sender.name} \"\n f\"to {invoice.receiver.name} at {invoice.sender.address}.\"\n)\n print( f\"Invoice #{invoice.bill_no} was sent by {invoice.sender.name} \" f\"to {invoice.receiver.name} at {invoice.sender.address}.\" ) Invoice #3139 was sent by Robert Schneider to Pia Rutschmann at Rue du Lac 1268.\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/extraction/#information-extraction","title":"Information extraction\u00b6","text":""},{"location":"examples/extraction/#defining-the-extractor","title":"Defining the extractor\u00b6","text":""},{"location":"examples/extraction/#using-a-string-template","title":"Using a string template\u00b6","text":""},{"location":"examples/extraction/#using-a-dict-template","title":"Using a dict template\u00b6","text":""},{"location":"examples/extraction/#using-a-pydantic-model-template","title":"Using a Pydantic model template\u00b6","text":""},{"location":"examples/extraction/#advanced-pydantic-model","title":"Advanced Pydantic model\u00b6","text":""},{"location":"examples/extraction/#validating-and-loading-the-extracted-data","title":"Validating and loading the extracted data\u00b6","text":""},{"location":"examples/full_page_ocr/","title":"Force full page OCR","text":"
Force full-page OCR on a PDF using different OCR backends.
What this example does
ocr_options.Prerequisites
How to run
python docs/examples/full_page_ocr.py.Choosing an OCR backend
ocr_options = ... line below. Exactly one should be active.force_full_page_ocr=True processes each page purely via OCR (often slower than hybrid detection). Use when layout extraction is unreliable or the PDF contains scanned pages.EasyOcrOptions, TesseractOcrOptions, OcrMacOptions, RapidOcrOptions.Input document
tests/data/pdf/2206.01062.pdf. Change input_doc_path as needed.from pathlib import Path\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n TesseractCliOcrOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n\ndef main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n\n # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,\n # TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions\n # ocr_options = EasyOcrOptions(force_full_page_ocr=True)\n # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n # ocr_options = OcrMacOptions(force_full_page_ocr=True)\n # ocr_options = RapidOcrOptions(force_full_page_ocr=True)\n ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)\n pipeline_options.ocr_options = ocr_options\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n doc = converter.convert(input_doc_path).document\n md = doc.export_to_markdown()\n print(md)\n\n\nif __name__ == \"__main__\":\n main()\n from pathlib import Path from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, TesseractCliOcrOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = PdfPipelineOptions() pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, # TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions # ocr_options = EasyOcrOptions(force_full_page_ocr=True) # ocr_options = TesseractOcrOptions(force_full_page_ocr=True) # ocr_options = OcrMacOptions(force_full_page_ocr=True) # ocr_options = RapidOcrOptions(force_full_page_ocr=True) ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True) pipeline_options.ocr_options = ocr_options converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(input_doc_path).document md = doc.export_to_markdown() print(md) if __name__ == \"__main__\": main()"},{"location":"examples/gpu_standard_pipeline/","title":"Standard pipeline","text":"What this example does
Requirements
pip install doclingHow to run
python docs/examples/gpu_standard_pipeline.pyThis example is part of a set of GPU optimization strategies. Read more about it in GPU support
In\u00a0[\u00a0]: Copied!import datetime\nimport logging\nimport time\nfrom pathlib import Path\n\nimport numpy as np\nfrom pydantic import TypeAdapter\n\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.pipeline_options import (\n ThreadedPdfPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline\nfrom docling.utils.profiling import ProfilingItem\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n logging.getLogger(\"docling\").setLevel(logging.WARNING)\n _log.setLevel(logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n # input_doc_path = data_folder / \"pdf\" / \"2305.03393v1.pdf\" # 14 pages\n input_doc_path = data_folder / \"pdf\" / \"redp5110_sampled.pdf\" # 18 pages\n\n pipeline_options = ThreadedPdfPipelineOptions(\n accelerator_options=AcceleratorOptions(\n device=AcceleratorDevice.CUDA,\n ),\n ocr_batch_size=4,\n layout_batch_size=64,\n table_batch_size=4,\n )\n pipeline_options.do_ocr = False\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ThreadedStandardPdfPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n\n start_time = time.time()\n doc_converter.initialize_pipeline(InputFormat.PDF)\n init_runtime = time.time() - start_time\n _log.info(f\"Pipeline initialized in {init_runtime:.2f} seconds.\")\n\n start_time = time.time()\n conv_result = doc_converter.convert(input_doc_path)\n pipeline_runtime = time.time() - start_time\n assert conv_result.status == ConversionStatus.SUCCESS\n\n num_pages = len(conv_result.pages)\n _log.info(f\"Document converted in {pipeline_runtime:.2f} seconds.\")\n _log.info(f\" {num_pages / pipeline_runtime:.2f} pages/second.\")\n\n\nif __name__ == \"__main__\":\n main()\n import datetime import logging import time from pathlib import Path import numpy as np from pydantic import TypeAdapter from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.pipeline_options import ( ThreadedPdfPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline from docling.utils.profiling import ProfilingItem _log = logging.getLogger(__name__) def main(): logging.getLogger(\"docling\").setLevel(logging.WARNING) _log.setLevel(logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" # input_doc_path = data_folder / \"pdf\" / \"2305.03393v1.pdf\" # 14 pages input_doc_path = data_folder / \"pdf\" / \"redp5110_sampled.pdf\" # 18 pages pipeline_options = ThreadedPdfPipelineOptions( accelerator_options=AcceleratorOptions( device=AcceleratorDevice.CUDA, ), ocr_batch_size=4, layout_batch_size=64, table_batch_size=4, ) pipeline_options.do_ocr = False doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ThreadedStandardPdfPipeline, pipeline_options=pipeline_options, ) } ) start_time = time.time() doc_converter.initialize_pipeline(InputFormat.PDF) init_runtime = time.time() - start_time _log.info(f\"Pipeline initialized in {init_runtime:.2f} seconds.\") start_time = time.time() conv_result = 
doc_converter.convert(input_doc_path) pipeline_runtime = time.time() - start_time assert conv_result.status == ConversionStatus.SUCCESS num_pages = len(conv_result.pages) _log.info(f\"Document converted in {pipeline_runtime:.2f} seconds.\") _log.info(f\" {num_pages / pipeline_runtime:.2f} pages/second.\") if __name__ == \"__main__\": main()"},{"location":"examples/gpu_standard_pipeline/#example-code","title":"Example code\u00b6","text":""},{"location":"examples/gpu_vlm_pipeline/","title":"VLM pipeline","text":"What this example does
Requirements
pip install doclingpip install vllmHow to run
python docs/examples/gpu_vlm_pipeline.pyThis example is part of a set of GPU optimization strategies. Read more about it in GPU support
In\u00a0[\u00a0]: Copied!import datetime\nimport logging\nimport time\nfrom pathlib import Path\n\nimport numpy as np\nfrom pydantic import TypeAdapter\n\nfrom docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.datamodel.settings import settings\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nfrom docling.utils.profiling import ProfilingItem\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n logging.getLogger(\"docling\").setLevel(logging.WARNING)\n _log.setLevel(logging.INFO)\n\n BATCH_SIZE = 64\n\n settings.perf.page_batch_size = BATCH_SIZE\n settings.debug.profile_pipeline_timings = True\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n # input_doc_path = data_folder / \"pdf\" / \"2305.03393v1.pdf\" # 14 pages\n input_doc_path = data_folder / \"pdf\" / \"redp5110_sampled.pdf\" # 18 pages\n\n vlm_options = ApiVlmOptions(\n url=\"http://localhost:8000/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000\n params=dict(\n model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id,\n max_tokens=4096,\n skip_special_tokens=True,\n ),\n prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt,\n timeout=90,\n scale=2.0,\n temperature=0.0,\n concurrency=BATCH_SIZE,\n stop_strings=[\"</doctag>\", \"<|end_of_text|>\"],\n response_format=ResponseFormat.DOCTAGS,\n )\n\n pipeline_options = VlmPipelineOptions(\n vlm_options=vlm_options,\n enable_remote_services=True, # required when using a remote inference service.\n )\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n )\n\n start_time = time.time()\n doc_converter.initialize_pipeline(InputFormat.PDF)\n end_time = time.time() - start_time\n _log.info(f\"Pipeline initialized in {end_time:.2f} seconds.\")\n\n now = datetime.datetime.now()\n conv_result = doc_converter.convert(input_doc_path)\n assert conv_result.status == ConversionStatus.SUCCESS\n\n num_pages = len(conv_result.pages)\n pipeline_runtime = conv_result.timings[\"pipeline_total\"].times[0]\n _log.info(f\"Document converted in {pipeline_runtime:.2f} seconds.\")\n _log.info(f\" [efficiency]: {num_pages / pipeline_runtime:.2f} pages/second.\")\n for stage in (\"page_init\", \"vlm\"):\n values = np.array(conv_result.timings[stage].times)\n _log.info(\n f\" [{stage}]: {np.min(values):.2f} / {np.median(values):.2f} / {np.max(values):.2f} seconds/page\"\n )\n\n TimingsT = TypeAdapter(dict[str, ProfilingItem])\n timings_file = Path(f\"result-timings-gpu-vlm-{now:%Y-%m-%d_%H-%M-%S}.json\")\n with timings_file.open(\"wb\") as fp:\n r = TimingsT.dump_json(conv_result.timings, indent=2)\n fp.write(r)\n _log.info(f\"Profile details in {timings_file}.\")\n\n\nif __name__ == \"__main__\":\n main()\n import datetime import logging import time from pathlib import Path import numpy as np from pydantic import TypeAdapter from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.datamodel.settings import 
settings from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline from docling.utils.profiling import ProfilingItem _log = logging.getLogger(__name__) def main(): logging.getLogger(\"docling\").setLevel(logging.WARNING) _log.setLevel(logging.INFO) BATCH_SIZE = 64 settings.perf.page_batch_size = BATCH_SIZE settings.debug.profile_pipeline_timings = True data_folder = Path(__file__).parent / \"../../tests/data\" # input_doc_path = data_folder / \"pdf\" / \"2305.03393v1.pdf\" # 14 pages input_doc_path = data_folder / \"pdf\" / \"redp5110_sampled.pdf\" # 18 pages vlm_options = ApiVlmOptions( url=\"http://localhost:8000/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000 params=dict( model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id, max_tokens=4096, skip_special_tokens=True, ), prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt, timeout=90, scale=2.0, temperature=0.0, concurrency=BATCH_SIZE, stop_strings=[\"\", \"<|end_of_text|>\"], response_format=ResponseFormat.DOCTAGS, ) pipeline_options = VlmPipelineOptions( vlm_options=vlm_options, enable_remote_services=True, # required when using a remote inference service. ) doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), } ) start_time = time.time() doc_converter.initialize_pipeline(InputFormat.PDF) end_time = time.time() - start_time _log.info(f\"Pipeline initialized in {end_time:.2f} seconds.\") now = datetime.datetime.now() conv_result = doc_converter.convert(input_doc_path) assert conv_result.status == ConversionStatus.SUCCESS num_pages = len(conv_result.pages) pipeline_runtime = conv_result.timings[\"pipeline_total\"].times[0] _log.info(f\"Document converted in {pipeline_runtime:.2f} seconds.\") _log.info(f\" [efficiency]: {num_pages / pipeline_runtime:.2f} pages/second.\") for stage in (\"page_init\", \"vlm\"): values = np.array(conv_result.timings[stage].times) _log.info( f\" [{stage}]: {np.min(values):.2f} / {np.median(values):.2f} / {np.max(values):.2f} seconds/page\" ) TimingsT = TypeAdapter(dict[str, ProfilingItem]) timings_file = Path(f\"result-timings-gpu-vlm-{now:%Y-%m-%d_%H-%M-%S}.json\") with timings_file.open(\"wb\") as fp: r = TimingsT.dump_json(conv_result.timings, indent=2) fp.write(r) _log.info(f\"Profile details in {timings_file}.\") if __name__ == \"__main__\": main()"},{"location":"examples/gpu_vlm_pipeline/#start-models-with-vllm","title":"Start models with vllm\u00b6","text":"vllm serve ibm-granite/granite-docling-258M \\\n --host 127.0.0.1 --port 8000 \\\n --max-num-seqs 512 \\\n --max-num-batched-tokens 8192 \\\n --enable-chunked-prefill \\\n --gpu-memory-utilization 0.9\n"},{"location":"examples/gpu_vlm_pipeline/#example-code","title":"Example code\u00b6","text":""},{"location":"examples/granitedocling_repetition_stopping/","title":"Granitedocling repetition stopping","text":"
Experimental VLM pipeline with custom repetition stopping criteria.
This script demonstrates the use of custom stopping criteria that detect repetitive location coordinate patterns in generated text and stop generation when such patterns are found.
What this example does
import logging\n\nfrom docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.utils.generation_utils import (\n DocTagsRepetitionStopper,\n)\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\nlogging.basicConfig(level=logging.INFO, format=\"%(levelname)s:%(name)s:%(message)s\")\n\n\n# Set up logging to see when repetition stopping is triggered\nlogging.basicConfig(level=logging.INFO)\n\n# Replace with a local path if preferred.\n# source = \"https://ibm.biz/docling-page-with-table\" # Example that shows no repetitions.\nsource = \"tests/data_scanned/old_newspaper.png\" # Example that creates repetitions.\nprint(f\"Processing document: {source}\")\n\n###### USING GRANITEDOCLING WITH CUSTOM REPETITION STOPPING\n\n## Using standard Huggingface Transformers (most portable, slowest)\ncustom_vlm_options = vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.model_copy()\n\n# Uncomment this to use MLX-accelerated version on Apple Silicon\n# custom_vlm_options = vlm_model_specs.GRANITEDOCLING_MLX.model_copy() # use this for Apple Silicon\n\n\n# Create custom VLM options with repetition stopping criteria\ncustom_vlm_options.custom_stopping_criteria = [\n DocTagsRepetitionStopper(N=32)\n] # check for repetitions for every 32 new tokens decoded.\n\npipeline_options = VlmPipelineOptions(\n vlm_options=custom_vlm_options,\n)\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.IMAGE: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n\ndoc = converter.convert(source=source).document\n\nprint(doc.export_to_markdown())\n\n## Using a remote VLM inference service (for example VLLM) - uncomment to use\n\n# custom_vlm_options = ApiVlmOptions(\n# url=\"http://localhost:8000/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000\n# params=dict(\n# model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id,\n# max_tokens=8192,\n# skip_special_tokens=True, # needed for VLLM\n# ),\n# headers={\n# \"Authorization\": \"Bearer YOUR_API_KEY\",\n# },\n# prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt,\n# timeout=90,\n# scale=2.0,\n# temperature=0.0,\n# response_format=ResponseFormat.DOCTAGS,\n# custom_stopping_criteria=[\n# DocTagsRepetitionStopper(N=1)\n# ], # check for repetitions for every new chunk of the response stream\n# )\n\n\n# pipeline_options = VlmPipelineOptions(\n# vlm_options=custom_vlm_options,\n# enable_remote_services=True, # required when using a remote inference service.\n# )\n\n# converter = DocumentConverter(\n# format_options={\n# InputFormat.IMAGE: PdfFormatOption(\n# pipeline_cls=VlmPipeline,\n# pipeline_options=pipeline_options,\n# ),\n# }\n# )\n\n# doc = converter.convert(source=source).document\n\n# print(doc.export_to_markdown())\n import logging from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import VlmPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.utils.generation_utils import ( DocTagsRepetitionStopper, ) from docling.pipeline.vlm_pipeline import VlmPipeline logging.basicConfig(level=logging.INFO, format=\"%(levelname)s:%(name)s:%(message)s\") # Set up logging to see when repetition stopping is triggered logging.basicConfig(level=logging.INFO) # Replace 
with a local path if preferred. # source = \"https://ibm.biz/docling-page-with-table\" # Example that shows no repetitions. source = \"tests/data_scanned/old_newspaper.png\" # Example that creates repetitions. print(f\"Processing document: {source}\") ###### USING GRANITEDOCLING WITH CUSTOM REPETITION STOPPING ## Using standard Huggingface Transformers (most portable, slowest) custom_vlm_options = vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.model_copy() # Uncomment this to use MLX-accelerated version on Apple Silicon # custom_vlm_options = vlm_model_specs.GRANITEDOCLING_MLX.model_copy() # use this for Apple Silicon # Create custom VLM options with repetition stopping criteria custom_vlm_options.custom_stopping_criteria = [ DocTagsRepetitionStopper(N=32) ] # check for repetitions for every 32 new tokens decoded. pipeline_options = VlmPipelineOptions( vlm_options=custom_vlm_options, ) converter = DocumentConverter( format_options={ InputFormat.IMAGE: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), } ) doc = converter.convert(source=source).document print(doc.export_to_markdown()) ## Using a remote VLM inference service (for example VLLM) - uncomment to use # custom_vlm_options = ApiVlmOptions( # url=\"http://localhost:8000/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000 # params=dict( # model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id, # max_tokens=8192, # skip_special_tokens=True, # needed for VLLM # ), # headers={ # \"Authorization\": \"Bearer YOUR_API_KEY\", # }, # prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt, # timeout=90, # scale=2.0, # temperature=0.0, # response_format=ResponseFormat.DOCTAGS, # custom_stopping_criteria=[ # DocTagsRepetitionStopper(N=1) # ], # check for repetitions for every new chunk of the response stream # ) # pipeline_options = VlmPipelineOptions( # vlm_options=custom_vlm_options, # enable_remote_services=True, # required when using a remote inference service. # ) # converter = DocumentConverter( # format_options={ # InputFormat.IMAGE: PdfFormatOption( # pipeline_cls=VlmPipeline, # pipeline_options=pipeline_options, # ), # } # ) # doc = converter.convert(source=source).document # print(doc.export_to_markdown())"},{"location":"examples/hybrid_chunking/","title":"Hybrid chunking","text":"Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.
For more details, see here.
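As a quick preview, the full flow covered in this notebook boils down to a few lines; here is a minimal sketch (each step is explained in the sections that follow, and the document source is the same sample used below):
from docling.chunking import HybridChunker\nfrom docling.document_converter import DocumentConverter\n\n# Minimal sketch of the hybrid chunking flow: convert, chunk, then contextualize\n# each chunk to obtain the text you would typically embed.\ndoc = DocumentConverter().convert(source=\"../../tests/data/md/wiki.md\").document\nchunker = HybridChunker()  # default tokenizer and max_tokens\nfor chunk in chunker.chunk(dl_doc=doc):\n    print(chunker.contextualize(chunk=chunk)[:100])\n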
In\u00a0[1]: Copied!%pip install -qU pip docling transformers\n%pip install -qU pip docling transformers
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
DOC_SOURCE = \"../../tests/data/md/wiki.md\"\nDOC_SOURCE = \"../../tests/data/md/wiki.md\"
We first convert the document:
In\u00a0[3]: Copied!from docling.document_converter import DocumentConverter\n\ndoc = DocumentConverter().convert(source=DOC_SOURCE).document\nfrom docling.document_converter import DocumentConverter doc = DocumentConverter().convert(source=DOC_SOURCE).document
For a basic chunking scenario, we can just instantiate a HybridChunker, which will use the default parameters.
from docling.chunking import HybridChunker\n\nchunker = HybridChunker()\nchunk_iter = chunker.chunk(dl_doc=doc)\nfrom docling.chunking import HybridChunker chunker = HybridChunker() chunk_iter = chunker.chunk(dl_doc=doc)
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors\n
\ud83d\udc49 NOTE: As you can see above, using the HybridChunker can sometimes trigger a warning from the transformers library; however, this is a \"false alarm\" \u2014 for details, check here.
Note that the text you would typically want to embed is the context-enriched one as returned by the contextualize() method:
for i, chunk in enumerate(chunk_iter):\n print(f\"=== {i} ===\")\n print(f\"chunk.text:\\n{f'{chunk.text[:300]}\u2026'!r}\")\n\n enriched_text = chunker.contextualize(chunk=chunk)\n print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}\u2026'!r}\")\n\n print()\n for i, chunk in enumerate(chunk_iter): print(f\"=== {i} ===\") print(f\"chunk.text:\\n{f'{chunk.text[:300]}\u2026'!r}\") enriched_text = chunker.contextualize(chunk=chunk) print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}\u2026'!r}\") print() === 0 ===\nchunk.text:\n'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver\u2026'\nchunker.contextualize(chunk):\n'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial \u2026'\n\n=== 1 ===\nchunk.text:\n'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1910s\u20131950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889\u2026'\n\n=== 2 ===\nchunk.text:\n'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1910s\u20131950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John \u2026'\n\n=== 3 ===\nchunk.text:\n'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1960s\u20131980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.\u2026'\n\nIn\u00a0[6]: Copied!
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom transformers import AutoTokenizer\n\nfrom docling.chunking import HybridChunker\n\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nMAX_TOKENS = 64 # set to a small number for illustrative purposes\n\ntokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n)\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer from docling.chunking import HybridChunker EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" MAX_TOKENS = 64 # set to a small number for illustrative purposes tokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case )
\ud83d\udc49 Alternatively, OpenAI tokenizers can be used as shown in the example below (uncomment to use \u2014 requires installing docling-core[chunking-openai]):
# import tiktoken\n\n# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer\n\n# tokenizer = OpenAITokenizer(\n# tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"),\n# max_tokens=128 * 1024, # context window length required for OpenAI tokenizers\n# )\n# import tiktoken # from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer # tokenizer = OpenAITokenizer( # tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"), # max_tokens=128 * 1024, # context window length required for OpenAI tokenizers # )
We can now instantiate our chunker:
In\u00a0[8]: Copied!chunker = HybridChunker(\n tokenizer=tokenizer,\n merge_peers=True, # optional, defaults to True\n)\nchunk_iter = chunker.chunk(dl_doc=doc)\nchunks = list(chunk_iter)\nchunker = HybridChunker( tokenizer=tokenizer, merge_peers=True, # optional, defaults to True ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter)
Points to notice when looking at the output chunks below:
for i, chunk in enumerate(chunks):\n print(f\"=== {i} ===\")\n txt_tokens = tokenizer.count_tokens(chunk.text)\n print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\")\n\n ser_txt = chunker.contextualize(chunk=chunk)\n ser_tokens = tokenizer.count_tokens(ser_txt)\n print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\")\n\n print()\n for i, chunk in enumerate(chunks): print(f\"=== {i} ===\") txt_tokens = tokenizer.count_tokens(chunk.text) print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\") ser_txt = chunker.contextualize(chunk=chunk) ser_tokens = tokenizer.count_tokens(ser_txt) print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\") print() === 0 ===\nchunk.text (55 tokens):\n'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\nchunker.contextualize(chunk) (56 tokens):\n'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\n\n=== 1 ===\nchunk.text (45 tokens):\n'IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\nchunker.contextualize(chunk) (46 tokens):\n'IBM\\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n\n=== 2 ===\nchunk.text (63 tokens):\n'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n\n=== 3 ===\nchunk.text (44 tokens):\n\"IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\nchunker.contextualize(chunk) (45 tokens):\n\"IBM\\nIBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n\n=== 4 ===\nchunk.text (63 tokens):\n'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, \u2014 its DOS software provided by Microsoft, \u2014 which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. 
Since the 1990s,'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, \u2014 its DOS software provided by Microsoft, \u2014 which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n\n=== 5 ===\nchunk.text (61 tokens):\n'IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\nchunker.contextualize(chunk) (62 tokens):\n'IBM\\nIBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n\n=== 6 ===\nchunk.text (62 tokens):\n\"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\nchunker.contextualize(chunk) (63 tokens):\n\"IBM\\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n\n=== 7 ===\nchunk.text (63 tokens):\n'language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nlanguage, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n\n=== 8 ===\nchunk.text (5 tokens):\n'Awards.[16]'\nchunker.contextualize(chunk) (6 tokens):\n'IBM\\nAwards.[16]'\n\n=== 9 ===\nchunk.text (56 tokens):\n'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\nchunker.contextualize(chunk) (60 tokens):\n'IBM\\n1910s\u20131950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. 
Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n\n=== 10 ===\nchunk.text (60 tokens):\n\"(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\nchunker.contextualize(chunk) (64 tokens):\n\"IBM\\n1910s\u20131950s\\n(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n\n=== 11 ===\nchunk.text (59 tokens):\n'Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\nchunker.contextualize(chunk) (63 tokens):\n'IBM\\n1910s\u20131950s\\nComputing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n\n=== 12 ===\nchunk.text (13 tokens):\n'D.C.; and Toronto, Canada.[22]'\nchunker.contextualize(chunk) (17 tokens):\n'IBM\\n1910s\u20131950s\\nD.C.; and Toronto, Canada.[22]'\n\n=== 13 ===\nchunk.text (60 tokens):\n'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\n1910s\u20131950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. 
Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n\n=== 14 ===\nchunk.text (59 tokens):\n\"on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\nchunker.contextualize(chunk) (63 tokens):\n\"IBM\\n1910s\u20131950s\\non Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n\n=== 15 ===\nchunk.text (23 tokens):\n\"practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\nchunker.contextualize(chunk) (27 tokens):\n\"IBM\\n1910s\u20131950s\\npractices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n\n=== 16 ===\nchunk.text (59 tokens):\n'He implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\nchunker.contextualize(chunk) (63 tokens):\n'IBM\\n1910s\u20131950s\\nHe implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n\n=== 17 ===\nchunk.text (60 tokens):\n'\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\n1910s\u20131950s\\n\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n\n=== 18 ===\nchunk.text (57 tokens):\n'clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\nchunker.contextualize(chunk) (61 tokens):\n'IBM\\n1910s\u20131950s\\nclumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n\n=== 19 ===\nchunk.text (21 tokens):\n'1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\nchunker.contextualize(chunk) (25 tokens):\n'IBM\\n1910s\u20131950s\\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n\n=== 20 ===\nchunk.text (22 tokens):\n'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\nchunker.contextualize(chunk) (26 tokens):\n'IBM\\n1960s\u20131980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the 
highly successful Selectric typewriter.'\n\n"},{"location":"examples/hybrid_chunking/#hybrid-chunking","title":"Hybrid chunking\u00b6","text":""},{"location":"examples/hybrid_chunking/#overview","title":"Overview\u00b6","text":""},{"location":"examples/hybrid_chunking/#setup","title":"Setup\u00b6","text":""},{"location":"examples/hybrid_chunking/#basic-usage","title":"Basic usage\u00b6","text":""},{"location":"examples/hybrid_chunking/#configuring-tokenization","title":"Configuring tokenization\u00b6","text":"
For more control over the chunking, we can parametrize tokenization as shown below.
In a RAG / retrieval context, it is important to make sure that the chunker and embedding model are using the same tokenizer.
\ud83d\udc49 HuggingFace transformers tokenizers can be used as shown in the following example:
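A minimal sketch, assuming a recent docling-core with the HuggingFaceTokenizer wrapper; the embedding model ID and the 64-token budget are illustrative choices (consistent with the token counts shown above), and `doc` is a previously converted DoclingDocument:
from transformers import AutoTokenizer

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # assumption: use your embedding model here
MAX_TOKENS = 64  # assumption: align with the embedding model's effective context budget

# Wrap the HF tokenizer so the chunker counts tokens exactly like the embedding model does.
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,
)

chunker = HybridChunker(tokenizer=tokenizer)
chunks = list(chunker.chunk(dl_doc=doc))  # `doc`: the DoclingDocument converted earlier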
"},{"location":"examples/inspect_picture_content/","title":"Inspect picture content","text":"Inspect the contents associated with each picture in a converted document.
What this example does
How to run
python docs/examples/inspect_picture_content.py.Notes
picture.get_image(doc).show() to visually inspect each picture.source to point to a different PDF if desired.from docling_core.types.doc import TextItem\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n# Change this to a local path if desired\nsource = \"tests/data/pdf/amt_handbook_sample.pdf\"\n\npipeline_options = PdfPipelineOptions()\n# Higher scale yields sharper crops when inspecting picture content.\npipeline_options.images_scale = 2\npipeline_options.generate_page_images = True\n\ndoc_converter = DocumentConverter(\n format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\n\nresult = doc_converter.convert(source)\n\ndoc = result.document\n\nfor picture in doc.pictures:\n # picture.get_image(doc).show() # display the picture\n print(picture.caption_text(doc), \" contains these elements:\")\n\n for item, level in doc.iterate_items(root=picture, traverse_pictures=True):\n if isinstance(item, TextItem):\n print(item.text)\n\n print(\"\\n\")\n from docling_core.types.doc import TextItem from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption # Change this to a local path if desired source = \"tests/data/pdf/amt_handbook_sample.pdf\" pipeline_options = PdfPipelineOptions() # Higher scale yields sharper crops when inspecting picture content. pipeline_options.images_scale = 2 pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)} ) result = doc_converter.convert(source) doc = result.document for picture in doc.pictures: # picture.get_image(doc).show() # display the picture print(picture.caption_text(doc), \" contains these elements:\") for item, level in doc.iterate_items(root=picture, traverse_pictures=True): if isinstance(item, TextItem): print(item.text) print(\"\\n\")"},{"location":"examples/minimal/","title":"Simple conversion","text":"What this example does
Requirements
pip install doclingHow to run
python docs/examples/minimal.pysource variable below.Notes
docs/examples/batch_convert.py.from docling.document_converter import DocumentConverter\n\n# Change this to a local path or another URL if desired.\n# Note: using the default URL requires network access; if offline, provide a\n# local file path (e.g., Path(\"/path/to/file.pdf\")).\nsource = \"https://arxiv.org/pdf/2408.09869\"\n\nconverter = DocumentConverter()\nresult = converter.convert(source)\n\n# Print Markdown to stdout.\nprint(result.document.export_to_markdown())\nfrom docling.document_converter import DocumentConverter # Change this to a local path or another URL if desired. # Note: using the default URL requires network access; if offline, provide a # local file path (e.g., Path(\"/path/to/file.pdf\")). source = \"https://arxiv.org/pdf/2408.09869\" converter = DocumentConverter() result = converter.convert(source) # Print Markdown to stdout. print(result.document.export_to_markdown())"},{"location":"examples/minimal_asr_pipeline/","title":"ASR pipeline with Whisper","text":"
Minimal ASR pipeline example: transcribe an audio file to Markdown text.
What this example does
Prerequisites
How to run
python docs/examples/minimal_asr_pipeline.py.Customizing the model
get_asr_converter() to manually override pipeline_options.asr_options with any model from asr_model_specs.InputFormat.AUDIO and AsrPipeline unchanged for a minimal setup.Input audio
tests/data/audio/sample_10s.mp3. Update audio_path to your own file if needed.from pathlib import Path\n\nfrom docling_core.types.doc import DoclingDocument\n\nfrom docling.datamodel import asr_model_specs\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import AsrPipelineOptions\nfrom docling.document_converter import AudioFormatOption, DocumentConverter\nfrom docling.pipeline.asr_pipeline import AsrPipeline\n\n\ndef get_asr_converter():\n \"\"\"Create a DocumentConverter configured for ASR with automatic model selection.\n\n Uses `asr_model_specs.WHISPER_TURBO` which automatically selects the best\n implementation for your hardware:\n - MLX Whisper Turbo for Apple Silicon (M1/M2/M3) with mlx-whisper installed\n - Native Whisper Turbo as fallback\n\n You can swap in another model spec from `docling.datamodel.asr_model_specs`\n to experiment with different model sizes.\n \"\"\"\n pipeline_options = AsrPipelineOptions()\n pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO\n\n converter = DocumentConverter(\n format_options={\n InputFormat.AUDIO: AudioFormatOption(\n pipeline_cls=AsrPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n return converter\n\n\ndef asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:\n \"\"\"Run the ASR pipeline and return a `DoclingDocument` transcript.\"\"\"\n # Check if the test audio file exists\n assert audio_path.exists(), f\"Test audio file not found: {audio_path}\"\n\n converter = get_asr_converter()\n\n # Convert the audio file\n result: ConversionResult = converter.convert(audio_path)\n\n # Verify conversion was successful\n assert result.status == ConversionStatus.SUCCESS, (\n f\"Conversion failed with status: {result.status}\"\n )\n return result.document\n\n\nif __name__ == \"__main__\":\n audio_path = Path(\"tests/data/audio/sample_10s.mp3\")\n\n doc = asr_pipeline_conversion(audio_path=audio_path)\n print(doc.export_to_markdown())\n\n # Expected output:\n #\n # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde\n #\n # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.\n from pathlib import Path from docling_core.types.doc import DoclingDocument from docling.datamodel import asr_model_specs from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import AsrPipelineOptions from docling.document_converter import AudioFormatOption, DocumentConverter from docling.pipeline.asr_pipeline import AsrPipeline def get_asr_converter(): \"\"\"Create a DocumentConverter configured for ASR with automatic model selection. Uses `asr_model_specs.WHISPER_TURBO` which automatically selects the best implementation for your hardware: - MLX Whisper Turbo for Apple Silicon (M1/M2/M3) with mlx-whisper installed - Native Whisper Turbo as fallback You can swap in another model spec from `docling.datamodel.asr_model_specs` to experiment with different model sizes. 
\"\"\" pipeline_options = AsrPipelineOptions() pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO converter = DocumentConverter( format_options={ InputFormat.AUDIO: AudioFormatOption( pipeline_cls=AsrPipeline, pipeline_options=pipeline_options, ) } ) return converter def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument: \"\"\"Run the ASR pipeline and return a `DoclingDocument` transcript.\"\"\" # Check if the test audio file exists assert audio_path.exists(), f\"Test audio file not found: {audio_path}\" converter = get_asr_converter() # Convert the audio file result: ConversionResult = converter.convert(audio_path) # Verify conversion was successful assert result.status == ConversionStatus.SUCCESS, ( f\"Conversion failed with status: {result.status}\" ) return result.document if __name__ == \"__main__\": audio_path = Path(\"tests/data/audio/sample_10s.mp3\") doc = asr_pipeline_conversion(audio_path=audio_path) print(doc.export_to_markdown()) # Expected output: # # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde # # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain."},{"location":"examples/minimal_vlm_pipeline/","title":"VLM pipeline with GraniteDocling","text":"Minimal VLM pipeline example: convert a PDF using a vision-language model.
What this example does
Prerequisites
How to run
python docs/examples/minimal_vlm_pipeline.py.Notes
source may be a local path or a URL to a PDF.vlm_model_specs.GRANITEDOCLING_MLX).docs/examples/compare_vlm_models.py.from docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\n# Convert a public arXiv PDF; replace with a local path if preferred.\nsource = \"https://arxiv.org/pdf/2501.17887\"\n\n###### USING SIMPLE DEFAULT VALUES\n# - GraniteDocling model\n# - Using the transformers framework\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n ),\n }\n)\n\ndoc = converter.convert(source=source).document\n\nprint(doc.export_to_markdown())\n\n\n###### USING MACOS MPS ACCELERATOR\n# Demonstrates using MLX on macOS with MPS acceleration (macOS only).\n# For more options see the `compare_vlm_models.py` example.\n\npipeline_options = VlmPipelineOptions(\n vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,\n)\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n\ndoc = converter.convert(source=source).document\n\nprint(doc.export_to_markdown())\n from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline # Convert a public arXiv PDF; replace with a local path if preferred. source = \"https://arxiv.org/pdf/2501.17887\" ###### USING SIMPLE DEFAULT VALUES # - GraniteDocling model # - Using the transformers framework converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, ), } ) doc = converter.convert(source=source).document print(doc.export_to_markdown()) ###### USING MACOS MPS ACCELERATOR # Demonstrates using MLX on macOS with MPS acceleration (macOS only). # For more options see the `compare_vlm_models.py` example. pipeline_options = VlmPipelineOptions( vlm_options=vlm_model_specs.GRANITEDOCLING_MLX, ) converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), } ) doc = converter.convert(source=source).document print(doc.export_to_markdown())"},{"location":"examples/mlx_whisper_example/","title":"Mlx whisper example","text":"In\u00a0[\u00a0]: Copied! \"\"\"\nExample script demonstrating MLX Whisper integration for Apple Silicon.\n\nThis script shows how to use the MLX Whisper models for speech recognition\non Apple Silicon devices with optimized performance.\n\"\"\"\n\"\"\" Example script demonstrating MLX Whisper integration for Apple Silicon. This script shows how to use the MLX Whisper models for speech recognition on Apple Silicon devices with optimized performance. \"\"\" In\u00a0[\u00a0]: Copied!
import argparse\nimport sys\nfrom pathlib import Path\nimport argparse import sys from pathlib import Path In\u00a0[\u00a0]: Copied!
# Add the repository root to the path so we can import docling\nsys.path.insert(0, str(Path(__file__).parent.parent.parent))\n# Add the repository root to the path so we can import docling sys.path.insert(0, str(Path(__file__).parent.parent.parent)) In\u00a0[\u00a0]: Copied!
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.asr_model_specs import (\n WHISPER_BASE,\n WHISPER_LARGE,\n WHISPER_MEDIUM,\n WHISPER_SMALL,\n WHISPER_TINY,\n WHISPER_TURBO,\n)\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import AsrPipelineOptions\nfrom docling.document_converter import AudioFormatOption, DocumentConverter\nfrom docling.pipeline.asr_pipeline import AsrPipeline\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.asr_model_specs import ( WHISPER_BASE, WHISPER_LARGE, WHISPER_MEDIUM, WHISPER_SMALL, WHISPER_TINY, WHISPER_TURBO, ) from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import AsrPipelineOptions from docling.document_converter import AudioFormatOption, DocumentConverter from docling.pipeline.asr_pipeline import AsrPipeline In\u00a0[\u00a0]: Copied!
def transcribe_audio_with_mlx_whisper(audio_file_path: str, model_size: str = \"base\"):\n \"\"\"\n Transcribe audio using Whisper models with automatic MLX optimization for Apple Silicon.\n\n Args:\n audio_file_path: Path to the audio file to transcribe\n model_size: Size of the Whisper model to use\n (\"tiny\", \"base\", \"small\", \"medium\", \"large\", \"turbo\")\n Note: MLX optimization is automatically used on Apple Silicon when available\n\n Returns:\n The transcribed text\n \"\"\"\n # Select the appropriate Whisper model (automatically uses MLX on Apple Silicon)\n model_map = {\n \"tiny\": WHISPER_TINY,\n \"base\": WHISPER_BASE,\n \"small\": WHISPER_SMALL,\n \"medium\": WHISPER_MEDIUM,\n \"large\": WHISPER_LARGE,\n \"turbo\": WHISPER_TURBO,\n }\n\n if model_size not in model_map:\n raise ValueError(\n f\"Invalid model size: {model_size}. Choose from: {list(model_map.keys())}\"\n )\n\n asr_options = model_map[model_size]\n\n # Configure accelerator options for Apple Silicon\n accelerator_options = AcceleratorOptions(device=AcceleratorDevice.MPS)\n\n # Create pipeline options\n pipeline_options = AsrPipelineOptions(\n asr_options=asr_options,\n accelerator_options=accelerator_options,\n )\n\n # Create document converter with MLX Whisper configuration\n converter = DocumentConverter(\n format_options={\n InputFormat.AUDIO: AudioFormatOption(\n pipeline_cls=AsrPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n\n # Run transcription\n result = converter.convert(Path(audio_file_path))\n\n if result.status.value == \"success\":\n # Extract text from the document\n text_content = []\n for item in result.document.texts:\n text_content.append(item.text)\n\n return \"\\n\".join(text_content)\n else:\n raise RuntimeError(f\"Transcription failed: {result.status}\")\n def transcribe_audio_with_mlx_whisper(audio_file_path: str, model_size: str = \"base\"): \"\"\" Transcribe audio using Whisper models with automatic MLX optimization for Apple Silicon. Args: audio_file_path: Path to the audio file to transcribe model_size: Size of the Whisper model to use (\"tiny\", \"base\", \"small\", \"medium\", \"large\", \"turbo\") Note: MLX optimization is automatically used on Apple Silicon when available Returns: The transcribed text \"\"\" # Select the appropriate Whisper model (automatically uses MLX on Apple Silicon) model_map = { \"tiny\": WHISPER_TINY, \"base\": WHISPER_BASE, \"small\": WHISPER_SMALL, \"medium\": WHISPER_MEDIUM, \"large\": WHISPER_LARGE, \"turbo\": WHISPER_TURBO, } if model_size not in model_map: raise ValueError( f\"Invalid model size: {model_size}. Choose from: {list(model_map.keys())}\" ) asr_options = model_map[model_size] # Configure accelerator options for Apple Silicon accelerator_options = AcceleratorOptions(device=AcceleratorDevice.MPS) # Create pipeline options pipeline_options = AsrPipelineOptions( asr_options=asr_options, accelerator_options=accelerator_options, ) # Create document converter with MLX Whisper configuration converter = DocumentConverter( format_options={ InputFormat.AUDIO: AudioFormatOption( pipeline_cls=AsrPipeline, pipeline_options=pipeline_options, ) } ) # Run transcription result = converter.convert(Path(audio_file_path)) if result.status.value == \"success\": # Extract text from the document text_content = [] for item in result.document.texts: text_content.append(item.text) return \"\\n\".join(text_content) else: raise RuntimeError(f\"Transcription failed: {result.status}\") In\u00a0[\u00a0]: Copied! 
def parse_args():\n \"\"\"Parse command line arguments.\"\"\"\n parser = argparse.ArgumentParser(\n description=\"MLX Whisper example for Apple Silicon speech recognition\",\n formatter_class=argparse.RawDescriptionHelpFormatter,\n epilog=\"\"\"\nExamples:\n\n# Use default test audio file\npython mlx_whisper_example.py\n\n# Use your own audio file\npython mlx_whisper_example.py --audio /path/to/your/audio.mp3\n\n# Use specific model size\npython mlx_whisper_example.py --audio audio.wav --model tiny\n\n# Use default test file with specific model\npython mlx_whisper_example.py --model turbo\n \"\"\",\n )\n\n parser.add_argument(\n \"--audio\",\n type=str,\n help=\"Path to audio file for transcription (default: tests/data/audio/sample_10s.mp3)\",\n )\n\n parser.add_argument(\n \"--model\",\n type=str,\n choices=[\"tiny\", \"base\", \"small\", \"medium\", \"large\", \"turbo\"],\n default=\"base\",\n help=\"Whisper model size to use (default: base)\",\n )\n\n return parser.parse_args()\ndef parse_args(): \"\"\"Parse command line arguments.\"\"\" parser = argparse.ArgumentParser( description=\"MLX Whisper example for Apple Silicon speech recognition\", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=\"\"\" Examples: # Use default test audio file python mlx_whisper_example.py # Use your own audio file python mlx_whisper_example.py --audio /path/to/your/audio.mp3 # Use specific model size python mlx_whisper_example.py --audio audio.wav --model tiny # Use default test file with specific model python mlx_whisper_example.py --model turbo \"\"\", ) parser.add_argument( \"--audio\", type=str, help=\"Path to audio file for transcription (default: tests/data/audio/sample_10s.mp3)\", ) parser.add_argument( \"--model\", type=str, choices=[\"tiny\", \"base\", \"small\", \"medium\", \"large\", \"turbo\"], default=\"base\", help=\"Whisper model size to use (default: base)\", ) return parser.parse_args() In\u00a0[\u00a0]: Copied!
def main():\n \"\"\"Main function to demonstrate MLX Whisper usage.\"\"\"\n args = parse_args()\n\n # Determine audio file path\n if args.audio:\n audio_file_path = args.audio\n else:\n # Use default test audio file if no audio file specified\n default_audio = (\n Path(__file__).parent.parent.parent\n / \"tests\"\n / \"data\"\n / \"audio\"\n / \"sample_10s.mp3\"\n )\n if default_audio.exists():\n audio_file_path = str(default_audio)\n print(\"No audio file specified, using default test file:\")\n print(f\" Audio file: {audio_file_path}\")\n print(f\" Model size: {args.model}\")\n print()\n else:\n print(\"Error: No audio file specified and default test file not found.\")\n print(\n \"Please specify an audio file with --audio or ensure tests/data/audio/sample_10s.mp3 exists.\"\n )\n sys.exit(1)\n\n if not Path(audio_file_path).exists():\n print(f\"Error: Audio file '{audio_file_path}' not found.\")\n sys.exit(1)\n\n try:\n print(f\"Transcribing '{audio_file_path}' using Whisper {args.model} model...\")\n print(\n \"Note: MLX optimization is automatically used on Apple Silicon when available.\"\n )\n print()\n\n transcribed_text = transcribe_audio_with_mlx_whisper(\n audio_file_path, args.model\n )\n\n print(\"Transcription Result:\")\n print(\"=\" * 50)\n print(transcribed_text)\n print(\"=\" * 50)\n\n except ImportError as e:\n print(f\"Error: {e}\")\n print(\"Please install mlx-whisper: pip install mlx-whisper\")\n print(\"Or install with uv: uv sync --extra asr\")\n sys.exit(1)\n except Exception as e:\n print(f\"Error during transcription: {e}\")\n sys.exit(1)\n def main(): \"\"\"Main function to demonstrate MLX Whisper usage.\"\"\" args = parse_args() # Determine audio file path if args.audio: audio_file_path = args.audio else: # Use default test audio file if no audio file specified default_audio = ( Path(__file__).parent.parent.parent / \"tests\" / \"data\" / \"audio\" / \"sample_10s.mp3\" ) if default_audio.exists(): audio_file_path = str(default_audio) print(\"No audio file specified, using default test file:\") print(f\" Audio file: {audio_file_path}\") print(f\" Model size: {args.model}\") print() else: print(\"Error: No audio file specified and default test file not found.\") print( \"Please specify an audio file with --audio or ensure tests/data/audio/sample_10s.mp3 exists.\" ) sys.exit(1) if not Path(audio_file_path).exists(): print(f\"Error: Audio file '{audio_file_path}' not found.\") sys.exit(1) try: print(f\"Transcribing '{audio_file_path}' using Whisper {args.model} model...\") print( \"Note: MLX optimization is automatically used on Apple Silicon when available.\" ) print() transcribed_text = transcribe_audio_with_mlx_whisper( audio_file_path, args.model ) print(\"Transcription Result:\") print(\"=\" * 50) print(transcribed_text) print(\"=\" * 50) except ImportError as e: print(f\"Error: {e}\") print(\"Please install mlx-whisper: pip install mlx-whisper\") print(\"Or install with uv: uv sync --extra asr\") sys.exit(1) except Exception as e: print(f\"Error during transcription: {e}\") sys.exit(1) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/parquet_images/","title":"Parquet benchmark","text":"
What this example does
Requirements
pip install doclingHow to run
python docs/examples/parquet_images.py FILEThe parquet file should be in a format similar to the ViDoRe V3 dataset. https://huggingface.co/collections/vidore/vidore-benchmark-v3
For example:
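A hypothetical invocation, assuming a local parquet slice at the path the script also uses as its default argument:
python docs/examples/parquet_images.py docs/examples/data/vidore_v3_hr-slice.parquet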
import io\nimport time\nfrom pathlib import Path\nfrom typing import Annotated, Literal\n\nimport pyarrow.parquet as pq\nimport typer\nfrom PIL import Image\n\nfrom docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import ConversionStatus, DocumentStream, InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PipelineOptions,\n RapidOcrOptions,\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.datamodel.settings import settings\nfrom docling.document_converter import DocumentConverter, ImageFormatOption\nfrom docling.pipeline.base_pipeline import ConvertPipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\n\ndef process_document(\n images: list[Image.Image], chunk_idx: int, doc_converter: DocumentConverter\n):\n \"\"\"Builds a tall image and sends it through Docling.\"\"\"\n\n print(f\"\\n--- Processing chunk {chunk_idx} with {len(images)} images ---\")\n\n # Convert images to mode RGB (TIFF pages must match)\n rgb_images = [im.convert(\"RGB\") for im in images]\n\n # First image is the base frame\n first = rgb_images[0]\n rest = rgb_images[1:]\n\n # Create multi-page TIFF using PIL frames\n buf = io.BytesIO()\n first.save(\n buf,\n format=\"TIFF\",\n save_all=True,\n append_images=rest,\n compression=\"tiff_deflate\", # good compression, optional\n )\n buf.seek(0)\n\n # Docling conversion\n doc_stream = DocumentStream(name=f\"doc_{chunk_idx}.tiff\", stream=buf)\n\n start_time = time.time()\n conv_result = doc_converter.convert(doc_stream)\n runtime = time.time() - start_time\n\n assert conv_result.status == ConversionStatus.SUCCESS\n\n pages = len(conv_result.pages)\n print(\n f\"Chunk {chunk_idx} converted in {runtime:.2f} sec ({pages / runtime:.2f} pages/s).\"\n )\n\n\ndef run(\n filename: Annotated[Path, typer.Argument()] = Path(\n \"docs/examples/data/vidore_v3_hr-slice.parquet\"\n ),\n doc_size: int = 192,\n batch_size: int = 64,\n pipeline: Literal[\"standard\", \"vlm\"] = \"standard\",\n):\n if pipeline == \"standard\":\n pipeline_cls: type[ConvertPipeline] = StandardPdfPipeline\n pipeline_options: PipelineOptions = PdfPipelineOptions(\n # ocr_options=RapidOcrOptions(backend=\"openvino\"),\n ocr_batch_size=batch_size,\n layout_batch_size=batch_size,\n table_batch_size=4,\n )\n elif pipeline == \"vlm\":\n settings.perf.page_batch_size = batch_size\n pipeline_cls = VlmPipeline\n vlm_options = ApiVlmOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id,\n max_tokens=4096,\n skip_special_tokens=True,\n ),\n prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt,\n timeout=90,\n scale=1.0,\n temperature=0.0,\n concurrency=batch_size,\n stop_strings=[\"</doctag>\", \"<|end_of_text|>\"],\n response_format=ResponseFormat.DOCTAGS,\n )\n pipeline_options = VlmPipelineOptions(\n vlm_options=vlm_options,\n enable_remote_services=True, # required when using a remote inference service.\n )\n else:\n raise RuntimeError(f\"Pipeline {pipeline} not available.\")\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.IMAGE: ImageFormatOption(\n pipeline_cls=pipeline_cls,\n pipeline_options=pipeline_options,\n )\n }\n )\n\n start_time = time.time()\n doc_converter.initialize_pipeline(InputFormat.IMAGE)\n init_runtime = time.time() - start_time\n print(f\"Pipeline initialized in 
{init_runtime:.2f} seconds.\")\n\n # ------------------------------------------------------------\n # Open parquet file in streaming mode\n # ------------------------------------------------------------\n pf = pq.ParquetFile(filename)\n\n image_buffer = [] # holds up to doc_size images\n chunk_idx = 0\n\n # ------------------------------------------------------------\n # Stream batches from parquet\n # ------------------------------------------------------------\n for batch in pf.iter_batches(batch_size=batch_size, columns=[\"image\"]):\n col = batch.column(\"image\")\n\n # Extract Python objects (PIL images)\n # Arrow stores them as Python objects inside an ObjectArray\n for i in range(len(col)):\n img_dict = col[i].as_py() # {\"bytes\": ..., \"path\": ...}\n pil_image = Image.open(io.BytesIO(img_dict[\"bytes\"]))\n image_buffer.append(pil_image)\n\n # If enough images gathered \u2192 process one doc\n if len(image_buffer) == doc_size:\n process_document(image_buffer, chunk_idx, doc_converter)\n image_buffer.clear()\n chunk_idx += 1\n\n # ------------------------------------------------------------\n # Process trailing images (last partial chunk)\n # ------------------------------------------------------------\n if image_buffer:\n process_document(image_buffer, chunk_idx, doc_converter)\n\n\nif __name__ == \"__main__\":\n typer.run(run)\n import io import time from pathlib import Path from typing import Annotated, Literal import pyarrow.parquet as pq import typer from PIL import Image from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import ConversionStatus, DocumentStream, InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PipelineOptions, RapidOcrOptions, VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.datamodel.settings import settings from docling.document_converter import DocumentConverter, ImageFormatOption from docling.pipeline.base_pipeline import ConvertPipeline from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline from docling.pipeline.vlm_pipeline import VlmPipeline def process_document( images: list[Image.Image], chunk_idx: int, doc_converter: DocumentConverter ): \"\"\"Builds a tall image and sends it through Docling.\"\"\" print(f\"\\n--- Processing chunk {chunk_idx} with {len(images)} images ---\") # Convert images to mode RGB (TIFF pages must match) rgb_images = [im.convert(\"RGB\") for im in images] # First image is the base frame first = rgb_images[0] rest = rgb_images[1:] # Create multi-page TIFF using PIL frames buf = io.BytesIO() first.save( buf, format=\"TIFF\", save_all=True, append_images=rest, compression=\"tiff_deflate\", # good compression, optional ) buf.seek(0) # Docling conversion doc_stream = DocumentStream(name=f\"doc_{chunk_idx}.tiff\", stream=buf) start_time = time.time() conv_result = doc_converter.convert(doc_stream) runtime = time.time() - start_time assert conv_result.status == ConversionStatus.SUCCESS pages = len(conv_result.pages) print( f\"Chunk {chunk_idx} converted in {runtime:.2f} sec ({pages / runtime:.2f} pages/s).\" ) def run( filename: Annotated[Path, typer.Argument()] = Path( \"docs/examples/data/vidore_v3_hr-slice.parquet\" ), doc_size: int = 192, batch_size: int = 64, pipeline: Literal[\"standard\", \"vlm\"] = \"standard\", ): if pipeline == \"standard\": pipeline_cls: type[ConvertPipeline] = StandardPdfPipeline pipeline_options: PipelineOptions = PdfPipelineOptions( # 
ocr_options=RapidOcrOptions(backend=\"openvino\"), ocr_batch_size=batch_size, layout_batch_size=batch_size, table_batch_size=4, ) elif pipeline == \"vlm\": settings.perf.page_batch_size = batch_size pipeline_cls = VlmPipeline vlm_options = ApiVlmOptions( url=\"http://localhost:8000/v1/chat/completions\", params=dict( model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id, max_tokens=4096, skip_special_tokens=True, ), prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt, timeout=90, scale=1.0, temperature=0.0, concurrency=batch_size, stop_strings=[\"\", \"<|end_of_text|>\"], response_format=ResponseFormat.DOCTAGS, ) pipeline_options = VlmPipelineOptions( vlm_options=vlm_options, enable_remote_services=True, # required when using a remote inference service. ) else: raise RuntimeError(f\"Pipeline {pipeline} not available.\") doc_converter = DocumentConverter( format_options={ InputFormat.IMAGE: ImageFormatOption( pipeline_cls=pipeline_cls, pipeline_options=pipeline_options, ) } ) start_time = time.time() doc_converter.initialize_pipeline(InputFormat.IMAGE) init_runtime = time.time() - start_time print(f\"Pipeline initialized in {init_runtime:.2f} seconds.\") # ------------------------------------------------------------ # Open parquet file in streaming mode # ------------------------------------------------------------ pf = pq.ParquetFile(filename) image_buffer = [] # holds up to doc_size images chunk_idx = 0 # ------------------------------------------------------------ # Stream batches from parquet # ------------------------------------------------------------ for batch in pf.iter_batches(batch_size=batch_size, columns=[\"image\"]): col = batch.column(\"image\") # Extract Python objects (PIL images) # Arrow stores them as Python objects inside an ObjectArray for i in range(len(col)): img_dict = col[i].as_py() # {\"bytes\": ..., \"path\": ...} pil_image = Image.open(io.BytesIO(img_dict[\"bytes\"])) image_buffer.append(pil_image) # If enough images gathered \u2192 process one doc if len(image_buffer) == doc_size: process_document(image_buffer, chunk_idx, doc_converter) image_buffer.clear() chunk_idx += 1 # ------------------------------------------------------------ # Process trailing images (last partial chunk) # ------------------------------------------------------------ if image_buffer: process_document(image_buffer, chunk_idx, doc_converter) if __name__ == \"__main__\": typer.run(run)"},{"location":"examples/parquet_images/#start-models-with-vllm","title":"Start models with vllm\u00b6","text":"vllm serve ibm-granite/granite-docling-258M \\\n --host 127.0.0.1 --port 8000 \\\n --max-num-seqs 512 \\\n --max-num-batched-tokens 8192 \\\n --enable-chunked-prefill \\\n --gpu-memory-utilization 0.9\n"},{"location":"examples/pictures_description/","title":"Annotate picture with local VLM","text":"In\u00a0[\u00a0]: Copied!
%pip install -q docling[vlm] ipython\n%pip install -q docling[vlm] ipython
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[1]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[2]: Copied!
# The source document\nDOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\"\n# The source document DOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\" In\u00a0[3]: Copied!
from docling.datamodel.pipeline_options import granite_picture_description\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = (\n granite_picture_description # <-- the model choice\n)\npipeline_options.picture_description_options.prompt = (\n \"Describe the image in three sentences. Be concise and accurate.\"\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\ndoc = converter.convert(DOC_SOURCE).document\n from docling.datamodel.pipeline_options import granite_picture_description pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = ( granite_picture_description # <-- the model choice ) pipeline_options.picture_description_options.prompt = ( \"Describe the image in three sentences. Be concise and accurate.\" ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(DOC_SOURCE).document Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.\n
In\u00a0[4]: Copied!
from docling_core.types.doc.document import PictureDescriptionData\nfrom IPython import display\n\nhtml_buffer = []\n# display the first 5 pictures and their captions and annotations:\nfor pic in doc.pictures[:5]:\n html_item = (\n f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n f'<img src=\"{pic.image.uri!s}\" /><br />'\n f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n )\n for annotation in pic.annotations:\n if not isinstance(annotation, PictureDescriptionData):\n continue\n html_item += (\n f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n )\n html_buffer.append(html_item)\ndisplay.HTML(\"<hr />\".join(html_buffer))\n from docling_core.types.doc.document import PictureDescriptionData from IPython import display html_buffer = [] # display the first 5 pictures and their captions and annotations: for pic in doc.pictures[:5]: html_item = ( f\"Picture {pic.self_ref}\" f'' f\"Caption{pic.caption_text(doc=doc)}\" ) for annotation in pic.annotations: if not isinstance(annotation, PictureDescriptionData): continue html_item += ( f\"Annotations ({annotation.provenance}){annotation.text}\\n\" ) html_buffer.append(html_item) display.HTML(\"\".join(html_buffer)) Out[4]: Picture #/pictures/0CaptionFigure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a poster with some text and images. Picture #/pictures/1CaptionFigure 2: Dataset categories and sample counts for documents and pages.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a pie chart. In the pie chart we can see the categories and the number of documents in each category. Picture #/pictures/2CaptionFigure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a graph. On the x-axis we can see the number of pages. On the y-axis we can see the seconds. Picture #/pictures/3CaptionFigure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a bar chart and a line chart. In the bar chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. In the line chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. Picture #/pictures/4CaptionFigure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. 
The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a bar chart. In the chart we can see the CPU, Max, GPU, and sec/page. In\u00a0[7]: Copied! from docling.datamodel.pipeline_options import smolvlm_picture_description\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = (\n smolvlm_picture_description # <-- the model choice\n)\npipeline_options.picture_description_options.prompt = (\n \"Describe the image in three sentences. Be consise and accurate.\"\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\ndoc = converter.convert(DOC_SOURCE).document\n from docling.datamodel.pipeline_options import smolvlm_picture_description pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = ( smolvlm_picture_description # <-- the model choice ) pipeline_options.picture_description_options.prompt = ( \"Describe the image in three sentences. Be consise and accurate.\" ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(DOC_SOURCE).document In\u00a0[6]: Copied! from docling_core.types.doc.document import PictureDescriptionData\nfrom IPython import display\n\nhtml_buffer = []\n# display the first 5 pictures and their captions and annotations:\nfor pic in doc.pictures[:5]:\n html_item = (\n f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n f'<img src=\"{pic.image.uri!s}\" /><br />'\n f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n )\n for annotation in pic.annotations:\n if not isinstance(annotation, PictureDescriptionData):\n continue\n html_item += (\n f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n )\n html_buffer.append(html_item)\ndisplay.HTML(\"<hr />\".join(html_buffer))\n from docling_core.types.doc.document import PictureDescriptionData from IPython import display html_buffer = [] # display the first 5 pictures and their captions and annotations: for pic in doc.pictures[:5]: html_item = ( f\"Picture {pic.self_ref}\" f'' f\"Caption{pic.caption_text(doc=doc)}\" ) for annotation in pic.annotations: if not isinstance(annotation, PictureDescriptionData): continue html_item += ( f\"Annotations ({annotation.provenance}){annotation.text}\\n\" ) html_buffer.append(html_item) display.HTML(\"\".join(html_buffer)) Out[6]: Picture #/pictures/0CaptionFigure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)This is a page that has different types of documents on it. 
Picture #/pictures/1CaptionFigure 2: Dataset categories and sample counts for documents and pages.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)Here is a page-by-page list of documents per category: - Science - Articles - Law and Regulations - Articles - Misc. Picture #/pictures/2CaptionFigure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)The image is a bar chart that shows the number of pages of a website as a function of the number of pages of the website. The x-axis represents the number of pages, ranging from 100 to 10,000. The y-axis represents the number of pages, ranging from 100 to 10,000. The chart is labeled \"Number of pages\" and has a legend at the top of the chart that indicates the number of pages. The chart shows a clear trend: as the number of pages increases, the number of pages decreases. This is evident from the following points: - The number of pages increases from 100 to 1000. - The number of pages decreases from 1000 to 10,000. - The number of pages increases from 10,000 to 10,000. Picture #/pictures/3CaptionFigure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)bar chart with different colored bars representing different data points. Picture #/pictures/4CaptionFigure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)A bar chart with the following information: - The x-axis represents the number of pages, ranging from 0 to 14. - The y-axis represents the page count, ranging from 0 to 14. - The chart has three categories: Marker, Unstructured, and Detailed. - The x-axis is labeled \"see/page.\" - The y-axis is labeled \"Page Count.\" - The chart shows that the Marker category has the highest number of pages, followed by the Unstructured category, and then the Detailed category. In\u00a0[8]: Copied! from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM\n prompt=\"Describe the image in three sentences. 
Be concise and accurate.\",\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\n\n# Uncomment to run:\n# doc = converter.convert(DOC_SOURCE).document\n from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = PictureDescriptionVlmOptions( repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM prompt=\"Describe the image in three sentences. Be concise and accurate.\", ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) # Uncomment to run: # doc = converter.convert(DOC_SOURCE).document In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/pictures_description/#describe-pictures-with-granite-vision","title":"Describe pictures with Granite Vision\u00b6","text":"
This section runs the ibm-granite/granite-vision-3.1-2b-preview model locally to describe the pictures in the document.
"},{"location":"examples/pictures_description/#describe-pictures-with-smolvlm","title":"Describe pictures with SmolVLM\u00b6","text":"This section will run locally the HuggingFaceTB/SmolVLM-256M-Instruct model to describe the pictures of the document.
"},{"location":"examples/pictures_description/#use-other-vision-models","title":"Use other vision models\u00b6","text":"The examples above can also be reproduced using other vision model. The Docling options PictureDescriptionVlmOptions allows to specify your favorite vision model from the Hugging Face Hub.
Describe pictures using a remote VLM API (vLLM, LM Studio, or watsonx.ai).
What this example does
PictureDescriptionApiOptions for local or cloud providers.Prerequisites
python-dotenv if loading env vars from a .env file.WX_API_KEY and WX_PROJECT_ID in the environment.How to run
python docs/examples/pictures_description_api.py.enable_remote_services=True (already set).Notes
http://localhost:8000/v1/chat/completions.http://localhost:1234/v1/chat/completions.import logging\nimport os\nfrom pathlib import Path\n\nimport requests\nfrom docling_core.types.doc import PictureItem\nfrom dotenv import load_dotenv\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionApiOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n### Example of PictureDescriptionApiOptions definitions\n\n#### Using vLLM\n# Models can be launched via:\n# $ vllm serve MODEL_NAME\n\n\ndef vllm_local_options(model: str):\n options = PictureDescriptionApiOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=model,\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=90,\n )\n return options\n\n\n#### Using LM Studio\n\n\ndef lms_local_options(model: str):\n options = PictureDescriptionApiOptions(\n url=\"http://localhost:1234/v1/chat/completions\",\n params=dict(\n model=model,\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=90,\n )\n return options\n\n\n#### Using a cloud service like IBM watsonx.ai\n\n\ndef watsonx_vlm_options():\n load_dotenv()\n api_key = os.environ.get(\"WX_API_KEY\")\n project_id = os.environ.get(\"WX_PROJECT_ID\")\n\n def _get_iam_access_token(api_key: str) -> str:\n res = requests.post(\n url=\"https://iam.cloud.ibm.com/identity/token\",\n headers={\n \"Content-Type\": \"application/x-www-form-urlencoded\",\n },\n data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\",\n )\n res.raise_for_status()\n api_out = res.json()\n print(f\"{api_out=}\")\n return api_out[\"access_token\"]\n\n # Background information in case the model_id is updated:\n # [1] Official list of models: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx\n # [2] Info on granite vision 3.3: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-ibm.html?context=wx#granite-vision-3-3-2b\n\n options = PictureDescriptionApiOptions(\n url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\",\n params=dict(\n model_id=\"ibm/granite-vision-3-3-2b\",\n project_id=project_id,\n parameters=dict(\n max_new_tokens=400,\n ),\n ),\n headers={\n \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key),\n },\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=60,\n )\n return options\n\n\n### Usage and conversion\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = PdfPipelineOptions(\n enable_remote_services=True # <-- this is required!\n )\n pipeline_options.do_picture_description = True\n\n # The PictureDescriptionApiOptions() allows to interface with APIs supporting\n # the multi-modal chat interface. Here follow a few example on how to configure those.\n #\n # One possibility is self-hosting model, e.g. 
via VLLM.\n # $ vllm serve MODEL_NAME\n # Then PictureDescriptionApiOptions can point to the localhost endpoint.\n\n # Example for the Granite Vision model:\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = vllm_local_options(\n # model=\"ibm-granite/granite-vision-3.3-2b\"\n # )\n\n # Example for the SmolVLM model:\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = vllm_local_options(\n # model=\"HuggingFaceTB/SmolVLM-256M-Instruct\"\n # )\n\n # For using models on LM Studio using the built-in GGUF or MLX runtimes, e.g. the SmolVLM model:\n # (uncomment the following lines)\n pipeline_options.picture_description_options = lms_local_options(\n model=\"smolvlm-256m-instruct\"\n )\n\n # Another possibility is using online services, e.g. watsonx.ai.\n # Using requires setting the env variables WX_API_KEY and WX_PROJECT_ID.\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = watsonx_vlm_options()\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n\n for element, _level in result.document.iterate_items():\n if isinstance(element, PictureItem):\n print(\n f\"Picture {element.self_ref}\\n\"\n f\"Caption: {element.caption_text(doc=result.document)}\\n\"\n f\"Annotations: {element.annotations}\"\n )\n\n\nif __name__ == \"__main__\":\n main()\n import logging import os from pathlib import Path import requests from docling_core.types.doc import PictureItem from dotenv import load_dotenv from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PictureDescriptionApiOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption ### Example of PictureDescriptionApiOptions definitions #### Using vLLM # Models can be launched via: # $ vllm serve MODEL_NAME def vllm_local_options(model: str): options = PictureDescriptionApiOptions( url=\"http://localhost:8000/v1/chat/completions\", params=dict( model=model, seed=42, max_completion_tokens=200, ), prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=90, ) return options #### Using LM Studio def lms_local_options(model: str): options = PictureDescriptionApiOptions( url=\"http://localhost:1234/v1/chat/completions\", params=dict( model=model, seed=42, max_completion_tokens=200, ), prompt=\"Describe the image in three sentences. 
Be consise and accurate.\", timeout=90, ) return options #### Using a cloud service like IBM watsonx.ai def watsonx_vlm_options(): load_dotenv() api_key = os.environ.get(\"WX_API_KEY\") project_id = os.environ.get(\"WX_PROJECT_ID\") def _get_iam_access_token(api_key: str) -> str: res = requests.post( url=\"https://iam.cloud.ibm.com/identity/token\", headers={ \"Content-Type\": \"application/x-www-form-urlencoded\", }, data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\", ) res.raise_for_status() api_out = res.json() print(f\"{api_out=}\") return api_out[\"access_token\"] # Background information in case the model_id is updated: # [1] Official list of models: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx # [2] Info on granite vision 3.3: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-ibm.html?context=wx#granite-vision-3-3-2b options = PictureDescriptionApiOptions( url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\", params=dict( model_id=\"ibm/granite-vision-3-3-2b\", project_id=project_id, parameters=dict( max_new_tokens=400, ), ), headers={ \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key), }, prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=60, ) return options ### Usage and conversion def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = PdfPipelineOptions( enable_remote_services=True # <-- this is required! ) pipeline_options.do_picture_description = True # The PictureDescriptionApiOptions() allows to interface with APIs supporting # the multi-modal chat interface. Here follow a few example on how to configure those. # # One possibility is self-hosting model, e.g. via VLLM. # $ vllm serve MODEL_NAME # Then PictureDescriptionApiOptions can point to the localhost endpoint. # Example for the Granite Vision model: # (uncomment the following lines) # pipeline_options.picture_description_options = vllm_local_options( # model=\"ibm-granite/granite-vision-3.3-2b\" # ) # Example for the SmolVLM model: # (uncomment the following lines) # pipeline_options.picture_description_options = vllm_local_options( # model=\"HuggingFaceTB/SmolVLM-256M-Instruct\" # ) # For using models on LM Studio using the built-in GGUF or MLX runtimes, e.g. the SmolVLM model: # (uncomment the following lines) pipeline_options.picture_description_options = lms_local_options( model=\"smolvlm-256m-instruct\" ) # Another possibility is using online services, e.g. watsonx.ai. # Using requires setting the env variables WX_API_KEY and WX_PROJECT_ID. # (uncomment the following lines) # pipeline_options.picture_description_options = watsonx_vlm_options() doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) result = doc_converter.convert(input_doc_path) for element, _level in result.document.iterate_items(): if isinstance(element, PictureItem): print( f\"Picture {element.self_ref}\\n\" f\"Caption: {element.caption_text(doc=result.document)}\\n\" f\"Annotations: {element.annotations}\" ) if __name__ == \"__main__\": main()"},{"location":"examples/pii_obfuscate/","title":"Detect and obfuscate PII","text":"Detect and obfuscate PII using a Hugging Face NER model.
What this example does
- Converts a test PDF, runs an NER model over the extracted text, and replaces detected person, organization, and location mentions with stable placeholder IDs (e.g. person-1) before exporting Markdown.

Prerequisites
- pip install transformers
- For the GLiNER engine: pip install gliner (if needed for CPU-only envs: pip install torch --extra-index-url https://download.pytorch.org/whl/cpu)
- Optional: point HF_MODEL to a different NER/PII model.

How to run
- python docs/examples/pii_obfuscate.py
- To use the GLiNER engine, pass --engine gliner or set PII_ENGINE=gliner.
- Outputs are written to scratch/.

Notes
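The obfuscation logic below keys off the dictionaries returned by the NER engine. As a quick orientation, here is a minimal sketch, not part of the example script, of what the Hugging Face token-classification pipeline yields with aggregation_strategy="simple" (the sample sentence is made up):

from transformers import pipeline

# Hypothetical smoke test: inspect the fields the obfuscator relies on.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)

for ent in ner("Alice Smith works for Acme Corp in Zurich."):
    # Each entry is a dict along the lines of:
    # {"entity_group": "PER", "word": "Alice Smith", "start": 0, "end": 11, "score": ...}
    print(ent["entity_group"], repr(ent["word"]), ent["start"], ent["end"])

SimplePiiObfuscator maps these entity_group labels (PER, ORG, LOC, ...) to coarse types and swaps each surface string for a stable ID such as person-1, so repeated mentions of the same entity receive the same placeholder.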
import argparse\nimport logging\nimport os\nimport re\nfrom pathlib import Path\nfrom typing import Dict, List, Tuple\n\nfrom docling_core.types.doc import ImageRefMode, TableItem, TextItem\nfrom tabulate import tabulate\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\nIMAGE_RESOLUTION_SCALE = 2.0\nHF_MODEL = \"dslim/bert-base-NER\" # Swap with another HF NER/PII model if desired, eg https://huggingface.co/urchade/gliner_multi_pii-v1 looks very promising too!\nGLINER_MODEL = \"urchade/gliner_multi_pii-v1\"\n\n\ndef _build_simple_ner_pipeline():\n \"\"\"Create a Hugging Face token-classification pipeline for NER.\n\n Returns a callable like: ner(text) -> List[dict]\n \"\"\"\n try:\n from transformers import (\n AutoModelForTokenClassification,\n AutoTokenizer,\n pipeline,\n )\n except Exception:\n _log.error(\"Transformers not installed. Please run: pip install transformers\")\n raise\n\n tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)\n model = AutoModelForTokenClassification.from_pretrained(HF_MODEL)\n ner = pipeline(\n \"token-classification\",\n model=model,\n tokenizer=tokenizer,\n aggregation_strategy=\"simple\", # groups subwords into complete entities\n # Note: modern Transformers returns `start`/`end` when possible with aggregation\n )\n return ner\n\n\nclass SimplePiiObfuscator:\n \"\"\"Tracks PII strings and replaces them with stable IDs per entity type.\"\"\"\n\n def __init__(self, ner_callable):\n self.ner = ner_callable\n self.entity_map: Dict[str, str] = {}\n self.counters: Dict[str, int] = {\n \"person\": 0,\n \"org\": 0,\n \"location\": 0,\n \"misc\": 0,\n }\n # Map model labels to our coarse types\n self.label_map = {\n \"PER\": \"person\",\n \"PERSON\": \"person\",\n \"ORG\": \"org\",\n \"ORGANIZATION\": \"org\",\n \"LOC\": \"location\",\n \"LOCATION\": \"location\",\n \"GPE\": \"location\",\n # Fallbacks\n \"MISC\": \"misc\",\n \"O\": \"misc\",\n }\n # Only obfuscate these by default. Adjust as needed.\n self.allowed_types = {\"person\", \"org\", \"location\"}\n\n def _next_id(self, typ: str) -> str:\n self.counters[typ] += 1\n return f\"{typ}-{self.counters[typ]}\"\n\n def _normalize(self, s: str) -> str:\n return re.sub(r\"\\s+\", \" \", s).strip()\n\n def _extract_entities(self, text: str) -> List[Tuple[str, str]]:\n \"\"\"Run NER and return a list of (surface_text, type) to obfuscate.\"\"\"\n if not text:\n return []\n results = self.ner(text)\n # Collect normalized items with optional span info\n items = []\n for r in results:\n raw_label = r.get(\"entity_group\") or r.get(\"entity\") or \"MISC\"\n label = self.label_map.get(raw_label, \"misc\")\n if label not in self.allowed_types:\n continue\n start = r.get(\"start\")\n end = r.get(\"end\")\n word = self._normalize(r.get(\"word\") or r.get(\"text\") or \"\")\n items.append({\"label\": label, \"start\": start, \"end\": end, \"word\": word})\n\n found: List[Tuple[str, str]] = []\n # If the pipeline provides character spans, merge consecutive/overlapping\n # entities of the same type into a single span, then take the substring\n # from the original text. 
This handles cases like subword tokenization\n # where multiple adjacent pieces belong to the same named entity.\n have_spans = any(i[\"start\"] is not None and i[\"end\"] is not None for i in items)\n if have_spans:\n spans = [\n i for i in items if i[\"start\"] is not None and i[\"end\"] is not None\n ]\n # Ensure processing order by start (then end)\n spans.sort(key=lambda x: (x[\"start\"], x[\"end\"]))\n\n merged = []\n for s in spans:\n if not merged:\n merged.append(dict(s))\n continue\n last = merged[-1]\n if s[\"label\"] == last[\"label\"] and s[\"start\"] <= last[\"end\"]:\n # Merge identical, overlapping, or touching spans of same type\n last[\"start\"] = min(last[\"start\"], s[\"start\"])\n last[\"end\"] = max(last[\"end\"], s[\"end\"])\n else:\n merged.append(dict(s))\n\n for m in merged:\n surface = self._normalize(text[m[\"start\"] : m[\"end\"]])\n if surface:\n found.append((surface, m[\"label\"]))\n\n # Include any items lacking spans as-is (fallback)\n for i in items:\n if i[\"start\"] is None or i[\"end\"] is None:\n if i[\"word\"]:\n found.append((i[\"word\"], i[\"label\"]))\n else:\n # Fallback when spans aren't provided: return normalized words\n for i in items:\n if i[\"word\"]:\n found.append((i[\"word\"], i[\"label\"]))\n return found\n\n def obfuscate_text(self, text: str) -> str:\n if not text:\n return text\n\n entities = self._extract_entities(text)\n if not entities:\n return text\n\n # Deduplicate per text, keep stable global mapping\n unique_words: Dict[str, str] = {}\n for word, label in entities:\n if word not in self.entity_map:\n replacement = self._next_id(label)\n self.entity_map[word] = replacement\n unique_words[word] = self.entity_map[word]\n\n # Replace longer matches first to avoid partial overlaps\n sorted_pairs = sorted(\n unique_words.items(), key=lambda x: len(x[0]), reverse=True\n )\n\n def replace_once(s: str, old: str, new: str) -> str:\n # Use simple substring replacement; for stricter matching, use word boundaries\n # when appropriate (e.g., names). This is a demo, keep it simple.\n pattern = re.escape(old)\n return re.sub(pattern, new, s)\n\n obfuscated = text\n for old, new in sorted_pairs:\n obfuscated = replace_once(obfuscated, old, new)\n return obfuscated\n\n\ndef _build_gliner_model():\n \"\"\"Create a GLiNER model for PII-like entity extraction.\n\n Returns a tuple (model, labels) where model.predict_entities(text, labels)\n yields entities with \"text\" and \"label\" fields.\n \"\"\"\n try:\n from gliner import GLiNER # type: ignore\n except Exception:\n _log.error(\n \"GLiNER not installed. Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu\"\n )\n raise\n\n model = GLiNER.from_pretrained(GLINER_MODEL)\n # Curated set of labels for PII detection. Adjust as needed.\n labels = [\n # \"work\",\n \"booking number\",\n \"personally identifiable information\",\n \"driver licence\",\n \"person\",\n \"full address\",\n \"company\",\n # \"actor\",\n # \"character\",\n \"email\",\n \"passport number\",\n \"Social Security Number\",\n \"phone number\",\n ]\n return model, labels\n\n\nclass AdvancedPIIObfuscator:\n \"\"\"PII obfuscator powered by GLiNER with fine-grained labels.\n\n - Uses GLiNER's `predict_entities(text, labels)` to detect entities.\n - Obfuscates with stable IDs per fine-grained label, e.g. 
`email-1`.\n \"\"\"\n\n def __init__(self, gliner_model, labels: List[str]):\n self.model = gliner_model\n self.labels = labels\n self.entity_map: Dict[str, str] = {}\n self.counters: Dict[str, int] = {}\n\n def _normalize(self, s: str) -> str:\n return re.sub(r\"\\s+\", \" \", s).strip()\n\n def _norm_label(self, label: str) -> str:\n return (\n re.sub(\n r\"[^a-z0-9_]+\", \"_\", label.lower().replace(\" \", \"_\").replace(\"-\", \"_\")\n ).strip(\"_\")\n or \"pii\"\n )\n\n def _next_id(self, typ: str) -> str:\n self.cc(typ)\n self.counters[typ] += 1\n return f\"{typ}-{self.counters[typ]}\"\n\n def cc(self, typ: str) -> None:\n if typ not in self.counters:\n self.counters[typ] = 0\n\n def _extract_entities(self, text: str) -> List[Tuple[str, str]]:\n if not text:\n return []\n results = self.model.predict_entities(\n text, self.labels\n ) # expects dicts with text/label\n found: List[Tuple[str, str]] = []\n for r in results:\n label = self._norm_label(str(r.get(\"label\", \"pii\")))\n surface = self._normalize(str(r.get(\"text\", \"\")))\n if surface:\n found.append((surface, label))\n return found\n\n def obfuscate_text(self, text: str) -> str:\n if not text:\n return text\n entities = self._extract_entities(text)\n if not entities:\n return text\n\n unique_words: Dict[str, str] = {}\n for word, label in entities:\n if word not in self.entity_map:\n replacement = self._next_id(label)\n self.entity_map[word] = replacement\n unique_words[word] = self.entity_map[word]\n\n sorted_pairs = sorted(\n unique_words.items(), key=lambda x: len(x[0]), reverse=True\n )\n\n def replace_once(s: str, old: str, new: str) -> str:\n pattern = re.escape(old)\n return re.sub(pattern, new, s)\n\n obfuscated = text\n for old, new in sorted_pairs:\n obfuscated = replace_once(obfuscated, old, new)\n return obfuscated\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\") # ensure this directory exists before saving\n\n # Choose engine via CLI flag or env var (default: hf)\n parser = argparse.ArgumentParser(description=\"PII obfuscation example\")\n parser.add_argument(\n \"--engine\",\n choices=[\"hf\", \"gliner\"],\n default=os.getenv(\"PII_ENGINE\", \"hf\"),\n help=\"NER engine: 'hf' (Transformers) or 'gliner' (GLiNER)\",\n )\n args = parser.parse_args()\n\n # Ensure output dir exists\n output_dir.mkdir(parents=True, exist_ok=True)\n\n # Keep and generate images so Markdown can embed them\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n conv_res = doc_converter.convert(input_doc_path)\n conv_doc = conv_res.document\n doc_filename = conv_res.input.file.name\n\n # Save markdown with embedded pictures in original text\n md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n # Build NER pipeline and obfuscator\n if args.engine == \"gliner\":\n _log.info(\"Using GLiNER-based AdvancedPIIObfuscator\")\n gliner_model, gliner_labels = _build_gliner_model()\n obfuscator = AdvancedPIIObfuscator(gliner_model, gliner_labels)\n else:\n _log.info(\"Using HF Transformers-based SimplePiiObfuscator\")\n 
ner = _build_simple_ner_pipeline()\n obfuscator = SimplePiiObfuscator(ner)\n\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TextItem):\n element.orig = element.text\n element.text = obfuscator.obfuscate_text(element.text)\n # print(element.orig, \" => \", element.text)\n\n elif isinstance(element, TableItem):\n for cell in element.data.table_cells:\n cell.text = obfuscator.obfuscate_text(cell.text)\n\n # Save markdown with embedded pictures and obfuscated text\n md_filename = output_dir / f\"{doc_filename}-with-images-pii-obfuscated.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n # Optional: log mapping summary\n if obfuscator.entity_map:\n data = []\n for key, val in obfuscator.entity_map.items():\n data.append([key, val])\n\n _log.info(\n f\"Obfuscated entities:\\n\\n{tabulate(data)}\",\n )\n\n\nif __name__ == \"__main__\":\n main()\n import argparse import logging import os import re from pathlib import Path from typing import Dict, List, Tuple from docling_core.types.doc import ImageRefMode, TableItem, TextItem from tabulate import tabulate from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) IMAGE_RESOLUTION_SCALE = 2.0 HF_MODEL = \"dslim/bert-base-NER\" # Swap with another HF NER/PII model if desired, eg https://huggingface.co/urchade/gliner_multi_pii-v1 looks very promising too! GLINER_MODEL = \"urchade/gliner_multi_pii-v1\" def _build_simple_ner_pipeline(): \"\"\"Create a Hugging Face token-classification pipeline for NER. Returns a callable like: ner(text) -> List[dict] \"\"\" try: from transformers import ( AutoModelForTokenClassification, AutoTokenizer, pipeline, ) except Exception: _log.error(\"Transformers not installed. Please run: pip install transformers\") raise tokenizer = AutoTokenizer.from_pretrained(HF_MODEL) model = AutoModelForTokenClassification.from_pretrained(HF_MODEL) ner = pipeline( \"token-classification\", model=model, tokenizer=tokenizer, aggregation_strategy=\"simple\", # groups subwords into complete entities # Note: modern Transformers returns `start`/`end` when possible with aggregation ) return ner class SimplePiiObfuscator: \"\"\"Tracks PII strings and replaces them with stable IDs per entity type.\"\"\" def __init__(self, ner_callable): self.ner = ner_callable self.entity_map: Dict[str, str] = {} self.counters: Dict[str, int] = { \"person\": 0, \"org\": 0, \"location\": 0, \"misc\": 0, } # Map model labels to our coarse types self.label_map = { \"PER\": \"person\", \"PERSON\": \"person\", \"ORG\": \"org\", \"ORGANIZATION\": \"org\", \"LOC\": \"location\", \"LOCATION\": \"location\", \"GPE\": \"location\", # Fallbacks \"MISC\": \"misc\", \"O\": \"misc\", } # Only obfuscate these by default. Adjust as needed. 
self.allowed_types = {\"person\", \"org\", \"location\"} def _next_id(self, typ: str) -> str: self.counters[typ] += 1 return f\"{typ}-{self.counters[typ]}\" def _normalize(self, s: str) -> str: return re.sub(r\"\\s+\", \" \", s).strip() def _extract_entities(self, text: str) -> List[Tuple[str, str]]: \"\"\"Run NER and return a list of (surface_text, type) to obfuscate.\"\"\" if not text: return [] results = self.ner(text) # Collect normalized items with optional span info items = [] for r in results: raw_label = r.get(\"entity_group\") or r.get(\"entity\") or \"MISC\" label = self.label_map.get(raw_label, \"misc\") if label not in self.allowed_types: continue start = r.get(\"start\") end = r.get(\"end\") word = self._normalize(r.get(\"word\") or r.get(\"text\") or \"\") items.append({\"label\": label, \"start\": start, \"end\": end, \"word\": word}) found: List[Tuple[str, str]] = [] # If the pipeline provides character spans, merge consecutive/overlapping # entities of the same type into a single span, then take the substring # from the original text. This handles cases like subword tokenization # where multiple adjacent pieces belong to the same named entity. have_spans = any(i[\"start\"] is not None and i[\"end\"] is not None for i in items) if have_spans: spans = [ i for i in items if i[\"start\"] is not None and i[\"end\"] is not None ] # Ensure processing order by start (then end) spans.sort(key=lambda x: (x[\"start\"], x[\"end\"])) merged = [] for s in spans: if not merged: merged.append(dict(s)) continue last = merged[-1] if s[\"label\"] == last[\"label\"] and s[\"start\"] <= last[\"end\"]: # Merge identical, overlapping, or touching spans of same type last[\"start\"] = min(last[\"start\"], s[\"start\"]) last[\"end\"] = max(last[\"end\"], s[\"end\"]) else: merged.append(dict(s)) for m in merged: surface = self._normalize(text[m[\"start\"] : m[\"end\"]]) if surface: found.append((surface, m[\"label\"])) # Include any items lacking spans as-is (fallback) for i in items: if i[\"start\"] is None or i[\"end\"] is None: if i[\"word\"]: found.append((i[\"word\"], i[\"label\"])) else: # Fallback when spans aren't provided: return normalized words for i in items: if i[\"word\"]: found.append((i[\"word\"], i[\"label\"])) return found def obfuscate_text(self, text: str) -> str: if not text: return text entities = self._extract_entities(text) if not entities: return text # Deduplicate per text, keep stable global mapping unique_words: Dict[str, str] = {} for word, label in entities: if word not in self.entity_map: replacement = self._next_id(label) self.entity_map[word] = replacement unique_words[word] = self.entity_map[word] # Replace longer matches first to avoid partial overlaps sorted_pairs = sorted( unique_words.items(), key=lambda x: len(x[0]), reverse=True ) def replace_once(s: str, old: str, new: str) -> str: # Use simple substring replacement; for stricter matching, use word boundaries # when appropriate (e.g., names). This is a demo, keep it simple. pattern = re.escape(old) return re.sub(pattern, new, s) obfuscated = text for old, new in sorted_pairs: obfuscated = replace_once(obfuscated, old, new) return obfuscated def _build_gliner_model(): \"\"\"Create a GLiNER model for PII-like entity extraction. Returns a tuple (model, labels) where model.predict_entities(text, labels) yields entities with \"text\" and \"label\" fields. \"\"\" try: from gliner import GLiNER # type: ignore except Exception: _log.error( \"GLiNER not installed. 
Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu\" ) raise model = GLiNER.from_pretrained(GLINER_MODEL) # Curated set of labels for PII detection. Adjust as needed. labels = [ # \"work\", \"booking number\", \"personally identifiable information\", \"driver licence\", \"person\", \"full address\", \"company\", # \"actor\", # \"character\", \"email\", \"passport number\", \"Social Security Number\", \"phone number\", ] return model, labels class AdvancedPIIObfuscator: \"\"\"PII obfuscator powered by GLiNER with fine-grained labels. - Uses GLiNER's `predict_entities(text, labels)` to detect entities. - Obfuscates with stable IDs per fine-grained label, e.g. `email-1`. \"\"\" def __init__(self, gliner_model, labels: List[str]): self.model = gliner_model self.labels = labels self.entity_map: Dict[str, str] = {} self.counters: Dict[str, int] = {} def _normalize(self, s: str) -> str: return re.sub(r\"\\s+\", \" \", s).strip() def _norm_label(self, label: str) -> str: return ( re.sub( r\"[^a-z0-9_]+\", \"_\", label.lower().replace(\" \", \"_\").replace(\"-\", \"_\") ).strip(\"_\") or \"pii\" ) def _next_id(self, typ: str) -> str: self.cc(typ) self.counters[typ] += 1 return f\"{typ}-{self.counters[typ]}\" def cc(self, typ: str) -> None: if typ not in self.counters: self.counters[typ] = 0 def _extract_entities(self, text: str) -> List[Tuple[str, str]]: if not text: return [] results = self.model.predict_entities( text, self.labels ) # expects dicts with text/label found: List[Tuple[str, str]] = [] for r in results: label = self._norm_label(str(r.get(\"label\", \"pii\"))) surface = self._normalize(str(r.get(\"text\", \"\"))) if surface: found.append((surface, label)) return found def obfuscate_text(self, text: str) -> str: if not text: return text entities = self._extract_entities(text) if not entities: return text unique_words: Dict[str, str] = {} for word, label in entities: if word not in self.entity_map: replacement = self._next_id(label) self.entity_map[word] = replacement unique_words[word] = self.entity_map[word] sorted_pairs = sorted( unique_words.items(), key=lambda x: len(x[0]), reverse=True ) def replace_once(s: str, old: str, new: str) -> str: pattern = re.escape(old) return re.sub(pattern, new, s) obfuscated = text for old, new in sorted_pairs: obfuscated = replace_once(obfuscated, old, new) return obfuscated def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # ensure this directory exists before saving # Choose engine via CLI flag or env var (default: hf) parser = argparse.ArgumentParser(description=\"PII obfuscation example\") parser.add_argument( \"--engine\", choices=[\"hf\", \"gliner\"], default=os.getenv(\"PII_ENGINE\", \"hf\"), help=\"NER engine: 'hf' (Transformers) or 'gliner' (GLiNER)\", ) args = parser.parse_args() # Ensure output dir exists output_dir.mkdir(parents=True, exist_ok=True) # Keep and generate images so Markdown can embed them pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) conv_res = doc_converter.convert(input_doc_path) conv_doc = conv_res.document doc_filename = conv_res.input.file.name # Save markdown with embedded 
pictures in original text md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) # Build NER pipeline and obfuscator if args.engine == \"gliner\": _log.info(\"Using GLiNER-based AdvancedPIIObfuscator\") gliner_model, gliner_labels = _build_gliner_model() obfuscator = AdvancedPIIObfuscator(gliner_model, gliner_labels) else: _log.info(\"Using HF Transformers-based SimplePiiObfuscator\") ner = _build_simple_ner_pipeline() obfuscator = SimplePiiObfuscator(ner) for element, _level in conv_res.document.iterate_items(): if isinstance(element, TextItem): element.orig = element.text element.text = obfuscator.obfuscate_text(element.text) # print(element.orig, \" => \", element.text) elif isinstance(element, TableItem): for cell in element.data.table_cells: cell.text = obfuscator.obfuscate_text(cell.text) # Save markdown with embedded pictures and obfuscated text md_filename = output_dir / f\"{doc_filename}-with-images-pii-obfuscated.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) # Optional: log mapping summary if obfuscator.entity_map: data = [] for key, val in obfuscator.entity_map.items(): data.append([key, val]) _log.info( f\"Obfuscated entities:\\n\\n{tabulate(data)}\", ) if __name__ == \"__main__\": main()"},{"location":"examples/post_process_ocr_with_vlm/","title":"Post process ocr with vlm","text":"In\u00a0[\u00a0]: Copied! import argparse\nimport logging\nimport os\nimport re\nfrom collections.abc import Iterable\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\nfrom typing import Any, Optional, Union\nimport argparse import logging import os import re from collections.abc import Iterable from concurrent.futures import ThreadPoolExecutor from pathlib import Path from typing import Any, Optional, Union In\u00a0[\u00a0]: Copied!
import numpy as np\nfrom docling_core.types.doc import (\n DoclingDocument,\n ImageRefMode,\n NodeItem,\n TextItem,\n)\nfrom docling_core.types.doc.document import (\n ContentLayer,\n DocItem,\n FormItem,\n GraphCell,\n KeyValueItem,\n PictureItem,\n RichTableCell,\n TableCell,\n TableItem,\n)\nfrom PIL import Image, ImageFilter\nfrom PIL.ImageOps import crop\nfrom pydantic import BaseModel, ConfigDict\nfrom tqdm import tqdm
from docling.backend.json.docling_json_backend import DoclingJSONBackend\nfrom docling.datamodel.accelerator_options import AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import (\n ConvertPipelineOptions,\n PdfPipelineOptions,\n PictureDescriptionApiOptions,\n)\nfrom docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption\nfrom docling.exceptions import OperationNotAllowed\nfrom docling.models.base_model import BaseModelWithOptions, GenericEnrichmentModel\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.utils.api_image_request import api_image_request\nfrom docling.utils.profiling import ProfilingScope, TimeRecorder\nfrom docling.utils.utils import chunkify
Example of applying OCR to a Docling document as a post-processing step with "nanonets-ocr2-3b" via LM Studio. Requires a running LM Studio inference server with the "nanonets-ocr2-3b" model pre-loaded. To run: uv run python docs/examples/post_process_ocr_with_vlm.py
LM_STUDIO_URL = \"http://localhost:1234/v1/chat/completions\"\nLM_STUDIO_MODEL = \"nanonets-ocr2-3b\"
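Before launching the script, it can help to confirm that the LM Studio server is reachable and the model is loaded. A minimal sketch, assuming LM Studio's OpenAI-compatible /v1/models listing is available at the URL above:

import requests

# Hypothetical connectivity check against the local LM Studio server.
resp = requests.get("http://localhost:1234/v1/models", timeout=5)
resp.raise_for_status()
loaded = [m["id"] for m in resp.json().get("data", [])]
print("Models available:", loaded)
if LM_STUDIO_MODEL not in loaded:
    print(f"Load '{LM_STUDIO_MODEL}' in LM Studio before running the pipeline.")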
DEFAULT_PROMPT = \"Extract the text from the above document as if you were reading it naturally. Output pure text, no html and no markdown. Pay attention to line breaks and don't miss text after a line break. Put all text in one line.\"\nVERBOSE = True\nSHOW_IMAGE = False\nSHOW_EMPTY_CROPS = False\nSHOW_NONEMPTY_CROPS = False\nPRINT_RESULT_MARKDOWN = False
def is_empty_fast_with_lines_pil(\n pil_img: Image.Image,\n downscale_max_side: int = 48, # 64\n grad_threshold: float = 15.0, # how strong a gradient must be to count as edge\n min_line_coverage: float = 0.6, # line must cover 60% of height/width\n max_allowed_lines: int = 10, # allow up to this many strong lines (default 4)\n edge_fraction_threshold: float = 0.0035,\n):\n \"\"\"\n Fast 'empty' detector using only PIL + NumPy.\n\n Treats an image as empty if:\n - It has very few edges overall, OR\n - Edges can be explained by at most `max_allowed_lines` long vertical/horizontal lines.\n\n Returns:\n (is_empty: bool, remaining_edge_fraction: float, debug: dict)\n \"\"\"\n\n # 1) Convert to grayscale\n gray = pil_img.convert(\"L\")\n\n # 2) Aggressive downscale, keeping aspect ratio\n w0, h0 = gray.size\n max_side = max(w0, h0)\n if max_side > downscale_max_side:\n # scale = downscale_max_side / max_side\n # new_w = max(1, int(w0 * scale))\n # new_h = max(1, int(h0 * scale))\n\n new_w = downscale_max_side\n new_h = downscale_max_side\n\n gray = gray.resize((new_w, new_h), resample=Image.BILINEAR)\n\n w, h = gray.size\n if w == 0 or h == 0:\n return True, 0.0, {\"reason\": \"zero_size\"}\n\n # 3) Small blur to reduce noise\n gray = gray.filter(ImageFilter.BoxBlur(1))\n\n # 4) Convert to NumPy\n arr = np.asarray(\n gray, dtype=np.float32\n ) # shape (h, w) in PIL, but note: PIL size is (w, h)\n H, W = arr.shape\n\n # 5) Compute simple gradients (forward differences)\n gx = np.zeros_like(arr)\n gy = np.zeros_like(arr)\n\n gx[:, :-1] = arr[:, 1:] - arr[:, :-1] # horizontal differences\n gy[:-1, :] = arr[1:, :] - arr[:-1, :] # vertical differences\n\n mag = np.hypot(gx, gy) # gradient magnitude\n\n # 6) Threshold gradients to get edges (boolean mask)\n edges = mag > grad_threshold\n edge_fraction = edges.mean()\n\n # Quick early-exit: almost no edges => empty\n if edge_fraction < edge_fraction_threshold:\n return True, float(edge_fraction), {\"reason\": \"few_edges\"}\n\n # 7) Detect strong vertical & horizontal lines via edge sums\n col_sum = edges.sum(axis=0) # per column\n row_sum = edges.sum(axis=1) # per row\n\n # Line must have edge pixels in at least `min_line_coverage` of the dimension\n vert_line_cols = np.where(col_sum >= min_line_coverage * H)[0]\n horiz_line_rows = np.where(row_sum >= min_line_coverage * W)[0]\n\n num_lines = len(vert_line_cols) + len(horiz_line_rows)\n\n # If we have more long lines than allowed => non-empty\n if num_lines > max_allowed_lines:\n return (\n False,\n float(edge_fraction),\n {\n \"reason\": \"too_many_lines\",\n \"num_lines\": int(num_lines),\n \"edge_fraction\": float(edge_fraction),\n },\n )\n\n # 8) Mask out those lines and recompute remaining edges\n line_mask = np.zeros_like(edges, dtype=bool)\n if len(vert_line_cols) > 0:\n line_mask[:, vert_line_cols] = True\n if len(horiz_line_rows) > 0:\n line_mask[horiz_line_rows, :] = True\n\n remaining_edges = edges & ~line_mask\n remaining_edge_fraction = remaining_edges.mean()\n\n is_empty = remaining_edge_fraction < edge_fraction_threshold\n\n debug = {\n \"original_edge_fraction\": float(edge_fraction),\n \"remaining_edge_fraction\": float(remaining_edge_fraction),\n \"num_vert_lines\": len(vert_line_cols),\n \"num_horiz_lines\": len(horiz_line_rows),\n }\n return is_empty, float(remaining_edge_fraction), debug\n def is_empty_fast_with_lines_pil( pil_img: Image.Image, downscale_max_side: int = 48, # 64 grad_threshold: float = 15.0, # how strong a gradient must be to count as edge min_line_coverage: 
float = 0.6, # line must cover 60% of height/width max_allowed_lines: int = 10, # allow up to this many strong lines (default 4) edge_fraction_threshold: float = 0.0035, ): \"\"\" Fast 'empty' detector using only PIL + NumPy. Treats an image as empty if: - It has very few edges overall, OR - Edges can be explained by at most `max_allowed_lines` long vertical/horizontal lines. Returns: (is_empty: bool, remaining_edge_fraction: float, debug: dict) \"\"\" # 1) Convert to grayscale gray = pil_img.convert(\"L\") # 2) Aggressive downscale, keeping aspect ratio w0, h0 = gray.size max_side = max(w0, h0) if max_side > downscale_max_side: # scale = downscale_max_side / max_side # new_w = max(1, int(w0 * scale)) # new_h = max(1, int(h0 * scale)) new_w = downscale_max_side new_h = downscale_max_side gray = gray.resize((new_w, new_h), resample=Image.BILINEAR) w, h = gray.size if w == 0 or h == 0: return True, 0.0, {\"reason\": \"zero_size\"} # 3) Small blur to reduce noise gray = gray.filter(ImageFilter.BoxBlur(1)) # 4) Convert to NumPy arr = np.asarray( gray, dtype=np.float32 ) # shape (h, w) in PIL, but note: PIL size is (w, h) H, W = arr.shape # 5) Compute simple gradients (forward differences) gx = np.zeros_like(arr) gy = np.zeros_like(arr) gx[:, :-1] = arr[:, 1:] - arr[:, :-1] # horizontal differences gy[:-1, :] = arr[1:, :] - arr[:-1, :] # vertical differences mag = np.hypot(gx, gy) # gradient magnitude # 6) Threshold gradients to get edges (boolean mask) edges = mag > grad_threshold edge_fraction = edges.mean() # Quick early-exit: almost no edges => empty if edge_fraction < edge_fraction_threshold: return True, float(edge_fraction), {\"reason\": \"few_edges\"} # 7) Detect strong vertical & horizontal lines via edge sums col_sum = edges.sum(axis=0) # per column row_sum = edges.sum(axis=1) # per row # Line must have edge pixels in at least `min_line_coverage` of the dimension vert_line_cols = np.where(col_sum >= min_line_coverage * H)[0] horiz_line_rows = np.where(row_sum >= min_line_coverage * W)[0] num_lines = len(vert_line_cols) + len(horiz_line_rows) # If we have more long lines than allowed => non-empty if num_lines > max_allowed_lines: return ( False, float(edge_fraction), { \"reason\": \"too_many_lines\", \"num_lines\": int(num_lines), \"edge_fraction\": float(edge_fraction), }, ) # 8) Mask out those lines and recompute remaining edges line_mask = np.zeros_like(edges, dtype=bool) if len(vert_line_cols) > 0: line_mask[:, vert_line_cols] = True if len(horiz_line_rows) > 0: line_mask[horiz_line_rows, :] = True remaining_edges = edges & ~line_mask remaining_edge_fraction = remaining_edges.mean() is_empty = remaining_edge_fraction < edge_fraction_threshold debug = { \"original_edge_fraction\": float(edge_fraction), \"remaining_edge_fraction\": float(remaining_edge_fraction), \"num_vert_lines\": len(vert_line_cols), \"num_horiz_lines\": len(horiz_line_rows), } return is_empty, float(remaining_edge_fraction), debug In\u00a0[\u00a0]: Copied! def remove_break_lines(text: str) -> str:\n # Replace any newline types with a single space\n cleaned = re.sub(r\"[\\r\\n]+\", \" \", text)\n # Collapse multiple spaces into one\n cleaned = re.sub(r\"\\s+\", \" \", cleaned)\n return cleaned.strip()\ndef remove_break_lines(text: str) -> str: # Replace any newline types with a single space cleaned = re.sub(r\"[\\r\\n]+\", \" \", text) # Collapse multiple spaces into one cleaned = re.sub(r\"\\s+\", \" \", cleaned) return cleaned.strip() In\u00a0[\u00a0]: Copied!
def safe_crop(img: Image.Image, bbox):\n left, top, right, bottom = bbox\n # Clamp to image boundaries\n left = max(0, min(left, img.width))\n top = max(0, min(top, img.height))\n right = max(0, min(right, img.width))\n bottom = max(0, min(bottom, img.height))\n return img.crop((left, top, right, bottom))
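A quick illustration of the clamping behaviour, with made-up dimensions: a box that spills over the right and bottom edges is reduced to the valid region instead of raising an error.

from PIL import Image

canvas = Image.new("RGB", (100, 80))
# The requested box (50, 40, 150, 120) exceeds the 100x80 canvas...
crop = safe_crop(canvas, (50, 40, 150, 120))
# ...so the effective box becomes (50, 40, 100, 80).
print(crop.size)  # (50, 40)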
def no_long_repeats(s: str, threshold: int) -> bool:\n \"\"\"\n Returns False if the string `s` contains more than `threshold`\n identical characters in a row, otherwise True.\n \"\"\"\n pattern = r\"(.)\\1{\" + str(threshold) + \",}\"\n return re.search(pattern, s) is None
class PostOcrEnrichmentElement(BaseModel):\n model_config = ConfigDict(arbitrary_types_allowed=True)\n\n item: Union[DocItem, TableCell, RichTableCell, GraphCell]\n image: list[\n Image.Image\n ] # Needs to be a list of images for multi-provenance elements
class PostOcrEnrichmentPipelineOptions(ConvertPipelineOptions):\n api_options: PictureDescriptionApiOptions
class PostOcrEnrichmentPipeline(SimplePipeline):\n def __init__(self, pipeline_options: PostOcrEnrichmentPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: PostOcrEnrichmentPipelineOptions\n\n self.enrichment_pipe = [\n PostOcrApiEnrichmentModel(\n enabled=True,\n enable_remote_services=True,\n artifacts_path=None,\n options=self.pipeline_options.api_options,\n accelerator_options=AcceleratorOptions(),\n )\n ]\n\n @classmethod\n def get_default_options(cls) -> PostOcrEnrichmentPipelineOptions:\n return PostOcrEnrichmentPipelineOptions()\n\n def _enrich_document(self, conv_res: ConversionResult) -> ConversionResult:\n def _prepare_elements(\n conv_res: ConversionResult, model: GenericEnrichmentModel[Any]\n ) -> Iterable[NodeItem]:\n for doc_element, _level in conv_res.document.iterate_items(\n traverse_pictures=True,\n included_content_layers={\n ContentLayer.BODY,\n ContentLayer.FURNITURE,\n },\n ): # With all content layers, with traverse_pictures=True\n prepared_elements = (\n model.prepare_element( # make this one yield multiple items.\n conv_res=conv_res, element=doc_element\n )\n )\n if prepared_elements is not None:\n yield prepared_elements\n\n with TimeRecorder(conv_res, \"doc_enrich\", scope=ProfilingScope.DOCUMENT):\n for model in self.enrichment_pipe:\n for element_batch in chunkify(\n _prepare_elements(conv_res, model),\n model.elements_batch_size,\n ):\n for element in model(\n doc=conv_res.document, element_batch=element_batch\n ): # Must exhaust!\n pass\n return conv_res\n class PostOcrEnrichmentPipeline(SimplePipeline): def __init__(self, pipeline_options: PostOcrEnrichmentPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: PostOcrEnrichmentPipelineOptions self.enrichment_pipe = [ PostOcrApiEnrichmentModel( enabled=True, enable_remote_services=True, artifacts_path=None, options=self.pipeline_options.api_options, accelerator_options=AcceleratorOptions(), ) ] @classmethod def get_default_options(cls) -> PostOcrEnrichmentPipelineOptions: return PostOcrEnrichmentPipelineOptions() def _enrich_document(self, conv_res: ConversionResult) -> ConversionResult: def _prepare_elements( conv_res: ConversionResult, model: GenericEnrichmentModel[Any] ) -> Iterable[NodeItem]: for doc_element, _level in conv_res.document.iterate_items( traverse_pictures=True, included_content_layers={ ContentLayer.BODY, ContentLayer.FURNITURE, }, ): # With all content layers, with traverse_pictures=True prepared_elements = ( model.prepare_element( # make this one yield multiple items. conv_res=conv_res, element=doc_element ) ) if prepared_elements is not None: yield prepared_elements with TimeRecorder(conv_res, \"doc_enrich\", scope=ProfilingScope.DOCUMENT): for model in self.enrichment_pipe: for element_batch in chunkify( _prepare_elements(conv_res, model), model.elements_batch_size, ): for element in model( doc=conv_res.document, element_batch=element_batch ): # Must exhaust! pass return conv_res In\u00a0[\u00a0]: Copied! 
class PostOcrApiEnrichmentModel(\n GenericEnrichmentModel[PostOcrEnrichmentElement], BaseModelWithOptions\n):\n expansion_factor: float = 0.001\n\n def prepare_element(\n self, conv_res: ConversionResult, element: NodeItem\n ) -> Optional[list[PostOcrEnrichmentElement]]:\n if not self.is_processable(doc=conv_res.document, element=element):\n return None\n\n allowed = (DocItem, TableItem, GraphCell)\n assert isinstance(element, allowed)\n\n if isinstance(element, (KeyValueItem, FormItem)):\n # Yield from the graphCells inside here.\n result = []\n for c in element.graph.cells:\n element_prov = c.prov # Key / Value have only one provenance!\n bbox = element_prov.bbox\n page_ix = element_prov.page_no\n bbox = bbox.scale_to_size(\n old_size=conv_res.document.pages[page_ix].size,\n new_size=conv_res.document.pages[page_ix].image.size,\n )\n expanded_bbox = bbox.expand_by_scale(\n x_scale=self.expansion_factor, y_scale=self.expansion_factor\n ).to_top_left_origin(\n page_height=conv_res.document.pages[page_ix].image.size.height\n )\n\n good_bbox = True\n if (\n expanded_bbox.l > expanded_bbox.r\n or expanded_bbox.t > expanded_bbox.b\n ):\n good_bbox = False\n\n if good_bbox:\n cropped_image = conv_res.document.pages[\n page_ix\n ].image.pil_image.crop(expanded_bbox.as_tuple())\n\n is_empty, rem_frac, debug = is_empty_fast_with_lines_pil(\n cropped_image\n )\n if is_empty:\n if SHOW_EMPTY_CROPS:\n try:\n cropped_image.show()\n except Exception as e:\n print(f\"Error with image: {e}\")\n print(\n f\"Detected empty form item image crop: {rem_frac} - {debug}\"\n )\n else:\n result.append(\n PostOcrEnrichmentElement(item=c, image=[cropped_image])\n )\n return result\n elif isinstance(element, TableItem):\n element_prov = element.prov[0]\n page_ix = element_prov.page_no\n result = []\n for i, row in enumerate(element.data.grid):\n for j, cell in enumerate(row):\n if hasattr(cell, \"bbox\"):\n if cell.bbox:\n bbox = cell.bbox\n bbox = bbox.scale_to_size(\n old_size=conv_res.document.pages[page_ix].size,\n new_size=conv_res.document.pages[page_ix].image.size,\n )\n\n expanded_bbox = bbox.expand_by_scale(\n x_scale=self.table_cell_expansion_factor,\n y_scale=self.table_cell_expansion_factor,\n ).to_top_left_origin(\n page_height=conv_res.document.pages[\n page_ix\n ].image.size.height\n )\n\n good_bbox = True\n if (\n expanded_bbox.l > expanded_bbox.r\n or expanded_bbox.t > expanded_bbox.b\n ):\n good_bbox = False\n\n if good_bbox:\n cropped_image = conv_res.document.pages[\n page_ix\n ].image.pil_image.crop(expanded_bbox.as_tuple())\n\n is_empty, rem_frac, debug = (\n is_empty_fast_with_lines_pil(cropped_image)\n )\n if is_empty:\n if SHOW_EMPTY_CROPS:\n try:\n cropped_image.show()\n except Exception as e:\n print(f\"Error with image: {e}\")\n print(\n f\"Detected empty table cell image crop: {rem_frac} - {debug}\"\n )\n else:\n if SHOW_NONEMPTY_CROPS:\n cropped_image.show()\n result.append(\n PostOcrEnrichmentElement(\n item=cell, image=[cropped_image]\n )\n )\n return result\n else:\n multiple_crops = []\n # Crop the image form the page\n for element_prov in element.prov:\n # Iterate over provenances\n bbox = element_prov.bbox\n\n page_ix = element_prov.page_no\n bbox = bbox.scale_to_size(\n old_size=conv_res.document.pages[page_ix].size,\n new_size=conv_res.document.pages[page_ix].image.size,\n )\n expanded_bbox = bbox.expand_by_scale(\n x_scale=self.expansion_factor, y_scale=self.expansion_factor\n ).to_top_left_origin(\n page_height=conv_res.document.pages[page_ix].image.size.height\n )\n\n 
good_bbox = True\n if (\n expanded_bbox.l > expanded_bbox.r\n or expanded_bbox.t > expanded_bbox.b\n ):\n good_bbox = False\n\n if hasattr(element, \"text\"):\n if good_bbox:\n cropped_image = conv_res.document.pages[\n page_ix\n ].image.pil_image.crop(expanded_bbox.as_tuple())\n\n is_empty, rem_frac, debug = is_empty_fast_with_lines_pil(\n cropped_image\n )\n if is_empty:\n if SHOW_EMPTY_CROPS:\n try:\n cropped_image.show()\n except Exception as e:\n print(f\"Error with image: {e}\")\n print(f\"Detected empty text crop: {rem_frac} - {debug}\")\n else:\n multiple_crops.append(cropped_image)\n if hasattr(element, \"text\"):\n print(f\"\\nOLD TEXT: {element.text}\")\n else:\n print(\"Not a text element\")\n if len(multiple_crops) > 0:\n # good crops\n return [PostOcrEnrichmentElement(item=element, image=multiple_crops)]\n else:\n # nothing\n return []\n\n @classmethod\n def get_options_type(cls) -> type[PictureDescriptionApiOptions]:\n return PictureDescriptionApiOptions\n\n def __init__(\n self,\n *,\n enabled: bool,\n enable_remote_services: bool,\n artifacts_path: Optional[Union[Path, str]],\n options: PictureDescriptionApiOptions,\n accelerator_options: AcceleratorOptions,\n ):\n self.enabled = enabled\n self.options = options\n self.concurrency = 2\n self.expansion_factor = 0.05\n self.table_cell_expansion_factor = 0.0 # do not modify table cell size\n self.elements_batch_size = 4\n self._accelerator_options = accelerator_options\n self._artifacts_path = (\n Path(artifacts_path) if isinstance(artifacts_path, str) else artifacts_path\n )\n\n if self.enabled and not enable_remote_services:\n raise OperationNotAllowed(\n \"Enable remote services by setting pipeline_options.enable_remote_services=True.\"\n )\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return self.enabled\n\n def _annotate_images(self, images: Iterable[Image.Image]) -> Iterable[str]:\n def _api_request(image: Image.Image) -> str:\n res = api_image_request(\n image=image,\n prompt=self.options.prompt,\n url=self.options.url,\n # timeout=self.options.timeout,\n timeout=30,\n headers=self.options.headers,\n **self.options.params,\n )\n return res[0]\n\n with ThreadPoolExecutor(max_workers=self.concurrency) as executor:\n yield from executor.map(_api_request, images)\n\n def __call__(\n self,\n doc: DoclingDocument,\n element_batch: Iterable[ItemAndImageEnrichmentElement],\n ) -> Iterable[NodeItem]:\n if not self.enabled:\n for element in element_batch:\n yield element.item\n return\n\n elements: list[TextItem] = []\n images: list[Image.Image] = []\n img_ind_per_element: list[int] = []\n\n for element_stack in element_batch:\n for element in element_stack:\n allowed = (DocItem, TableCell, RichTableCell, GraphCell)\n assert isinstance(element.item, allowed)\n for ind, img in enumerate(element.image):\n elements.append(element.item)\n images.append(img)\n # images.append(element.image)\n img_ind_per_element.append(ind)\n\n if not images:\n return\n\n outputs = list(self._annotate_images(images))\n\n for item, output, img_ind in zip(elements, outputs, img_ind_per_element):\n # Sometimes model can return html tags, which are not strictly needed in our, so it's better to clean them\n def clean_html_tags(text):\n for tag in [\n \"<table>\",\n \"<tr>\",\n \"<td>\",\n \"<strong>\",\n \"</table>\",\n \"</tr>\",\n \"</td>\",\n \"</strong>\",\n \"<th>\",\n \"</th>\",\n \"<tbody>\",\n \"<tbody>\",\n \"<thead>\",\n \"</thead>\",\n ]:\n text = text.replace(tag, \"\")\n return text\n\n output = 
clean_html_tags(output).strip()\n output = remove_break_lines(output)\n # The last measure against hallucinations\n # Detect hallucinated string...\n if output.startswith(\"The first of these\"):\n output = \"\"\n\n if no_long_repeats(output, 50):\n if VERBOSE:\n if isinstance(item, (TextItem)):\n print(f\"\\nOLD TEXT: {item.text}\")\n\n # Re-populate text\n if isinstance(item, (TextItem, GraphCell)):\n if img_ind > 0:\n # Concat texts across several provenances\n item.text += \" \" + output\n # item.orig += \" \" + output\n else:\n item.text = output\n # item.orig = output\n elif isinstance(item, (TableCell, RichTableCell)):\n item.text = output\n elif isinstance(item, PictureItem):\n pass\n else:\n raise ValueError(f\"Unknown item type: {type(item)}\")\n\n if VERBOSE:\n if isinstance(item, (TextItem)):\n print(f\"NEW TEXT: {item.text}\")\n\n # Take care of charspans for relevant types\n if isinstance(item, GraphCell):\n item.prov.charspan = (0, len(item.text))\n elif isinstance(item, TextItem):\n item.prov[0].charspan = (0, len(item.text))\n\n yield item\n class PostOcrApiEnrichmentModel( GenericEnrichmentModel[PostOcrEnrichmentElement], BaseModelWithOptions ): expansion_factor: float = 0.001 def prepare_element( self, conv_res: ConversionResult, element: NodeItem ) -> Optional[list[PostOcrEnrichmentElement]]: if not self.is_processable(doc=conv_res.document, element=element): return None allowed = (DocItem, TableItem, GraphCell) assert isinstance(element, allowed) if isinstance(element, (KeyValueItem, FormItem)): # Yield from the graphCells inside here. result = [] for c in element.graph.cells: element_prov = c.prov # Key / Value have only one provenance! bbox = element_prov.bbox page_ix = element_prov.page_no bbox = bbox.scale_to_size( old_size=conv_res.document.pages[page_ix].size, new_size=conv_res.document.pages[page_ix].image.size, ) expanded_bbox = bbox.expand_by_scale( x_scale=self.expansion_factor, y_scale=self.expansion_factor ).to_top_left_origin( page_height=conv_res.document.pages[page_ix].image.size.height ) good_bbox = True if ( expanded_bbox.l > expanded_bbox.r or expanded_bbox.t > expanded_bbox.b ): good_bbox = False if good_bbox: cropped_image = conv_res.document.pages[ page_ix ].image.pil_image.crop(expanded_bbox.as_tuple()) is_empty, rem_frac, debug = is_empty_fast_with_lines_pil( cropped_image ) if is_empty: if SHOW_EMPTY_CROPS: try: cropped_image.show() except Exception as e: print(f\"Error with image: {e}\") print( f\"Detected empty form item image crop: {rem_frac} - {debug}\" ) else: result.append( PostOcrEnrichmentElement(item=c, image=[cropped_image]) ) return result elif isinstance(element, TableItem): element_prov = element.prov[0] page_ix = element_prov.page_no result = [] for i, row in enumerate(element.data.grid): for j, cell in enumerate(row): if hasattr(cell, \"bbox\"): if cell.bbox: bbox = cell.bbox bbox = bbox.scale_to_size( old_size=conv_res.document.pages[page_ix].size, new_size=conv_res.document.pages[page_ix].image.size, ) expanded_bbox = bbox.expand_by_scale( x_scale=self.table_cell_expansion_factor, y_scale=self.table_cell_expansion_factor, ).to_top_left_origin( page_height=conv_res.document.pages[ page_ix ].image.size.height ) good_bbox = True if ( expanded_bbox.l > expanded_bbox.r or expanded_bbox.t > expanded_bbox.b ): good_bbox = False if good_bbox: cropped_image = conv_res.document.pages[ page_ix ].image.pil_image.crop(expanded_bbox.as_tuple()) is_empty, rem_frac, debug = ( is_empty_fast_with_lines_pil(cropped_image) ) if is_empty: if 
SHOW_EMPTY_CROPS: try: cropped_image.show() except Exception as e: print(f\"Error with image: {e}\") print( f\"Detected empty table cell image crop: {rem_frac} - {debug}\" ) else: if SHOW_NONEMPTY_CROPS: cropped_image.show() result.append( PostOcrEnrichmentElement( item=cell, image=[cropped_image] ) ) return result else: multiple_crops = [] # Crop the image form the page for element_prov in element.prov: # Iterate over provenances bbox = element_prov.bbox page_ix = element_prov.page_no bbox = bbox.scale_to_size( old_size=conv_res.document.pages[page_ix].size, new_size=conv_res.document.pages[page_ix].image.size, ) expanded_bbox = bbox.expand_by_scale( x_scale=self.expansion_factor, y_scale=self.expansion_factor ).to_top_left_origin( page_height=conv_res.document.pages[page_ix].image.size.height ) good_bbox = True if ( expanded_bbox.l > expanded_bbox.r or expanded_bbox.t > expanded_bbox.b ): good_bbox = False if hasattr(element, \"text\"): if good_bbox: cropped_image = conv_res.document.pages[ page_ix ].image.pil_image.crop(expanded_bbox.as_tuple()) is_empty, rem_frac, debug = is_empty_fast_with_lines_pil( cropped_image ) if is_empty: if SHOW_EMPTY_CROPS: try: cropped_image.show() except Exception as e: print(f\"Error with image: {e}\") print(f\"Detected empty text crop: {rem_frac} - {debug}\") else: multiple_crops.append(cropped_image) if hasattr(element, \"text\"): print(f\"\\nOLD TEXT: {element.text}\") else: print(\"Not a text element\") if len(multiple_crops) > 0: # good crops return [PostOcrEnrichmentElement(item=element, image=multiple_crops)] else: # nothing return [] @classmethod def get_options_type(cls) -> type[PictureDescriptionApiOptions]: return PictureDescriptionApiOptions def __init__( self, *, enabled: bool, enable_remote_services: bool, artifacts_path: Optional[Union[Path, str]], options: PictureDescriptionApiOptions, accelerator_options: AcceleratorOptions, ): self.enabled = enabled self.options = options self.concurrency = 2 self.expansion_factor = 0.05 self.table_cell_expansion_factor = 0.0 # do not modify table cell size self.elements_batch_size = 4 self._accelerator_options = accelerator_options self._artifacts_path = ( Path(artifacts_path) if isinstance(artifacts_path, str) else artifacts_path ) if self.enabled and not enable_remote_services: raise OperationNotAllowed( \"Enable remote services by setting pipeline_options.enable_remote_services=True.\" ) def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return self.enabled def _annotate_images(self, images: Iterable[Image.Image]) -> Iterable[str]: def _api_request(image: Image.Image) -> str: res = api_image_request( image=image, prompt=self.options.prompt, url=self.options.url, # timeout=self.options.timeout, timeout=30, headers=self.options.headers, **self.options.params, ) return res[0] with ThreadPoolExecutor(max_workers=self.concurrency) as executor: yield from executor.map(_api_request, images) def __call__( self, doc: DoclingDocument, element_batch: Iterable[ItemAndImageEnrichmentElement], ) -> Iterable[NodeItem]: if not self.enabled: for element in element_batch: yield element.item return elements: list[TextItem] = [] images: list[Image.Image] = [] img_ind_per_element: list[int] = [] for element_stack in element_batch: for element in element_stack: allowed = (DocItem, TableCell, RichTableCell, GraphCell) assert isinstance(element.item, allowed) for ind, img in enumerate(element.image): elements.append(element.item) images.append(img) # images.append(element.image) 
img_ind_per_element.append(ind) if not images: return outputs = list(self._annotate_images(images)) for item, output, img_ind in zip(elements, outputs, img_ind_per_element): # Sometimes model can return html tags, which are not strictly needed in our, so it's better to clean them def clean_html_tags(text): for tag in [ \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", ]: text = text.replace(tag, \"\") return text output = clean_html_tags(output).strip() output = remove_break_lines(output) # The last measure against hallucinations # Detect hallucinated string... if output.startswith(\"The first of these\"): output = \"\" if no_long_repeats(output, 50): if VERBOSE: if isinstance(item, (TextItem)): print(f\"\\nOLD TEXT: {item.text}\") # Re-populate text if isinstance(item, (TextItem, GraphCell)): if img_ind > 0: # Concat texts across several provenances item.text += \" \" + output # item.orig += \" \" + output else: item.text = output # item.orig = output elif isinstance(item, (TableCell, RichTableCell)): item.text = output elif isinstance(item, PictureItem): pass else: raise ValueError(f\"Unknown item type: {type(item)}\") if VERBOSE: if isinstance(item, (TextItem)): print(f\"NEW TEXT: {item.text}\") # Take care of charspans for relevant types if isinstance(item, GraphCell): item.prov.charspan = (0, len(item.text)) elif isinstance(item, TextItem): item.prov[0].charspan = (0, len(item.text)) yield item In\u00a0[\u00a0]: Copied! def convert_pdf(pdf_path: Path, out_intermediate_json: Path):\n # Let's prepare a Docling document json with embedded page images\n pipeline_options = PdfPipelineOptions()\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n # pipeline_options.images_scale = 4.0\n pipeline_options.images_scale = 2.0\n\n doc_converter = (\n DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[InputFormat.PDF],\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=StandardPdfPipeline, pipeline_options=pipeline_options\n )\n },\n )\n )\n\n if VERBOSE:\n print(\n \"Converting PDF to get a Docling document json with embedded page images...\"\n )\n conv_result = doc_converter.convert(pdf_path)\n conv_result.document.save_as_json(\n filename=out_intermediate_json, image_mode=ImageRefMode.EMBEDDED\n )\n if PRINT_RESULT_MARKDOWN:\n md1 = conv_result.document.export_to_markdown()\n print(\"*** ORIGINAL MARKDOWN ***\")\n print(md1)\n def convert_pdf(pdf_path: Path, out_intermediate_json: Path): # Let's prepare a Docling document json with embedded page images pipeline_options = PdfPipelineOptions() pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True # pipeline_options.images_scale = 4.0 pipeline_options.images_scale = 2.0 doc_converter = ( DocumentConverter( # all of the below is optional, has internal defaults. allowed_formats=[InputFormat.PDF], format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=StandardPdfPipeline, pipeline_options=pipeline_options ) }, ) ) if VERBOSE: print( \"Converting PDF to get a Docling document json with embedded page images...\" ) conv_result = doc_converter.convert(pdf_path) conv_result.document.save_as_json( filename=out_intermediate_json, image_mode=ImageRefMode.EMBEDDED ) if PRINT_RESULT_MARKDOWN: md1 = conv_result.document.export_to_markdown() print(\"*** ORIGINAL MARKDOWN ***\") print(md1) In\u00a0[\u00a0]: Copied! 
def post_process_json(in_json: Path, out_final_json: Path):\n # Post-Process OCR on top of existing Docling document, per element's bounding box:\n print(f\"Post-process all bounding boxes with OCR... {os.path.basename(in_json)}\")\n pipeline_options = PostOcrEnrichmentPipelineOptions(\n api_options=PictureDescriptionApiOptions(\n url=LM_STUDIO_URL,\n prompt=DEFAULT_PROMPT,\n provenance=\"lm-studio-ocr\",\n batch_size=4,\n concurrency=2,\n scale=2.0,\n params={\"model\": LM_STUDIO_MODEL},\n )\n )\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.JSON_DOCLING: FormatOption(\n pipeline_cls=PostOcrEnrichmentPipeline,\n pipeline_options=pipeline_options,\n backend=DoclingJSONBackend,\n )\n }\n )\n result = doc_converter.convert(in_json)\n if SHOW_IMAGE:\n result.document.pages[1].image.pil_image.show()\n result.document.save_as_json(out_final_json)\n if PRINT_RESULT_MARKDOWN:\n md = result.document.export_to_markdown()\n print(\"*** MARKDOWN ***\")\n print(md)\n def post_process_json(in_json: Path, out_final_json: Path): # Post-Process OCR on top of existing Docling document, per element's bounding box: print(f\"Post-process all bounding boxes with OCR... {os.path.basename(in_json)}\") pipeline_options = PostOcrEnrichmentPipelineOptions( api_options=PictureDescriptionApiOptions( url=LM_STUDIO_URL, prompt=DEFAULT_PROMPT, provenance=\"lm-studio-ocr\", batch_size=4, concurrency=2, scale=2.0, params={\"model\": LM_STUDIO_MODEL}, ) ) doc_converter = DocumentConverter( format_options={ InputFormat.JSON_DOCLING: FormatOption( pipeline_cls=PostOcrEnrichmentPipeline, pipeline_options=pipeline_options, backend=DoclingJSONBackend, ) } ) result = doc_converter.convert(in_json) if SHOW_IMAGE: result.document.pages[1].image.pil_image.show() result.document.save_as_json(out_final_json) if PRINT_RESULT_MARKDOWN: md = result.document.export_to_markdown() print(\"*** MARKDOWN ***\") print(md) In\u00a0[\u00a0]: Copied! def process_pdf(pdf_path: Path, scratch_dir: Path, out_dir: Path):\n inter_json = scratch_dir / (pdf_path.stem + \".json\")\n final_json = out_dir / (pdf_path.stem + \".json\")\n inter_json.parent.mkdir(parents=True, exist_ok=True)\n final_json.parent.mkdir(parents=True, exist_ok=True)\n if final_json.exists() and final_json.stat().st_size > 0:\n print(f\"Result already found here: '{final_json}', aborting...\")\n return # already done\n convert_pdf(pdf_path, inter_json)\n post_process_json(inter_json, final_json)\n def process_pdf(pdf_path: Path, scratch_dir: Path, out_dir: Path): inter_json = scratch_dir / (pdf_path.stem + \".json\") final_json = out_dir / (pdf_path.stem + \".json\") inter_json.parent.mkdir(parents=True, exist_ok=True) final_json.parent.mkdir(parents=True, exist_ok=True) if final_json.exists() and final_json.stat().st_size > 0: print(f\"Result already found here: '{final_json}', aborting...\") return # already done convert_pdf(pdf_path, inter_json) post_process_json(inter_json, final_json) In\u00a0[\u00a0]: Copied! 
def process_json(json_path: Path, out_dir: Path):\n final_json = out_dir / (json_path.stem + \".json\")\n final_json.parent.mkdir(parents=True, exist_ok=True)\n if final_json.exists() and final_json.stat().st_size > 0:\n return # already done\n post_process_json(json_path, final_json)\ndef process_json(json_path: Path, out_dir: Path): final_json = out_dir / (json_path.stem + \".json\") final_json.parent.mkdir(parents=True, exist_ok=True) if final_json.exists() and final_json.stat().st_size > 0: return # already done post_process_json(json_path, final_json) In\u00a0[\u00a0]: Copied!
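For reference, a minimal usage sketch of the helpers defined above, run directly instead of via the CLI in main() further below; the sample path is the notebook's default and can be replaced with your own files:

from pathlib import Path

# Stage 1: convert the PDF into an intermediate Docling JSON with embedded page images;
# Stage 2: post-process every element's bounding box with OCR via the enrichment pipeline.
process_pdf(
    pdf_path=Path("tests/data/pdf/2305.03393v1-pg9.pdf"),
    scratch_dir=Path("scratch") / "temp",
    out_dir=Path("scratch"),
)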
def filter_jsons_by_ocr_list(jsons, folder):\n \"\"\"\n jsons: list[Path] - JSON files\n folder: Path - folder containing ocr_documents.txt\n \"\"\"\n ocr_file = folder / \"ocr_documents.txt\"\n\n # If the file doesn't exist, return the list unchanged\n if not ocr_file.exists():\n return jsons\n\n # Read file names (strip whitespace, ignore empty lines)\n with ocr_file.open(\"r\", encoding=\"utf-8\") as f:\n allowed = {line.strip() for line in f if line.strip()}\n\n # Keep only JSONs whose stem is in allowed list\n filtered = [p for p in jsons if p.stem in allowed]\n return filtered\n def filter_jsons_by_ocr_list(jsons, folder): \"\"\" jsons: list[Path] - JSON files folder: Path - folder containing ocr_documents.txt \"\"\" ocr_file = folder / \"ocr_documents.txt\" # If the file doesn't exist, return the list unchanged if not ocr_file.exists(): return jsons # Read file names (strip whitespace, ignore empty lines) with ocr_file.open(\"r\", encoding=\"utf-8\") as f: allowed = {line.strip() for line in f if line.strip()} # Keep only JSONs whose stem is in allowed list filtered = [p for p in jsons if p.stem in allowed] return filtered In\u00a0[\u00a0]: Copied! def run_jsons(in_path: Path, out_dir: Path):\n if in_path.is_dir():\n jsons = sorted(in_path.glob(\"*.json\"))\n if not jsons:\n raise SystemExit(\"Folder mode expects one or more .json files\")\n # Look for ocr_documents.txt, in case found, respect only the jsons\n filtered_jsons = filter_jsons_by_ocr_list(jsons, in_path)\n for j in tqdm(filtered_jsons):\n print(\"\")\n print(\"Processing file...\")\n print(j)\n process_json(j, out_dir)\n else:\n raise SystemExit(\"Invalid --in path\")\ndef run_jsons(in_path: Path, out_dir: Path): if in_path.is_dir(): jsons = sorted(in_path.glob(\"*.json\")) if not jsons: raise SystemExit(\"Folder mode expects one or more .json files\") # Look for ocr_documents.txt, in case found, respect only the jsons filtered_jsons = filter_jsons_by_ocr_list(jsons, in_path) for j in tqdm(filtered_jsons): print(\"\") print(\"Processing file...\") print(j) process_json(j, out_dir) else: raise SystemExit(\"Invalid --in path\") In\u00a0[\u00a0]: Copied!
def main():\n logging.getLogger().setLevel(logging.ERROR)\n p = argparse.ArgumentParser(description=\"PDF/JSON -> final JSON pipeline\")\n p.add_argument(\n \"--in\",\n dest=\"in_path\",\n default=\"tests/data/pdf/2305.03393v1-pg9.pdf\",\n required=False,\n help=\"Path to a PDF/JSON file or a folder of JSONs\",\n )\n p.add_argument(\n \"--out\",\n dest=\"out_dir\",\n default=\"scratch/\",\n required=False,\n help=\"Folder for final JSONs (scratch goes inside)\",\n )\n args = p.parse_args()\n\n in_path = Path(args.in_path).expanduser().resolve()\n out_dir = Path(args.out_dir).expanduser().resolve()\n print(f\"in_path: {in_path}\")\n print(f\"out_dir: {out_dir}\")\n scratch_dir = out_dir / \"temp\"\n\n if not in_path.exists():\n raise SystemExit(f\"Input not found: {in_path}\")\n\n if in_path.is_file():\n if in_path.suffix.lower() == \".pdf\":\n process_pdf(in_path, scratch_dir, out_dir)\n elif in_path.suffix.lower() == \".json\":\n process_json(in_path, out_dir)\n else:\n raise SystemExit(\"Single-file mode expects a .pdf or .json\")\n else:\n run_jsons(in_path, out_dir)\n def main(): logging.getLogger().setLevel(logging.ERROR) p = argparse.ArgumentParser(description=\"PDF/JSON -> final JSON pipeline\") p.add_argument( \"--in\", dest=\"in_path\", default=\"tests/data/pdf/2305.03393v1-pg9.pdf\", required=False, help=\"Path to a PDF/JSON file or a folder of JSONs\", ) p.add_argument( \"--out\", dest=\"out_dir\", default=\"scratch/\", required=False, help=\"Folder for final JSONs (scratch goes inside)\", ) args = p.parse_args() in_path = Path(args.in_path).expanduser().resolve() out_dir = Path(args.out_dir).expanduser().resolve() print(f\"in_path: {in_path}\") print(f\"out_dir: {out_dir}\") scratch_dir = out_dir / \"temp\" if not in_path.exists(): raise SystemExit(f\"Input not found: {in_path}\") if in_path.is_file(): if in_path.suffix.lower() == \".pdf\": process_pdf(in_path, scratch_dir, out_dir) elif in_path.suffix.lower() == \".json\": process_json(in_path, out_dir) else: raise SystemExit(\"Single-file mode expects a .pdf or .json\") else: run_jsons(in_path, out_dir) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/rag_azuresearch/","title":"RAG with Azure AI Search","text":"Step Tech Execution Embedding Azure OpenAI \ud83c\udf10 Remote Vector Store Azure AI Search \ud83c\udf10 Remote Gen AI Azure OpenAI \ud83c\udf10 Remote In\u00a0[\u00a0]: Copied!
# If running in a fresh environment (like Google Colab), run this single command:\n%pip install \"docling~=2.12\" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv\n# If running in a fresh environment (like Google Colab), run this single command: %pip install \"docling~=2.12\" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv In\u00a0[1]: Copied!
import os\n\nfrom dotenv import load_dotenv\n\nload_dotenv()\n\n\ndef _get_env(key, default=None):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key, default)\n\n\nAZURE_SEARCH_ENDPOINT = _get_env(\"AZURE_SEARCH_ENDPOINT\")\nAZURE_SEARCH_KEY = _get_env(\"AZURE_SEARCH_KEY\") # Ensure this is your Admin Key\nAZURE_SEARCH_INDEX_NAME = _get_env(\"AZURE_SEARCH_INDEX_NAME\", \"docling-rag-sample\")\nAZURE_OPENAI_ENDPOINT = _get_env(\"AZURE_OPENAI_ENDPOINT\")\nAZURE_OPENAI_API_KEY = _get_env(\"AZURE_OPENAI_API_KEY\")\nAZURE_OPENAI_API_VERSION = _get_env(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\")\nAZURE_OPENAI_CHAT_MODEL = _get_env(\n \"AZURE_OPENAI_CHAT_MODEL\"\n) # Using a deployed model named \"gpt-4o\"\nAZURE_OPENAI_EMBEDDINGS = _get_env(\n \"AZURE_OPENAI_EMBEDDINGS\", \"text-embedding-3-small\"\n) # Using a deployed model named \"text-embeddings-3-small\"\nimport os from dotenv import load_dotenv load_dotenv() def _get_env(key, default=None): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key, default) AZURE_SEARCH_ENDPOINT = _get_env(\"AZURE_SEARCH_ENDPOINT\") AZURE_SEARCH_KEY = _get_env(\"AZURE_SEARCH_KEY\") # Ensure this is your Admin Key AZURE_SEARCH_INDEX_NAME = _get_env(\"AZURE_SEARCH_INDEX_NAME\", \"docling-rag-sample\") AZURE_OPENAI_ENDPOINT = _get_env(\"AZURE_OPENAI_ENDPOINT\") AZURE_OPENAI_API_KEY = _get_env(\"AZURE_OPENAI_API_KEY\") AZURE_OPENAI_API_VERSION = _get_env(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\") AZURE_OPENAI_CHAT_MODEL = _get_env( \"AZURE_OPENAI_CHAT_MODEL\" ) # Using a deployed model named \"gpt-4o\" AZURE_OPENAI_EMBEDDINGS = _get_env( \"AZURE_OPENAI_EMBEDDINGS\", \"text-embedding-3-small\" ) # Using a deployed model named \"text-embeddings-3-small\" In\u00a0[11]: Copied!
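As an optional sanity check, a small sketch (not part of the original notebook) that fails early if any of the settings above are missing:

required = {
    "AZURE_SEARCH_ENDPOINT": AZURE_SEARCH_ENDPOINT,
    "AZURE_SEARCH_KEY": AZURE_SEARCH_KEY,
    "AZURE_OPENAI_ENDPOINT": AZURE_OPENAI_ENDPOINT,
    "AZURE_OPENAI_API_KEY": AZURE_OPENAI_API_KEY,
    "AZURE_OPENAI_CHAT_MODEL": AZURE_OPENAI_CHAT_MODEL,
}
# Report every unset value at once rather than failing later, one call at a time
missing = [name for name, value in required.items() if not value]
if missing:
    raise RuntimeError(f"Missing required settings: {', '.join(missing)}")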
from rich.console import Console\nfrom rich.panel import Panel\n\nfrom docling.document_converter import DocumentConverter\n\nconsole = Console()\n\n# This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages\nsource_url = \"https://arxiv.org/pdf/2404.16130\"\n\nconsole.print(\n \"[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]\"\n)\nconverter = DocumentConverter()\nresult = converter.convert(source_url)\n\n# Optional: preview the parsed Markdown\nmd_preview = result.document.export_to_markdown()\nconsole.print(Panel(md_preview[:500] + \"...\", title=\"Docling Markdown Preview\"))\nfrom rich.console import Console from rich.panel import Panel from docling.document_converter import DocumentConverter console = Console() # This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages source_url = \"https://arxiv.org/pdf/2404.16130\" console.print( \"[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]\" ) converter = DocumentConverter() result = converter.convert(source_url) # Optional: preview the parsed Markdown md_preview = result.document.export_to_markdown() console.print(Panel(md_preview[:500] + \"...\", title=\"Docling Markdown Preview\"))
Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Docling Markdown Preview \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 ## From Local to Global: A Graph RAG Approach to Query-Focused Summarization \u2502\n\u2502 \u2502\n\u2502 Darren Edge 1\u2020 \u2502\n\u2502 \u2502\n\u2502 Ha Trinh 1\u2020 \u2502\n\u2502 \u2502\n\u2502 Newman Cheng 2 \u2502\n\u2502 \u2502\n\u2502 Joshua Bradley 2 \u2502\n\u2502 \u2502\n\u2502 Alex Chao 3 \u2502\n\u2502 \u2502\n\u2502 Apurva Mody 3 \u2502\n\u2502 \u2502\n\u2502 Steven Truitt 2 \u2502\n\u2502 \u2502\n\u2502 ## Jonathan Larson 1 \u2502\n\u2502 \u2502\n\u2502 1 Microsoft Research 2 Microsoft Strategic Missions and Technologies 3 Microsoft Office of the CTO \u2502\n\u2502 \u2502\n\u2502 { daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso } @microsoft.com \u2502\n\u2502 \u2502\n\u2502 \u2020 These authors contributed equally to this work \u2502\n\u2502 \u2502\n\u2502 ## Abstract \u2502\n\u2502 \u2502\n\u2502 The use of retrieval-augmented gen... \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n In\u00a0[22]: Copied! from docling.chunking import HierarchicalChunker\n\nchunker = HierarchicalChunker()\ndoc_chunks = list(chunker.chunk(result.document))\n\nall_chunks = []\nfor idx, c in enumerate(doc_chunks):\n chunk_text = c.text\n all_chunks.append((f\"chunk_{idx}\", chunk_text))\n\nconsole.print(f\"Total chunks from PDF: {len(all_chunks)}\")\n from docling.chunking import HierarchicalChunker chunker = HierarchicalChunker() doc_chunks = list(chunker.chunk(result.document)) all_chunks = [] for idx, c in enumerate(doc_chunks): chunk_text = c.text all_chunks.append((f\"chunk_{idx}\", chunk_text)) console.print(f\"Total chunks from PDF: {len(all_chunks)}\") Total chunks from PDF: 106\nIn\u00a0[\u00a0]: Copied!
from azure.core.credentials import AzureKeyCredential\nfrom azure.search.documents.indexes import SearchIndexClient\nfrom azure.search.documents.indexes.models import (\n AzureOpenAIVectorizer,\n AzureOpenAIVectorizerParameters,\n HnswAlgorithmConfiguration,\n SearchableField,\n SearchField,\n SearchFieldDataType,\n SearchIndex,\n SimpleField,\n VectorSearch,\n VectorSearchProfile,\n)\nfrom rich.console import Console\n\nconsole = Console()\n\nVECTOR_DIM = 1536 # Adjust based on your chosen embeddings model\n\nindex_client = SearchIndexClient(\n AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY)\n)\n\n\ndef create_search_index(index_name: str):\n # Define fields\n fields = [\n SimpleField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True),\n SearchableField(name=\"content\", type=SearchFieldDataType.String),\n SearchField(\n name=\"content_vector\",\n type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n searchable=True,\n filterable=False,\n sortable=False,\n facetable=False,\n vector_search_dimensions=VECTOR_DIM,\n vector_search_profile_name=\"default\",\n ),\n ]\n # Vector search config with an AzureOpenAIVectorizer\n vector_search = VectorSearch(\n algorithms=[HnswAlgorithmConfiguration(name=\"default\")],\n profiles=[\n VectorSearchProfile(\n name=\"default\",\n algorithm_configuration_name=\"default\",\n vectorizer_name=\"default\",\n )\n ],\n vectorizers=[\n AzureOpenAIVectorizer(\n vectorizer_name=\"default\",\n parameters=AzureOpenAIVectorizerParameters(\n resource_url=AZURE_OPENAI_ENDPOINT,\n deployment_name=AZURE_OPENAI_EMBEDDINGS,\n model_name=\"text-embedding-3-small\",\n api_key=AZURE_OPENAI_API_KEY,\n ),\n )\n ],\n )\n\n # Create or update the index\n new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)\n try:\n index_client.delete_index(index_name)\n except Exception:\n pass\n\n index_client.create_or_update_index(new_index)\n console.print(f\"Index '{index_name}' created.\")\n\n\ncreate_search_index(AZURE_SEARCH_INDEX_NAME)\n from azure.core.credentials import AzureKeyCredential from azure.search.documents.indexes import SearchIndexClient from azure.search.documents.indexes.models import ( AzureOpenAIVectorizer, AzureOpenAIVectorizerParameters, HnswAlgorithmConfiguration, SearchableField, SearchField, SearchFieldDataType, SearchIndex, SimpleField, VectorSearch, VectorSearchProfile, ) from rich.console import Console console = Console() VECTOR_DIM = 1536 # Adjust based on your chosen embeddings model index_client = SearchIndexClient( AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY) ) def create_search_index(index_name: str): # Define fields fields = [ SimpleField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True), SearchableField(name=\"content\", type=SearchFieldDataType.String), SearchField( name=\"content_vector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), searchable=True, filterable=False, sortable=False, facetable=False, vector_search_dimensions=VECTOR_DIM, vector_search_profile_name=\"default\", ), ] # Vector search config with an AzureOpenAIVectorizer vector_search = VectorSearch( algorithms=[HnswAlgorithmConfiguration(name=\"default\")], profiles=[ VectorSearchProfile( name=\"default\", algorithm_configuration_name=\"default\", vectorizer_name=\"default\", ) ], vectorizers=[ AzureOpenAIVectorizer( vectorizer_name=\"default\", parameters=AzureOpenAIVectorizerParameters( resource_url=AZURE_OPENAI_ENDPOINT, deployment_name=AZURE_OPENAI_EMBEDDINGS, 
model_name=\"text-embedding-3-small\", api_key=AZURE_OPENAI_API_KEY, ), ) ], ) # Create or update the index new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search) try: index_client.delete_index(index_name) except Exception: pass index_client.create_or_update_index(new_index) console.print(f\"Index '{index_name}' created.\") create_search_index(AZURE_SEARCH_INDEX_NAME) Index 'docling-rag-sample-2' created.\nIn\u00a0[28]: Copied!
from azure.search.documents import SearchClient\nfrom openai import AzureOpenAI\n\nsearch_client = SearchClient(\n AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY)\n)\nopenai_client = AzureOpenAI(\n api_key=AZURE_OPENAI_API_KEY,\n api_version=AZURE_OPENAI_API_VERSION,\n azure_endpoint=AZURE_OPENAI_ENDPOINT,\n)\n\n\ndef embed_text(text: str):\n \"\"\"\n Helper to generate embeddings with Azure OpenAI.\n \"\"\"\n response = openai_client.embeddings.create(\n input=text, model=AZURE_OPENAI_EMBEDDINGS\n )\n return response.data[0].embedding\n\n\nupload_docs = []\nfor chunk_id, chunk_text in all_chunks:\n embedding_vector = embed_text(chunk_text)\n upload_docs.append(\n {\n \"chunk_id\": chunk_id,\n \"content\": chunk_text,\n \"content_vector\": embedding_vector,\n }\n )\n\n\nBATCH_SIZE = 50\nfor i in range(0, len(upload_docs), BATCH_SIZE):\n subset = upload_docs[i : i + BATCH_SIZE]\n resp = search_client.upload_documents(documents=subset)\n\n all_succeeded = all(r.succeeded for r in resp)\n console.print(\n f\"Uploaded batch {i} -> {i + len(subset)}; all_succeeded: {all_succeeded}, \"\n f\"first_doc_status_code: {resp[0].status_code}\"\n )\n\nconsole.print(\"All chunks uploaded to Azure Search.\")\n from azure.search.documents import SearchClient from openai import AzureOpenAI search_client = SearchClient( AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY) ) openai_client = AzureOpenAI( api_key=AZURE_OPENAI_API_KEY, api_version=AZURE_OPENAI_API_VERSION, azure_endpoint=AZURE_OPENAI_ENDPOINT, ) def embed_text(text: str): \"\"\" Helper to generate embeddings with Azure OpenAI. \"\"\" response = openai_client.embeddings.create( input=text, model=AZURE_OPENAI_EMBEDDINGS ) return response.data[0].embedding upload_docs = [] for chunk_id, chunk_text in all_chunks: embedding_vector = embed_text(chunk_text) upload_docs.append( { \"chunk_id\": chunk_id, \"content\": chunk_text, \"content_vector\": embedding_vector, } ) BATCH_SIZE = 50 for i in range(0, len(upload_docs), BATCH_SIZE): subset = upload_docs[i : i + BATCH_SIZE] resp = search_client.upload_documents(documents=subset) all_succeeded = all(r.succeeded for r in resp) console.print( f\"Uploaded batch {i} -> {i + len(subset)}; all_succeeded: {all_succeeded}, \" f\"first_doc_status_code: {resp[0].status_code}\" ) console.print(\"All chunks uploaded to Azure Search.\") Uploaded batch 0 -> 50; all_succeeded: True, first_doc_status_code: 201\n
Uploaded batch 50 -> 100; all_succeeded: True, first_doc_status_code: 201\n
Uploaded batch 100 -> 106; all_succeeded: True, first_doc_status_code: 201\n
All chunks uploaded to Azure Search.\nIn\u00a0[29]: Copied!
from typing import Optional\n\nfrom azure.search.documents.models import VectorizableTextQuery\n\n\ndef generate_chat_response(prompt: str, system_message: Optional[str] = None):\n \"\"\"\n Generates a single-turn chat response using Azure OpenAI Chat.\n If you need multi-turn conversation or follow-up queries, you'll have to\n maintain the messages list externally.\n \"\"\"\n messages = []\n if system_message:\n messages.append({\"role\": \"system\", \"content\": system_message})\n messages.append({\"role\": \"user\", \"content\": prompt})\n\n completion = openai_client.chat.completions.create(\n model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7\n )\n return completion.choices[0].message.content\n\n\nuser_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\"\nuser_embed = embed_text(user_query)\n\nvector_query = VectorizableTextQuery(\n text=user_query, # passing in text for a hybrid search\n k_nearest_neighbors=5,\n fields=\"content_vector\",\n)\n\nsearch_results = search_client.search(\n search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10\n)\n\nretrieved_chunks = []\nfor result in search_results:\n snippet = result[\"content\"]\n retrieved_chunks.append(snippet)\n\ncontext_str = \"\\n---\\n\".join(retrieved_chunks)\nrag_prompt = f\"\"\"\nYou are an AI assistant helping answering questions about Microsoft GraphRAG.\nUse ONLY the text below to answer the user's question.\nIf the answer isn't in the text, say you don't know.\n\nContext:\n{context_str}\n\nQuestion: {user_query}\nAnswer:\n\"\"\"\n\nfinal_answer = generate_chat_response(rag_prompt)\n\nconsole.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\"))\nconsole.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\"))\n from typing import Optional from azure.search.documents.models import VectorizableTextQuery def generate_chat_response(prompt: str, system_message: Optional[str] = None): \"\"\" Generates a single-turn chat response using Azure OpenAI Chat. If you need multi-turn conversation or follow-up queries, you'll have to maintain the messages list externally. \"\"\" messages = [] if system_message: messages.append({\"role\": \"system\", \"content\": system_message}) messages.append({\"role\": \"user\", \"content\": prompt}) completion = openai_client.chat.completions.create( model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7 ) return completion.choices[0].message.content user_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\" user_embed = embed_text(user_query) vector_query = VectorizableTextQuery( text=user_query, # passing in text for a hybrid search k_nearest_neighbors=5, fields=\"content_vector\", ) search_results = search_client.search( search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10 ) retrieved_chunks = [] for result in search_results: snippet = result[\"content\"] retrieved_chunks.append(snippet) context_str = \"\\n---\\n\".join(retrieved_chunks) rag_prompt = f\"\"\" You are an AI assistant helping answering questions about Microsoft GraphRAG. Use ONLY the text below to answer the user's question. If the answer isn't in the text, say you don't know. 
Context: {context_str} Question: {user_query} Answer: \"\"\" final_answer = generate_chat_response(rag_prompt) console.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\")) console.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\")) \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 RAG Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 \u2502\n\u2502 You are an AI assistant helping answering questions about Microsoft GraphRAG. \u2502\n\u2502 Use ONLY the text below to answer the user's question. \u2502\n\u2502 If the answer isn't in the text, say you don't know. \u2502\n\u2502 \u2502\n\u2502 Context: \u2502\n\u2502 Community summaries vs. source texts. When comparing community summaries to source texts using Graph RAG, \u2502\n\u2502 community summaries generally provided a small but consistent improvement in answer comprehensiveness and \u2502\n\u2502 diversity, except for root-level summaries. Intermediate-level summaries in the Podcast dataset and low-level \u2502\n\u2502 community summaries in the News dataset achieved comprehensiveness win rates of 57% and 64%, respectively. \u2502\n\u2502 Diversity win rates were 57% for Podcast intermediate-level summaries and 60% for News low-level community \u2502\n\u2502 summaries. Table 3 also illustrates the scalability advantages of Graph RAG compared to source text \u2502\n\u2502 summarization: for low-level community summaries ( C3 ), Graph RAG required 26-33% fewer context tokens, while \u2502\n\u2502 for root-level community summaries ( C0 ), it required over 97% fewer tokens. For a modest drop in performance \u2502\n\u2502 compared with other global methods, root-level Graph RAG offers a highly efficient method for the iterative \u2502\n\u2502 question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness \u2502\n\u2502 (72% win rate) and diversity (62% win rate) over na\u00a8\u0131ve RAG. \u2502\n\u2502 --- \u2502\n\u2502 We have presented a global approach to Graph RAG, combining knowledge graph generation, retrieval-augmented \u2502\n\u2502 generation (RAG), and query-focused summarization (QFS) to support human sensemaking over entire text corpora. \u2502\n\u2502 Initial evaluations show substantial improvements over a na\u00a8\u0131ve RAG baseline for both the comprehensiveness and \u2502\n\u2502 diversity of answers, as well as favorable comparisons to a global but graph-free approach using map-reduce \u2502\n\u2502 source text summarization. For situations requiring many global queries over the same dataset, summaries of \u2502\n\u2502 root-level communities in the entity-based graph index provide a data index that is both superior to na\u00a8\u0131ve RAG \u2502\n\u2502 and achieves competitive performance to other global methods at a fraction of the token cost. \u2502\n\u2502 --- \u2502\n\u2502 Trade-offs of building a graph index . 
We consistently observed Graph RAG achieve the best headto-head results \u2502\n\u2502 against other methods, but in many cases the graph-free approach to global summarization of source texts \u2502\n\u2502 performed competitively. The real-world decision about whether to invest in building a graph index depends on \u2502\n\u2502 multiple factors, including the compute budget, expected number of lifetime queries per dataset, and value \u2502\n\u2502 obtained from other aspects of the graph index (including the generic community summaries and the use of other \u2502\n\u2502 graph-related RAG approaches). \u2502\n\u2502 --- \u2502\n\u2502 Future work . The graph index, rich text annotations, and hierarchical community structure supporting the \u2502\n\u2502 current Graph RAG approach offer many possibilities for refinement and adaptation. This includes RAG approaches \u2502\n\u2502 that operate in a more local manner, via embedding-based matching of user queries and graph annotations, as \u2502\n\u2502 well as the possibility of hybrid RAG schemes that combine embedding-based matching against community reports \u2502\n\u2502 before employing our map-reduce summarization mechanisms. This 'roll-up' operation could also be extended \u2502\n\u2502 across more levels of the community hierarchy, as well as implemented as a more exploratory 'drill down' \u2502\n\u2502 mechanism that follows the information scent contained in higher-level community summaries. \u2502\n\u2502 --- \u2502\n\u2502 Advanced RAG systems include pre-retrieval, retrieval, post-retrieval strategies designed to overcome the \u2502\n\u2502 drawbacks of Na\u00a8\u0131ve RAG, while Modular RAG systems include patterns for iterative and dynamic cycles of \u2502\n\u2502 interleaved retrieval and generation (Gao et al., 2023). Our implementation of Graph RAG incorporates multiple \u2502\n\u2502 concepts related to other systems. For example, our community summaries are a kind of self-memory (Selfmem, \u2502\n\u2502 Cheng et al., 2024) for generation-augmented retrieval (GAR, Mao et al., 2020) that facilitates future \u2502\n\u2502 generation cycles, while our parallel generation of community answers from these summaries is a kind of \u2502\n\u2502 iterative (Iter-RetGen, Shao et al., 2023) or federated (FeB4RAG, Wang et al., 2024) retrieval-generation \u2502\n\u2502 strategy. Other systems have also combined these concepts for multi-document summarization (CAiRE-COVID, Su et \u2502\n\u2502 al., 2020) and multi-hop question answering (ITRG, Feng et al., 2023; IR-CoT, Trivedi et al., 2022; DSP, \u2502\n\u2502 Khattab et al., 2022). Our use of a hierarchical index and summarization also bears resemblance to further \u2502\n\u2502 approaches, such as generating a hierarchical index of text chunks by clustering the vectors of text embeddings \u2502\n\u2502 (RAPTOR, Sarthi et al., 2024) or generating a 'tree of clarifications' to answer multiple interpretations of \u2502\n\u2502 ambiguous questions (Kim et al., 2023). However, none of these iterative or hierarchical approaches use the \u2502\n\u2502 kind of self-generated graph index that enables Graph RAG. \u2502\n\u2502 --- \u2502\n\u2502 The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge \u2502\n\u2502 source enables large language models (LLMs) to answer questions over private and/or previously unseen document \u2502\n\u2502 collections. 
However, RAG fails on global questions directed at an entire text corpus, such as 'What are the \u2502\n\u2502 main themes in the dataset?', since this is inherently a queryfocused summarization (QFS) task, rather than an \u2502\n\u2502 explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by \u2502\n\u2502 typical RAGsystems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to \u2502\n\u2502 question answering over private text corpora that scales with both the generality of user questions and the \u2502\n\u2502 quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two \u2502\n\u2502 stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community \u2502\n\u2502 summaries for all groups of closely-related entities. Given a question, each community summary is used to \u2502\n\u2502 generate a partial response, before all partial responses are again summarized in a final response to the user. \u2502\n\u2502 For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG \u2502\n\u2502 leads to substantial improvements over a na\u00a8\u0131ve RAG baseline for both the comprehensiveness and diversity of \u2502\n\u2502 generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is \u2502\n\u2502 forthcoming at https://aka . ms/graphrag . \u2502\n\u2502 --- \u2502\n\u2502 Given the multi-stage nature of our Graph RAG mechanism, the multiple conditions we wanted to compare, and the \u2502\n\u2502 lack of gold standard answers to our activity-based sensemaking questions, we decided to adopt a head-to-head \u2502\n\u2502 comparison approach using an LLM evaluator. We selected three target metrics capturing qualities that are \u2502\n\u2502 desirable for sensemaking activities, as well as a control metric (directness) used as a indicator of validity. \u2502\n\u2502 Since directness is effectively in opposition to comprehensiveness and diversity, we would not expect any \u2502\n\u2502 method to win across all four metrics. \u2502\n\u2502 --- \u2502\n\u2502 Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This index spans nodes \u2502\n\u2502 (e.g., entities), edges (e.g., relationships), and covariates (e.g., claims) that have been detected, \u2502\n\u2502 extracted, and summarized by LLM prompts tailored to the domain of the dataset. Community detection (e.g., \u2502\n\u2502 Leiden, Traag et al., 2019) is used to partition the graph index into groups of elements (nodes, edges, \u2502\n\u2502 covariates) that the LLM can summarize in parallel at both indexing time and query time. The 'global answer' to \u2502\n\u2502 a given query is produced using a final round of query-focused summarization over all community summaries \u2502\n\u2502 reporting relevance to that query. \u2502\n\u2502 --- \u2502\n\u2502 Retrieval-augmented generation (RAG, Lewis et al., 2020) is an established approach to answering user questions \u2502\n\u2502 over entire datasets, but it is designed for situations where these answers are contained locally within \u2502\n\u2502 regions of text whose retrieval provides sufficient grounding for the generation task. 
Instead, a more \u2502\n\u2502 appropriate task framing is query-focused summarization (QFS, Dang, 2006), and in particular, query-focused \u2502\n\u2502 abstractive summarization that generates natural language summaries and not just concatenated excerpts (Baumel \u2502\n\u2502 et al., 2018; Laskar et al., 2020; Yao et al., 2017) . In recent years, however, such distinctions between \u2502\n\u2502 summarization tasks that are abstractive versus extractive, generic versus query-focused, and single-document \u2502\n\u2502 versus multi-document, have become less relevant. While early applications of the transformer architecture \u2502\n\u2502 showed substantial improvements on the state-of-the-art for all such summarization tasks (Goodwin et al., 2020; \u2502\n\u2502 Laskar et al., 2022; Liu and Lapata, 2019), these tasks are now trivialized by modern LLMs, including the GPT \u2502\n\u2502 (Achiam et al., 2023; Brown et al., 2020), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) series, \u2502\n\u2502 all of which can use in-context learning to summarize any content provided in their context window. \u2502\n\u2502 --- \u2502\n\u2502 community descriptions provide complete coverage of the underlying graph index and the input documents it \u2502\n\u2502 represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: \u2502\n\u2502 first using each community summary to answer the query independently and in parallel, then summarizing all \u2502\n\u2502 relevant partial answers into a final global answer. \u2502\n\u2502 \u2502\n\u2502 Question: What are the main advantages of using the Graph RAG approach for query-focused summarization compared \u2502\n\u2502 to traditional RAG methods? \u2502\n\u2502 Answer: \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 RAG Response \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 The main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG \u2502\n\u2502 methods include: \u2502\n\u2502 \u2502\n\u2502 1. **Improved Comprehensiveness and Diversity**: Graph RAG shows substantial improvements over a na\u00efve RAG \u2502\n\u2502 baseline in terms of the comprehensiveness and diversity of answers. This is particularly beneficial for global \u2502\n\u2502 sensemaking questions over large datasets. \u2502\n\u2502 \u2502\n\u2502 2. **Scalability**: Graph RAG provides scalability advantages, achieving efficient summarization with \u2502\n\u2502 significantly fewer context tokens required. For instance, it requires 26-33% fewer tokens for low-level \u2502\n\u2502 community summaries and over 97% fewer tokens for root-level summaries compared to source text summarization. \u2502\n\u2502 \u2502\n\u2502 3. **Efficiency in Iterative Question Answering**: Root-level Graph RAG offers a highly efficient method for \u2502\n\u2502 iterative question answering, which is crucial for sensemaking activities, with only a modest drop in \u2502\n\u2502 performance compared to other global methods. \u2502\n\u2502 \u2502\n\u2502 4. **Global Query Handling**: It supports handling global queries effectively, as it combines knowledge graph \u2502\n\u2502 generation, retrieval-augmented generation, and query-focused summarization, making it suitable for sensemaking \u2502\n\u2502 over entire text corpora. \u2502\n\u2502 \u2502\n\u2502 5. **Hierarchical Indexing and Summarization**: The use of a hierarchical index and summarization allows for \u2502\n\u2502 efficient processing and summarizing of community summaries into a final global answer, facilitating a \u2502\n\u2502 comprehensive coverage of the underlying graph index and input documents. \u2502\n\u2502 \u2502\n\u2502 6. **Reduced Token Cost**: For situations requiring many global queries over the same dataset, Graph RAG \u2502\n\u2502 achieves competitive performance to other global methods at a fraction of the token cost. 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/rag_azuresearch/#rag-with-azure-ai-search","title":"RAG with Azure AI Search\u00b6","text":""},{"location":"examples/rag_azuresearch/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using Docling for document parsing, Azure AI Search for vector indexing and retrieval, and Azure OpenAI for embeddings and chat completion.
This sample demonstrates how to parse a PDF with Docling, chunk it hierarchically, generate embeddings and index them in Azure AI Search, and perform RAG with Azure OpenAI. To run it, you will need:
Azure AI Search resource
Azure OpenAI resource with a deployed embedding and chat completion model (e.g. text-embedding-3-small and gpt-4o)
Docling 2.12+ installed in a Python 3.8+ environment (docling_core is installed automatically)
A GPU-enabled environment is preferred for faster parsing; Docling 2.12 automatically detects a GPU if one is present.
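If you prefer to select the accelerator explicitly rather than rely on auto-detection, a minimal sketch follows (the import path of the accelerator options may vary slightly between Docling versions):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
# AUTO picks CUDA or MPS when available and falls back to CPU otherwise
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.AUTO
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)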
We\u2019ll parse the Microsoft GraphRAG Research Paper (~15 pages). Parsing should be relatively quick, even on CPU, but it will be faster on a GPU or MPS device if available.
(If you prefer a different document, simply provide a different URL or local file path.)
"},{"location":"examples/rag_azuresearch/#part-2-hierarchical-chunking","title":"Part 2: Hierarchical Chunking\u00b6","text":"We convert the Document into smaller chunks for embedding and indexing. The built-in HierarchicalChunker preserves structure.
We\u2019ll define a vector index in Azure AI Search, then embed each chunk using Azure OpenAI and upload in batches.
"},{"location":"examples/rag_azuresearch/#generate-embeddings-and-upload-to-azure-ai-search","title":"Generate Embeddings and Upload to Azure AI Search\u00b6","text":""},{"location":"examples/rag_azuresearch/#part-4-perform-rag-over-pdf","title":"Part 4: Perform RAG over PDF\u00b6","text":"Combine retrieval from Azure AI Search with Azure OpenAI Chat Completions (aka. grounding your LLM)
"},{"location":"examples/rag_haystack/","title":"RAG with Haystack","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 RemoteThis example leverages the Haystack Docling extension, along with Milvus-based document store and retriever instances, as well as sentence-transformers embeddings.
The presented DoclingConverter component enables you to use various document types in your LLM applications with ease and speed, and to leverage Docling's rich document structure for advanced, document-native grounding.
DoclingConverter supports two different export modes:
ExportType.MARKDOWN: if you want to capture each input document as a separate Haystack document, or ExportType.DOC_CHUNKS (default): if you want each input document to be chunked and each individual chunk captured as a separate Haystack document downstream. The example lets you explore both modes via the EXPORT_TYPE parameter; depending on the value set, the ingestion and RAG pipelines are then set up accordingly.
The generation step uses the Hugging Face Inference API; provide your access token via the environment variable HF_TOKEN. Requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling \"pymilvus[milvus-lite]\" milvus-haystack sentence-transformers python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling \"pymilvus[milvus-lite]\" milvus-haystack sentence-transformers python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom docling_haystack.converter import ExportType\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nPATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nEXPORT_TYPE = ExportType.DOC_CHUNKS\nQUESTION = \"Which are the main AI models in Docling?\"\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nimport os from pathlib import Path from tempfile import mkdtemp from docling_haystack.converter import ExportType from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") PATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" EXPORT_TYPE = ExportType.DOC_CHUNKS QUESTION = \"Which are the main AI models in Docling?\" TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied!
from docling_haystack.converter import DoclingConverter\nfrom haystack import Pipeline\nfrom haystack.components.embedders import (\n SentenceTransformersDocumentEmbedder,\n SentenceTransformersTextEmbedder,\n)\nfrom haystack.components.preprocessors import DocumentSplitter\nfrom haystack.components.writers import DocumentWriter\nfrom milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever\n\nfrom docling.chunking import HybridChunker\n\ndocument_store = MilvusDocumentStore(\n connection_args={\"uri\": MILVUS_URI},\n drop_old=True,\n text_field=\"txt\", # set for preventing conflict with same-name metadata field\n)\n\nidx_pipe = Pipeline()\nidx_pipe.add_component(\n \"converter\",\n DoclingConverter(\n export_type=EXPORT_TYPE,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n ),\n)\nidx_pipe.add_component(\n \"embedder\",\n SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),\n)\nidx_pipe.add_component(\"writer\", DocumentWriter(document_store=document_store))\nif EXPORT_TYPE == ExportType.DOC_CHUNKS:\n idx_pipe.connect(\"converter\", \"embedder\")\nelif EXPORT_TYPE == ExportType.MARKDOWN:\n idx_pipe.add_component(\n \"splitter\",\n DocumentSplitter(split_by=\"sentence\", split_length=1),\n )\n idx_pipe.connect(\"converter.documents\", \"splitter.documents\")\n idx_pipe.connect(\"splitter.documents\", \"embedder.documents\")\nelse:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\nidx_pipe.connect(\"embedder\", \"writer\")\nidx_pipe.run({\"converter\": {\"paths\": PATHS}})\n from docling_haystack.converter import DoclingConverter from haystack import Pipeline from haystack.components.embedders import ( SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder, ) from haystack.components.preprocessors import DocumentSplitter from haystack.components.writers import DocumentWriter from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever from docling.chunking import HybridChunker document_store = MilvusDocumentStore( connection_args={\"uri\": MILVUS_URI}, drop_old=True, text_field=\"txt\", # set for preventing conflict with same-name metadata field ) idx_pipe = Pipeline() idx_pipe.add_component( \"converter\", DoclingConverter( export_type=EXPORT_TYPE, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ), ) idx_pipe.add_component( \"embedder\", SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID), ) idx_pipe.add_component(\"writer\", DocumentWriter(document_store=document_store)) if EXPORT_TYPE == ExportType.DOC_CHUNKS: idx_pipe.connect(\"converter\", \"embedder\") elif EXPORT_TYPE == ExportType.MARKDOWN: idx_pipe.add_component( \"splitter\", DocumentSplitter(split_by=\"sentence\", split_length=1), ) idx_pipe.connect(\"converter.documents\", \"splitter.documents\") idx_pipe.connect(\"splitter.documents\", \"embedder.documents\") else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\") idx_pipe.connect(\"embedder\", \"writer\") idx_pipe.run({\"converter\": {\"paths\": PATHS}}) Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n
Batches: 0%| | 0/2 [00:00<?, ?it/s]Out[3]:
{'writer': {'documents_written': 54}} In\u00a0[4]: Copied! from haystack.components.builders import AnswerBuilder\nfrom haystack.components.builders.prompt_builder import PromptBuilder\nfrom haystack.components.generators import HuggingFaceAPIGenerator\nfrom haystack.utils import Secret\n\nprompt_template = \"\"\"\n Given these documents, answer the question.\n Documents:\n {% for doc in documents %}\n {{ doc.content }}\n {% endfor %}\n Question: {{query}}\n Answer:\n \"\"\"\n\nrag_pipe = Pipeline()\nrag_pipe.add_component(\n \"embedder\",\n SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),\n)\nrag_pipe.add_component(\n \"retriever\",\n MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),\n)\nrag_pipe.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template))\nrag_pipe.add_component(\n \"llm\",\n HuggingFaceAPIGenerator(\n api_type=\"serverless_inference_api\",\n api_params={\"model\": GENERATION_MODEL_ID},\n token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,\n ),\n)\nrag_pipe.add_component(\"answer_builder\", AnswerBuilder())\nrag_pipe.connect(\"embedder.embedding\", \"retriever\")\nrag_pipe.connect(\"retriever\", \"prompt_builder.documents\")\nrag_pipe.connect(\"prompt_builder\", \"llm\")\nrag_pipe.connect(\"llm.replies\", \"answer_builder.replies\")\nrag_pipe.connect(\"llm.meta\", \"answer_builder.meta\")\nrag_pipe.connect(\"retriever\", \"answer_builder.documents\")\nrag_res = rag_pipe.run(\n {\n \"embedder\": {\"text\": QUESTION},\n \"prompt_builder\": {\"query\": QUESTION},\n \"answer_builder\": {\"query\": QUESTION},\n }\n)\n from haystack.components.builders import AnswerBuilder from haystack.components.builders.prompt_builder import PromptBuilder from haystack.components.generators import HuggingFaceAPIGenerator from haystack.utils import Secret prompt_template = \"\"\" Given these documents, answer the question. Documents: {% for doc in documents %} {{ doc.content }} {% endfor %} Question: {{query}} Answer: \"\"\" rag_pipe = Pipeline() rag_pipe.add_component( \"embedder\", SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID), ) rag_pipe.add_component( \"retriever\", MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K), ) rag_pipe.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template)) rag_pipe.add_component( \"llm\", HuggingFaceAPIGenerator( api_type=\"serverless_inference_api\", api_params={\"model\": GENERATION_MODEL_ID}, token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None, ), ) rag_pipe.add_component(\"answer_builder\", AnswerBuilder()) rag_pipe.connect(\"embedder.embedding\", \"retriever\") rag_pipe.connect(\"retriever\", \"prompt_builder.documents\") rag_pipe.connect(\"prompt_builder\", \"llm\") rag_pipe.connect(\"llm.replies\", \"answer_builder.replies\") rag_pipe.connect(\"llm.meta\", \"answer_builder.meta\") rag_pipe.connect(\"retriever\", \"answer_builder.documents\") rag_res = rag_pipe.run( { \"embedder\": {\"text\": QUESTION}, \"prompt_builder\": {\"query\": QUESTION}, \"answer_builder\": {\"query\": QUESTION}, } ) Batches: 0%| | 0/1 [00:00<?, ?it/s]
/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n warnings.warn(\n
Below we print out the RAG results. If you have used ExportType.DOC_CHUNKS, notice how the sources contain document-level grounding (e.g. page number or bounding box information):
from docling.chunking import DocChunk\n\nprint(f\"Question:\\n{QUESTION}\\n\")\nprint(f\"Answer:\\n{rag_res['answer_builder']['answers'][0].data.strip()}\\n\")\nprint(\"Sources:\")\nsources = rag_res[\"answer_builder\"][\"answers\"][0].documents\nfor source in sources:\n if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n doc_chunk = DocChunk.model_validate(source.meta[\"dl_meta\"])\n print(f\"- text: {doc_chunk.text!r}\")\n if doc_chunk.meta.origin:\n print(f\" file: {doc_chunk.meta.origin.filename}\")\n if doc_chunk.meta.headings:\n print(f\" section: {' / '.join(doc_chunk.meta.headings)}\")\n bbox = doc_chunk.meta.doc_items[0].prov[0].bbox\n print(\n f\" page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, \"\n f\"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]\"\n )\n elif EXPORT_TYPE == ExportType.MARKDOWN:\n print(repr(source.content))\n else:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\n from docling.chunking import DocChunk print(f\"Question:\\n{QUESTION}\\n\") print(f\"Answer:\\n{rag_res['answer_builder']['answers'][0].data.strip()}\\n\") print(\"Sources:\") sources = rag_res[\"answer_builder\"][\"answers\"][0].documents for source in sources: if EXPORT_TYPE == ExportType.DOC_CHUNKS: doc_chunk = DocChunk.model_validate(source.meta[\"dl_meta\"]) print(f\"- text: {doc_chunk.text!r}\") if doc_chunk.meta.origin: print(f\" file: {doc_chunk.meta.origin.filename}\") if doc_chunk.meta.headings: print(f\" section: {' / '.join(doc_chunk.meta.headings)}\") bbox = doc_chunk.meta.doc_items[0].prov[0].bbox print( f\" page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, \" f\"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]\" ) elif EXPORT_TYPE == ExportType.MARKDOWN: print(repr(source.content)) else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\") Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nThe main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, while TableFormer is a state-of-the-art table structure recognition model. These models are provided with pre-trained weights and a separate package for the inference code as docling-ibm-models. They are also used in the open-access deepsearch-experience, a cloud-native service for knowledge exploration tasks. Additionally, Docling plans to extend its model library with a figure-classifier model, an equation-recognition model, a code-recognition model, and more in the future.\n\nSources:\n- text: 'As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.'\n file: 2408.09869v5.pdf\n section: 3.2 AI models\n page: 3, bounding box: [107, 406, 504, 330]\n- text: 'Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). 
Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.'\n file: 2408.09869v5.pdf\n section: 3 Processing pipeline\n page: 2, bounding box: [107, 273, 504, 176]\n- text: 'Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.\\nWe encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.'\n section: 6 Future work and contributions\n page: 5, bounding box: [106, 323, 504, 258]\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/rag_haystack/#rag-with-haystack","title":"RAG with Haystack\u00b6","text":""},{"location":"examples/rag_haystack/#overview","title":"Overview\u00b6","text":""},{"location":"examples/rag_haystack/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_haystack/#indexing-pipeline","title":"Indexing pipeline\u00b6","text":""},{"location":"examples/rag_haystack/#rag-pipeline","title":"RAG pipeline\u00b6","text":""},{"location":"examples/rag_langchain/","title":"RAG with LangChain","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote
This example leverages the LangChain Docling integration, along with a Milvus vector store and sentence-transformers embeddings.
The presented DoclingLoader component enables you to:
DoclingLoader supports two different export modes:
ExportType.MARKDOWN: if you want to capture each input document as a separate LangChain document, or ExportType.DOC_CHUNKS (default): if you want to have each input document chunked and then capture each individual chunk as a separate LangChain document downstream. The example allows exploring both modes via the EXPORT_TYPE parameter; depending on the value set, the example pipeline is then set up accordingly.
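For orientation, here is a minimal sketch contrasting the two modes; it reuses the DoclingLoader and ExportType imports shown later in this example, and the file path is just a placeholder:

from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

# One LangChain document per Docling chunk (default behavior):
chunk_loader = DoclingLoader(
    file_path="https://arxiv.org/pdf/2408.09869",
    export_type=ExportType.DOC_CHUNKS,
)

# One LangChain document per input file, exported as Markdown
# (to be split downstream, e.g. with a Markdown header splitter):
markdown_loader = DoclingLoader(
    file_path="https://arxiv.org/pdf/2408.09869",
    export_type=ExportType.MARKDOWN,
)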
HF_TOKEN. Requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom dotenv import load_dotenv\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_docling.loader import ExportType\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nFILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nEXPORT_TYPE = ExportType.DOC_CHUNKS\nQUESTION = \"Which are the main AI models in Docling?\"\nPROMPT = PromptTemplate.from_template(\n \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n)\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n import os from pathlib import Path from tempfile import mkdtemp from dotenv import load_dotenv from langchain_core.prompts import PromptTemplate from langchain_docling.loader import ExportType def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") FILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" EXPORT_TYPE = ExportType.DOC_CHUNKS QUESTION = \"Which are the main AI models in Docling?\" PROMPT = PromptTemplate.from_template( \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\", ) TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied! from langchain_docling import DoclingLoader\n\nfrom docling.chunking import HybridChunker\n\nloader = DoclingLoader(\n file_path=FILE_PATH,\n export_type=EXPORT_TYPE,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n)\n\ndocs = loader.load()\nfrom langchain_docling import DoclingLoader from docling.chunking import HybridChunker loader = DoclingLoader( file_path=FILE_PATH, export_type=EXPORT_TYPE, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ) docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n
Note: a message saying \"Token indices sequence length is longer than the specified maximum sequence length...\" can be ignored in this case \u2014 details here.
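If you would rather avoid the warning entirely, one option is to cap the chunk size when constructing the chunker. The following is a minimal sketch, assuming your installed docling-core version exposes a max_tokens argument on HybridChunker:

from docling.chunking import HybridChunker

# Cap chunks at the embedding model's sequence length (512 for all-MiniLM-L6-v2).
# NOTE: `max_tokens` is an assumption about the installed docling-core version.
chunker = HybridChunker(tokenizer=EMBED_MODEL_ID, max_tokens=512)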
Determining the splits:
In\u00a0[4]: Copied!if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n splits = docs\nelif EXPORT_TYPE == ExportType.MARKDOWN:\n from langchain_text_splitters import MarkdownHeaderTextSplitter\n\n splitter = MarkdownHeaderTextSplitter(\n headers_to_split_on=[\n (\"#\", \"Header_1\"),\n (\"##\", \"Header_2\"),\n (\"###\", \"Header_3\"),\n ],\n )\n splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]\nelse:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\n if EXPORT_TYPE == ExportType.DOC_CHUNKS: splits = docs elif EXPORT_TYPE == ExportType.MARKDOWN: from langchain_text_splitters import MarkdownHeaderTextSplitter splitter = MarkdownHeaderTextSplitter( headers_to_split_on=[ (\"#\", \"Header_1\"), (\"##\", \"Header_2\"), (\"###\", \"Header_3\"), ], ) splits = [split for doc in docs for split in splitter.split_text(doc.page_content)] else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\") Inspecting some sample splits:
In\u00a0[5]: Copied!for d in splits[:3]:\n print(f\"- {d.page_content=}\")\nprint(\"...\")\n for d in splits[:3]: print(f\"- {d.page_content=}\") print(\"...\") - d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'\n- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R\u00a8uschlikon, Switzerland'\n- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n...\nIn\u00a0[6]: Copied!
import json\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom langchain_huggingface.embeddings import HuggingFaceEmbeddings\nfrom langchain_milvus import Milvus\n\nembedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n\n\nmilvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\nvectorstore = Milvus.from_documents(\n documents=splits,\n embedding=embedding,\n collection_name=\"docling_demo\",\n connection_args={\"uri\": milvus_uri},\n index_params={\"index_type\": \"FLAT\"},\n drop_old=True,\n)\n import json from pathlib import Path from tempfile import mkdtemp from langchain_huggingface.embeddings import HuggingFaceEmbeddings from langchain_milvus import Milvus embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID) milvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed vectorstore = Milvus.from_documents( documents=splits, embedding=embedding, collection_name=\"docling_demo\", connection_args={\"uri\": milvus_uri}, index_params={\"index_type\": \"FLAT\"}, drop_old=True, ) In\u00a0[7]: Copied! from langchain.chains import create_retrieval_chain\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain_huggingface import HuggingFaceEndpoint\n\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\nllm = HuggingFaceEndpoint(\n repo_id=GEN_MODEL_ID,\n huggingfacehub_api_token=HF_TOKEN,\n)\n\n\ndef clip_text(text, threshold=100):\n return f\"{text[:threshold]}...\" if len(text) > threshold else text\n from langchain.chains import create_retrieval_chain from langchain.chains.combine_documents import create_stuff_documents_chain from langchain_huggingface import HuggingFaceEndpoint retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K}) llm = HuggingFaceEndpoint( repo_id=GEN_MODEL_ID, huggingfacehub_api_token=HF_TOKEN, ) def clip_text(text, threshold=100): return f\"{text[:threshold]}...\" if len(text) > threshold else text Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\nIn\u00a0[8]: Copied!
question_answer_chain = create_stuff_documents_chain(llm, PROMPT)\nrag_chain = create_retrieval_chain(retriever, question_answer_chain)\nresp_dict = rag_chain.invoke({\"input\": QUESTION})\n\nclipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\nprint(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\nfor i, doc in enumerate(resp_dict[\"context\"]):\n print()\n print(f\"Source {i + 1}:\")\n print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n for key in doc.metadata:\n if key != \"pk\":\n val = doc.metadata.get(key)\n clipped_val = clip_text(val) if isinstance(val, str) else val\n print(f\" {key}: {clipped_val}\")\n question_answer_chain = create_stuff_documents_chain(llm, PROMPT) rag_chain = create_retrieval_chain(retriever, question_answer_chain) resp_dict = rag_chain.invoke({\"input\": QUESTION}) clipped_answer = clip_text(resp_dict[\"answer\"], threshold=200) print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\") for i, doc in enumerate(resp_dict[\"context\"]): print() print(f\"Source {i + 1}:\") print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\") for key in doc.metadata: if key != \"pk\": val = doc.metadata.get(key) clipped_val = clip_text(val) if isinstance(val, str) else val print(f\" {key}: {clipped_val}\") Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nDocling initially releases two AI models, a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art tab...\n\nSource 1:\n text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n\nSource 2:\n text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). 
Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n\nSource 3:\n text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/rag_langchain/#rag-with-langchain","title":"RAG with LangChain\u00b6","text":""},{"location":"examples/rag_langchain/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_langchain/#document-loading","title":"Document loading\u00b6","text":"
Now we can instantiate our loader and load documents.
"},{"location":"examples/rag_langchain/#ingestion","title":"Ingestion\u00b6","text":""},{"location":"examples/rag_langchain/#rag","title":"RAG\u00b6","text":""},{"location":"examples/rag_llamaindex/","title":"RAG with LlamaIndex","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 RemoteThis example leverages the official LlamaIndex Docling extension.
The presented extensions DoclingReader and DoclingNodeParser enable you to:
HF_TOKEN. Requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\nfrom warnings import filterwarnings\n\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\nfilterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\nfilterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\nimport os from pathlib import Path from tempfile import mkdtemp from warnings import filterwarnings from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\") filterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\") # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"
We can now define the main parameters:
In\u00a0[3]: Copied!from llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\nEMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nGEN_MODEL = HuggingFaceInferenceAPI(\n token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n)\nSOURCE = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report\nQUERY = \"Which are the main AI models in Docling?\"\n\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI EMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\") MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") GEN_MODEL = HuggingFaceInferenceAPI( token=_get_env_from_colab_or_os(\"HF_TOKEN\"), model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", ) SOURCE = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report QUERY = \"Which are the main AI models in Docling?\" embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))
To create a simple RAG pipeline, we can:
DoclingReader, which by default exports to Markdown, andMarkdownNodeParserfrom llama_index.core import StorageContext, VectorStoreIndex\nfrom llama_index.core.node_parser import MarkdownNodeParser\nfrom llama_index.readers.docling import DoclingReader\nfrom llama_index.vector_stores.milvus import MilvusVectorStore\n\nreader = DoclingReader()\nnode_parser = MarkdownNodeParser()\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n from llama_index.core import StorageContext, VectorStoreIndex from llama_index.core.node_parser import MarkdownNodeParser from llama_index.readers.docling import DoclingReader from llama_index.vector_stores.milvus import MilvusVectorStore reader = DoclingReader() node_parser = MarkdownNodeParser() vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) Q: Which are the main AI models in Docling?\nA: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.\n\nSources:\n
[('3.2 AI models\\n\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'Header_2': '3.2 AI models'}),\n (\"5 Applications\\n\\nThanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.\",\n {'Header_2': '5 Applications'})] To leverage Docling's rich native format, we:
create a DoclingReader with JSON export type, and employ a DoclingNodeParser in order to appropriately parse that Docling format. Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):
In\u00a0[5]: Copied!from llama_index.node_parser.docling import DoclingNodeParser\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\nnode_parser = DoclingNodeParser()\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n from llama_index.node_parser.docling import DoclingNodeParser reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) node_parser = DoclingNodeParser() vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) Q: Which are the main AI models in Docling?\nA: The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table structure recognition model.\n\nSources:\n
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/34',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 3,\n 'bbox': {'l': 107.07593536376953,\n 't': 406.1695251464844,\n 'r': 504.1148681640625,\n 'b': 330.2677307128906,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 608]}]}],\n 'headings': ['3.2 AI models'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869v3.pdf'}}),\n ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/9',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 1,\n 'bbox': {'l': 107.0031967163086,\n 't': 136.7283935546875,\n 'r': 504.04998779296875,\n 'b': 83.30133056640625,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 488]}]}],\n 'headings': ['1 Introduction'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869v3.pdf'}})] To demonstrate this usage pattern, we first set up a test document directory.
In\u00a0[6]: Copied!from pathlib import Path\nfrom tempfile import mkdtemp\n\nimport requests\n\ntmp_dir_path = Path(mkdtemp())\nr = requests.get(SOURCE)\nwith open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n out_file.write(r.content)\n from pathlib import Path from tempfile import mkdtemp import requests tmp_dir_path = Path(mkdtemp()) r = requests.get(SOURCE) with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file: out_file.write(r.content) Using the reader and node_parser definitions from any of the above variants, usage with SimpleDirectoryReader then looks as follows:
from llama_index.core import SimpleDirectoryReader\n\ndir_reader = SimpleDirectoryReader(\n input_dir=tmp_dir_path,\n file_extractor={\".pdf\": reader},\n)\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=dir_reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n from llama_index.core import SimpleDirectoryReader dir_reader = SimpleDirectoryReader( input_dir=tmp_dir_path, file_extractor={\".pdf\": reader}, ) vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=dir_reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) Loading files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:11<00:00, 11.27s/file]\n
Q: Which are the main AI models in Docling?\nA: 1. A layout analysis model, an accurate object-detector for page elements. 2. TableFormer, a state-of-the-art table structure recognition model.\n\nSources:\n
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n 'file_name': '2408.09869.pdf',\n 'file_type': 'application/pdf',\n 'file_size': 5566574,\n 'creation_date': '2024-10-28',\n 'last_modified_date': '2024-10-28',\n 'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/34',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 3,\n 'bbox': {'l': 107.07593536376953,\n 't': 406.1695251464844,\n 'r': 504.1148681640625,\n 'b': 330.2677307128906,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 608]}]}],\n 'headings': ['3.2 AI models'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869.pdf'}}),\n ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n 'file_name': '2408.09869.pdf',\n 'file_type': 'application/pdf',\n 'file_size': 5566574,\n 'creation_date': '2024-10-28',\n 'last_modified_date': '2024-10-28',\n 'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/9',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 1,\n 'bbox': {'l': 107.0031967163086,\n 't': 136.7283935546875,\n 'r': 504.04998779296875,\n 'b': 83.30133056640625,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 488]}]}],\n 'headings': ['1 Introduction'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869.pdf'}})] In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/rag_llamaindex/#rag-with-llamaindex","title":"RAG with LlamaIndex\u00b6","text":""},{"location":"examples/rag_llamaindex/#overview","title":"Overview\u00b6","text":""},{"location":"examples/rag_llamaindex/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_llamaindex/#using-markdown-export","title":"Using Markdown export\u00b6","text":""},{"location":"examples/rag_llamaindex/#using-docling-format","title":"Using Docling format\u00b6","text":""},{"location":"examples/rag_llamaindex/#with-simple-directory-reader","title":"With Simple Directory Reader\u00b6","text":""},{"location":"examples/rag_milvus/","title":"RAG with Milvus","text":"In\u00a0[\u00a0]: Copied!
! pip install --upgrade \"pymilvus[milvus-lite]\" docling openai torch\n! pip install --upgrade \"pymilvus[milvus-lite]\" docling openai torch
If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click on the \"Runtime\" menu at the top of the screen and select \"Restart session\" from the dropdown menu).
Part of what makes Docling so remarkable is the fact that it can run on commodity hardware, which means this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with an Apple silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
In\u00a0[1]: Copied!import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n import torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) MPS GPU is enabled.\nIn\u00a0[2]: Copied!
import os\n\nos.environ[\"OPENAI_API_KEY\"] = \"sk-***********\"\nimport os os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\" In\u00a0[3]: Copied!
from openai import OpenAI\n\nopenai_client = OpenAI()\nfrom openai import OpenAI openai_client = OpenAI()
Define a function to generate text embeddings using the OpenAI client. We use the text-embedding-3-small model as an example.
In\u00a0[4]: Copied!def emb_text(text):\n return (\n openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\")\n .data[0]\n .embedding\n )\ndef emb_text(text): return ( openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\") .data[0] .embedding )
Generate a test embedding and print its dimension and first few elements.
In\u00a0[5]: Copied!test_embedding = emb_text(\"This is a test\")\nembedding_dim = len(test_embedding)\nprint(embedding_dim)\nprint(test_embedding[:10])\ntest_embedding = emb_text(\"This is a test\") embedding_dim = len(test_embedding) print(embedding_dim) print(test_embedding[:10])
1536\n[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]\n
In this tutorial, we will use a Markdown file (source) as the input. We will process the document using a HierarchicalChunker provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks.
In\u00a0[6]: Copied!from docling_core.transforms.chunker import HierarchicalChunker\n\nfrom docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\nchunker = HierarchicalChunker()\n\n# Convert the input file to Docling Document\nsource = \"https://milvus.io/docs/overview.md\"\ndoc = converter.convert(source).document\n\n# Perform hierarchical chunking\ntexts = [chunk.text for chunk in chunker.chunk(doc)]\nfrom docling_core.transforms.chunker import HierarchicalChunker from docling.document_converter import DocumentConverter converter = DocumentConverter() chunker = HierarchicalChunker() # Convert the input file to Docling Document source = \"https://milvus.io/docs/overview.md\" doc = converter.convert(source).document # Perform hierarchical chunking texts = [chunk.text for chunk in chunker.chunk(doc)] In\u00a0[7]: Copied!
from pymilvus import MilvusClient\n\nmilvus_client = MilvusClient(uri=\"./milvus_demo.db\")\ncollection_name = \"my_rag_collection\"\nfrom pymilvus import MilvusClient milvus_client = MilvusClient(uri=\"./milvus_demo.db\") collection_name = \"my_rag_collection\"
As for the argument of MilvusClient:
uri as a local file, e.g../milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.http://localhost:19530, as your uri.uri and token, which correspond to the Public Endpoint and Api key in Zilliz Cloud.Check if the collection already exists and drop it if it does.
In\u00a0[8]: Copied!if milvus_client.has_collection(collection_name):\n milvus_client.drop_collection(collection_name)\nif milvus_client.has_collection(collection_name): milvus_client.drop_collection(collection_name)
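As referenced above, here is a minimal sketch of the three connection styles; the server address, endpoint, and token are placeholders:

from pymilvus import MilvusClient

# Milvus Lite: store everything in a local file (what this notebook uses)
client_lite = MilvusClient(uri="./milvus_demo.db")

# Self-hosted Milvus server (Docker / Kubernetes): point the uri at the server
client_server = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud: use the Public Endpoint as uri and the API key as token
client_cloud = MilvusClient(uri="https://<public-endpoint>", token="<api-key>")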
Create a new collection with specified parameters.
If we don\u2019t specify any field information, Milvus will automatically create a default id field for the primary key, and a vector field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.
milvus_client.create_collection(\n collection_name=collection_name,\n dimension=embedding_dim,\n metric_type=\"IP\", # Inner product distance\n consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.\n)\nmilvus_client.create_collection( collection_name=collection_name, dimension=embedding_dim, metric_type=\"IP\", # Inner product distance consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details. ) In\u00a0[10]: Copied!
from tqdm import tqdm\n\ndata = []\n\nfor i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")):\n embedding = emb_text(chunk)\n data.append({\"id\": i, \"vector\": embedding, \"text\": chunk})\n\nmilvus_client.insert(collection_name=collection_name, data=data)\n from tqdm import tqdm data = [] for i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")): embedding = emb_text(chunk) data.append({\"id\": i, \"vector\": embedding, \"text\": chunk}) milvus_client.insert(collection_name=collection_name, data=data) Processing chunks: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 38/38 [00:14<00:00, 2.59it/s]\nOut[10]:
{'insert_count': 38, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], 'cost': 0} In\u00a0[11]: Copied! question = (\n \"What are the three deployment modes of Milvus, and what are their differences?\"\n)\nquestion = ( \"What are the three deployment modes of Milvus, and what are their differences?\" )
Search for the question in the collection and retrieve the semantic top-3 matches.
In\u00a0[12]: Copied!search_res = milvus_client.search(\n collection_name=collection_name,\n data=[emb_text(question)],\n limit=3,\n search_params={\"metric_type\": \"IP\", \"params\": {}},\n output_fields=[\"text\"],\n)\n search_res = milvus_client.search( collection_name=collection_name, data=[emb_text(question)], limit=3, search_params={\"metric_type\": \"IP\", \"params\": {}}, output_fields=[\"text\"], ) Let\u2019s take a look at the search results of the query
In\u00a0[13]: Copied!import json\n\nretrieved_lines_with_distances = [\n (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n]\nprint(json.dumps(retrieved_lines_with_distances, indent=4))\nimport json retrieved_lines_with_distances = [ (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0] ] print(json.dumps(retrieved_lines_with_distances, indent=4))
[\n [\n \"Milvus offers three deployment modes, covering a wide range of data scales\\u2014from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors:\",\n 0.6503315567970276\n ],\n [\n \"Milvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it\\u2019s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more.\",\n 0.6281915903091431\n ],\n [\n \"What is Milvus?\\nUnstructured Data, Embeddings, and Milvus\\nWhat Makes Milvus so Fast\\uff1f\\nWhat Makes Milvus so Scalable\\nTypes of Searches Supported by Milvus\\nComprehensive Feature Set\",\n 0.6117826700210571\n ]\n]\nIn\u00a0[14]: Copied!
context = \"\\n\".join(\n [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n)\ncontext = \"\\n\".join( [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances] )
Define system and user prompts for the Language Model. This prompt is assembled with the documents retrieved from Milvus.
In\u00a0[16]: Copied!SYSTEM_PROMPT = \"\"\"\nHuman: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n\"\"\"\nUSER_PROMPT = f\"\"\"\nUse the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n<context>\n{context}\n</context>\n<question>\n{question}\n</question>\n\"\"\"\n SYSTEM_PROMPT = \"\"\" Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided. \"\"\" USER_PROMPT = f\"\"\" Use the following pieces of information enclosed in tags to provide an answer to the question enclosed in tags. {context} {question} \"\"\" Use OpenAI ChatGPT to generate a response based on the prompts.
In\u00a0[17]: Copied!response = openai_client.chat.completions.create(\n model=\"gpt-4o\",\n messages=[\n {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n {\"role\": \"user\", \"content\": USER_PROMPT},\n ],\n)\nprint(response.choices[0].message.content)\n response = openai_client.chat.completions.create( model=\"gpt-4o\", messages=[ {\"role\": \"system\", \"content\": SYSTEM_PROMPT}, {\"role\": \"user\", \"content\": USER_PROMPT}, ], ) print(response.choices[0].message.content) The three deployment modes of Milvus are:\n\n1. **Milvus Lite**: This is a Python library that integrates easily into your applications. It's a lightweight version ideal for quick prototyping in Jupyter Notebooks or for running on edge devices with limited resources.\n\n2. **Milvus Standalone**: This mode is a single-machine server deployment where all components are bundled into a single Docker image, making it convenient to deploy.\n\n3. **Milvus Distributed**: This mode is designed for deployment on Kubernetes clusters. It features a cloud-native architecture suited for managing scenarios at a billion-scale or larger, ensuring redundancy in critical components.\n"},{"location":"examples/rag_milvus/#rag-with-milvus","title":"RAG with Milvus\u00b6","text":"Step Tech Execution Embedding OpenAI (text-embedding-3-small) \ud83c\udf10 Remote Vector store Milvus \ud83d\udcbb Local Gen AI OpenAI (gpt-4o) \ud83c\udf10 Remote"},{"location":"examples/rag_milvus/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"
This is a code recipe that uses Milvus, the world's most advanced open-source vector database, to perform RAG over documents parsed by Docling.
In this notebook, we accomplish the following:
Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running this notebook:
To start, install the required dependencies by running the following command:
"},{"location":"examples/rag_milvus/#gpu-checking","title":"GPU Checking\u00b6","text":""},{"location":"examples/rag_milvus/#setting-up-api-keys","title":"Setting Up API Keys\u00b6","text":"We will use OpenAI as the LLM in this example. You should prepare the OPENAI_API_KEY as an environment variable.
"},{"location":"examples/rag_milvus/#prepare-the-llm-and-embedding-model","title":"Prepare the LLM and Embedding Model\u00b6","text":"We initialize the OpenAI client to prepare the embedding model.
"},{"location":"examples/rag_milvus/#process-data-using-docling","title":"Process Data Using Docling\u00b6","text":"Docling can parse various document formats into a unified representation (Docling Document), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to the official documentation.
"},{"location":"examples/rag_milvus/#load-data-into-milvus","title":"Load Data into Milvus\u00b6","text":""},{"location":"examples/rag_milvus/#create-the-collection","title":"Create the collection\u00b6","text":"With data in hand, we can create a MilvusClient instance and insert the data into a Milvus collection.
Let\u2019s specify a query question about the website we just scraped.
"},{"location":"examples/rag_milvus/#use-llm-to-get-a-rag-response","title":"Use LLM to get a RAG response\u00b6","text":"Convert the retrieved documents into a string format.
"},{"location":"examples/rag_mongodb/","title":"RAG with MongoDB + VoyageAI","text":"Step Tech Execution Embedding Voyage AI \ud83c\udf10 Remote Vector store MongoDB \ud83c\udf10 Remote Gen AI Azure Open AI \ud83c\udf10 Remote In\u00a0[124]: Copied!%%capture\n%pip install docling~=\"2.7.0\"\n%pip install pymongo[srv]\n%pip install voyageai\n%pip install openai\n\nimport logging\nimport warnings\n\nwarnings.filterwarnings(\"ignore\")\nlogging.getLogger(\"pymongo\").setLevel(logging.ERROR)\n%%capture %pip install docling~=\"2.7.0\" %pip install pymongo[srv] %pip install voyageai %pip install openai import logging import warnings warnings.filterwarnings(\"ignore\") logging.getLogger(\"pymongo\").setLevel(logging.ERROR) In\u00a0[125]: Copied!
import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n import torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) MPS GPU is enabled.\nIn\u00a0[126]: Copied!
# Influential machine learning papers\nsource_urls = [\n \"https://arxiv.org/pdf/1706.03762\" # Attention is All You Need\n]\n# Influential machine learning papers source_urls = [ \"https://arxiv.org/pdf/1706.03762\" # Attention is All You Need ] In\u00a0[127]: Copied!
from pprint import pprint\n\nfrom docling.document_converter import DocumentConverter\n\n# Instantiate the doc converter\ndoc_converter = DocumentConverter()\n\n# Since we want to use a single document, we will convert just the first URL. For multiple documents, you can use convert_all() method and then iterate through the list of converted documents.\npdf_doc = source_urls[0]\nconverted_doc = doc_converter.convert(pdf_doc).document\nfrom pprint import pprint from docling.document_converter import DocumentConverter # Instantiate the doc converter doc_converter = DocumentConverter() # Since we want to use a single document, we will convert just the first URL. For multiple documents, you can use convert_all() method and then iterate through the list of converted documents. pdf_doc = source_urls[0] converted_doc = doc_converter.convert(pdf_doc).document
Fetching 9 files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 9/9 [00:00<00:00, 73728.00it/s]\nIn\u00a0[137]: Copied!
from docling_core.transforms.chunker import HierarchicalChunker\n\n# Initialize the chunker\nchunker = HierarchicalChunker()\n\n# Perform hierarchical chunking on the converted document and get text from chunks\nchunks = list(chunker.chunk(converted_doc))\nchunk_texts = [chunk.text for chunk in chunks]\nchunk_texts[:20] # Display a few chunk texts\nfrom docling_core.transforms.chunker import HierarchicalChunker # Initialize the chunker chunker = HierarchicalChunker() # Perform hierarchical chunking on the converted document and get text from chunks chunks = list(chunker.chunk(converted_doc)) chunk_texts = [chunk.text for chunk in chunks] chunk_texts[:20] # Display a few chunk texts Out[137]:
['arXiv:1706.03762v7 [cs.CL] 2 Aug 2023',\n 'Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.',\n 'Ashish Vaswani \u2217 Google Brain avaswani@google.com',\n 'Noam Shazeer \u2217 Google Brain noam@google.com',\n 'Niki Parmar \u2217 Google Research nikip@google.com',\n 'Jakob Uszkoreit \u2217 Google Research usz@google.com',\n 'Llion Jones \u2217 Google Research llion@google.com',\n 'Aidan N. Gomez \u2217 \u2020 University of Toronto aidan@cs.toronto.edu',\n '\u0141ukasz Kaiser \u2217 Google Brain lukaszkaiser@google.com',\n 'Illia Polosukhin \u2217 \u2021',\n 'illia.polosukhin@gmail.com',\n 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.',\n '$^{\u2217}$Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.',\n '$^{\u2020}$Work performed while at Google Brain.',\n '$^{\u2021}$Work performed while at Google Research.',\n '31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.',\n 'Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].',\n 'Recurrent models typically factor computation along the symbol positions of the input and output sequences. 
Aligning the positions to steps in computation time, they generate a sequence of hidden states h$_{t}$ , as a function of the previous hidden state h$_{t}$$_{-}$$_{1}$ and the input for position t . This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.',\n 'Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.',\n 'In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.'] We will be using VoyageAI embedding model for converting the above chunks to embeddings, thereafter pushing them to MongoDB for further consumption.
VoyageAI offers a range of embedding models; we will be using voyage-context-3 for best results in this case. It is a contextualized chunk embedding model, where each chunk embedding encodes not only the chunk\u2019s own content but also captures contextual information from the full document.
You can go through the blogpost to understand how it performs in comparison to other embedding models.
Create an account on Voyage and get your API key.
In\u00a0[\u00a0]: Copied!import voyageai\n\n# Voyage API key\nVOYAGE_API_KEY = \"**********************\"\n\n# Initialize the VoyageAI client\nvo = voyageai.Client(VOYAGE_API_KEY)\nresult = vo.contextualized_embed(inputs=[chunk_texts], model=\"voyage-context-3\")\ncontextualized_chunk_embds = [emb for r in result.results for emb in r.embeddings]\nimport voyageai # Voyage API key VOYAGE_API_KEY = \"**********************\" # Initialize the VoyageAI client vo = voyageai.Client(VOYAGE_API_KEY) result = vo.contextualized_embed(inputs=[chunk_texts], model=\"voyage-context-3\") contextualized_chunk_embds = [emb for r in result.results for emb in r.embeddings] In\u00a0[121]: Copied!
# Check lengths to ensure they match\nprint(\"Chunk Texts Length:\", len(chunk_texts))\nprint(\"Contextualized Chunk Embeddings Length:\", len(contextualized_chunk_embds))\n# Check lengths to ensure they match print(\"Chunk Texts Length:\", len(chunk_texts)) print(\"Contextualized Chunk Embeddings Length:\", len(contextualized_chunk_embds))
Chunk Texts Length: 118\nContextualized Chunk Embeddings Length: 118\nIn\u00a0[115]: Copied!
# Combine chunks with their embeddings\nchunk_data = [\n {\"text\": text, \"embedding\": emb}\n for text, emb in zip(chunk_texts, contextualized_chunk_embds)\n]\n # Combine chunks with their embeddings chunk_data = [ {\"text\": text, \"embedding\": emb} for text, emb in zip(chunk_texts, contextualized_chunk_embds) ] In\u00a0[\u00a0]: Copied! # Insert to MongoDB\nfrom pymongo import MongoClient\n\nclient = MongoClient(\n \"mongodb+srv://*******.mongodb.net/\"\n) # Replace with your MongoDB connection string\ndb = client[\"rag_db\"] # Database name\ncollection = db[\"documents\"] # Collection name\n\n# Insert chunk data into MongoDB\nresponse = collection.insert_many(chunk_data)\nprint(f\"Inserted {len(response.inserted_ids)} documents into MongoDB.\")\n # Insert to MongoDB from pymongo import MongoClient client = MongoClient( \"mongodb+srv://*******.mongodb.net/\" ) # Replace with your MongoDB connection string db = client[\"rag_db\"] # Database name collection = db[\"documents\"] # Collection name # Insert chunk data into MongoDB response = collection.insert_many(chunk_data) print(f\"Inserted {len(response.inserted_ids)} documents into MongoDB.\") Inserted 118 documents into MongoDB.\nIn\u00a0[117]: Copied!
from pymongo.operations import SearchIndexModel\n\n# Create your index model, then create the search index\nsearch_index_model = SearchIndexModel(\n definition={\n \"fields\": [\n {\n \"type\": \"vector\",\n \"path\": \"embedding\",\n \"numDimensions\": 1024,\n \"similarity\": \"dotProduct\",\n }\n ]\n },\n name=\"vector_index\",\n type=\"vectorSearch\",\n)\nresult = collection.create_search_index(model=search_index_model)\nprint(\"New search index named \" + result + \" is building.\")\n from pymongo.operations import SearchIndexModel # Create your index model, then create the search index search_index_model = SearchIndexModel( definition={ \"fields\": [ { \"type\": \"vector\", \"path\": \"embedding\", \"numDimensions\": 1024, \"similarity\": \"dotProduct\", } ] }, name=\"vector_index\", type=\"vectorSearch\", ) result = collection.create_search_index(model=search_index_model) print(\"New search index named \" + result + \" is building.\") New search index named vector_index is building.\nIn\u00a0[\u00a0]: Copied!
import os\n\nfrom openai import AzureOpenAI\nfrom rich.console import Console\nfrom rich.panel import Panel\n\n# Create MongoDB vector search query for \"Attention is All You Need\"\n# (prompt already defined above, reuse if present; else keep this definition)\nprompt = \"Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context.\"\n\n# Generate embedding for the query using VoyageAI (vo already initialized earlier)\nquery_embd_context = (\n vo.contextualized_embed(\n inputs=[[prompt]], model=\"voyage-context-3\", input_type=\"query\"\n )\n .results[0]\n .embeddings[0]\n)\n\n# Vector search pipeline\nsearch_pipeline = [\n {\n \"$vectorSearch\": {\n \"index\": \"vector_index\",\n \"path\": \"embedding\",\n \"queryVector\": query_embd_context,\n \"numCandidates\": 10,\n \"limit\": 10,\n }\n },\n {\"$project\": {\"text\": 1, \"_id\": 0, \"score\": {\"$meta\": \"vectorSearchScore\"}}},\n]\n\nresults = list(collection.aggregate(search_pipeline))\nif not results:\n raise ValueError(\n \"No vector search results returned. Verify the index is built before querying.\"\n )\n\ncontext_texts = [doc[\"text\"] for doc in results]\ncombined_context = \"\\n\\n\".join(context_texts)\n\n# Expect these environment variables to be set (do NOT hardcode secrets):\n# AZURE_OPENAI_API_KEY\n# AZURE_OPENAI_ENDPOINT -> e.g. https://your-resource-name.openai.azure.com/\n# AZURE_OPENAI_API_VERSION (optional, else fallback)\nAZURE_OPENAI_API_KEY = \"**********************\"\nAZURE_OPENAI_ENDPOINT = \"**********************\"\nAZURE_OPENAI_API_VERSION = \"**********************\"\n\n# Initialize Azure OpenAI client (endpoint must NOT include path segments)\nclient = AzureOpenAI(\n api_key=AZURE_OPENAI_API_KEY,\n azure_endpoint=AZURE_OPENAI_ENDPOINT.rstrip(\"/\"),\n api_version=AZURE_OPENAI_API_VERSION,\n)\n\n# Chat completion using retrieved context\nresponse = client.chat.completions.create(\n model=\"gpt-4o-mini\", # Azure deployment name\n messages=[\n {\n \"role\": \"system\",\n \"content\": \"You are a helpful assistant. Use only the provided context to answer questions. 
If the context is insufficient, say so.\",\n },\n {\n \"role\": \"user\",\n \"content\": f\"Context:\\n{combined_context}\\n\\nQuestion: {prompt}\",\n },\n ],\n temperature=0.2,\n)\n\nresponse_text = response.choices[0].message.content\n\nconsole = Console()\nconsole.print(Panel(f\"{prompt}\", title=\"Prompt\", border_style=\"bold red\"))\nconsole.print(\n Panel(response_text, title=\"Generated Content\", border_style=\"bold green\")\n)\n import os from openai import AzureOpenAI from rich.console import Console from rich.panel import Panel # Create MongoDB vector search query for \"Attention is All You Need\" # (prompt already defined above, reuse if present; else keep this definition) prompt = \"Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context.\" # Generate embedding for the query using VoyageAI (vo already initialized earlier) query_embd_context = ( vo.contextualized_embed( inputs=[[prompt]], model=\"voyage-context-3\", input_type=\"query\" ) .results[0] .embeddings[0] ) # Vector search pipeline search_pipeline = [ { \"$vectorSearch\": { \"index\": \"vector_index\", \"path\": \"embedding\", \"queryVector\": query_embd_context, \"numCandidates\": 10, \"limit\": 10, } }, {\"$project\": {\"text\": 1, \"_id\": 0, \"score\": {\"$meta\": \"vectorSearchScore\"}}}, ] results = list(collection.aggregate(search_pipeline)) if not results: raise ValueError( \"No vector search results returned. Verify the index is built before querying.\" ) context_texts = [doc[\"text\"] for doc in results] combined_context = \"\\n\\n\".join(context_texts) # Expect these environment variables to be set (do NOT hardcode secrets): # AZURE_OPENAI_API_KEY # AZURE_OPENAI_ENDPOINT -> e.g. https://your-resource-name.openai.azure.com/ # AZURE_OPENAI_API_VERSION (optional, else fallback) AZURE_OPENAI_API_KEY = \"**********************\" AZURE_OPENAI_ENDPOINT = \"**********************\" AZURE_OPENAI_API_VERSION = \"**********************\" # Initialize Azure OpenAI client (endpoint must NOT include path segments) client = AzureOpenAI( api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT.rstrip(\"/\"), api_version=AZURE_OPENAI_API_VERSION, ) # Chat completion using retrieved context response = client.chat.completions.create( model=\"gpt-4o-mini\", # Azure deployment name messages=[ { \"role\": \"system\", \"content\": \"You are a helpful assistant. Use only the provided context to answer questions. 
If the context is insufficient, say so.\", }, { \"role\": \"user\", \"content\": f\"Context:\\n{combined_context}\\n\\nQuestion: {prompt}\", }, ], temperature=0.2, ) response_text = response.choices[0].message.content console = Console() console.print(Panel(f\"{prompt}\", title=\"Prompt\", border_style=\"bold red\")) console.print( Panel(response_text, title=\"Generated Content\", border_style=\"bold green\") ) \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 1. **Introduction of the Transformer Architecture**: The Transformer model is a novel architecture that relies \u2502\n\u2502 entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This allows for \u2502\n\u2502 significantly more parallelization during training and leads to superior performance in tasks such as machine \u2502\n\u2502 translation. \u2502\n\u2502 \u2502\n\u2502 2. **Performance and Efficiency**: The Transformer achieves state-of-the-art results on machine translation \u2502\n\u2502 tasks, such as a BLEU score of 28.4 on the WMT 2014 English-to-German task and 41.8 on the English-to-French \u2502\n\u2502 task, while requiring much less training time (3.5 days on eight GPUs) compared to previous models. This \u2502\n\u2502 demonstrates the efficiency and effectiveness of the architecture. \u2502\n\u2502 \u2502\n\u2502 3. **Self-Attention Mechanism**: The self-attention layers in both the encoder and decoder allow for each \u2502\n\u2502 position to attend to all other positions in the sequence, enabling the model to capture global dependencies. \u2502\n\u2502 This mechanism is more computationally efficient than recurrent layers, which require sequential operations, \u2502\n\u2502 thus improving the model's speed and scalability. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
This notebook demonstrated a powerful RAG pipeline using MongoDB, VoyageAI, and Azure OpenAI. By combining MongoDB's vector search capabilities with VoyageAI's embeddings and Azure OpenAI's language models, we created an intelligent document retrieval system.
"},{"location":"examples/rag_mongodb/#rag-with-mongodb-voyageai","title":"RAG with MongoDB + VoyageAI\u00b6","text":""},{"location":"examples/rag_mongodb/#how-to-cook","title":"How to cook\u00b6","text":"This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) pipeline using MongoDB as a vector store and Voyage AI embedding models for semantic search. The workflow involves extracting and chunking text from documents, generating embeddings with Voyage AI, storing vectors in MongoDB, and leveraging OpenAI for generative responses.
By combining these technologies, you can build scalable, production-ready RAG systems for advanced document understanding and question answering.
"},{"location":"examples/rag_mongodb/#setting-up-your-environment","title":"Setting Up Your Environment\u00b6","text":"First, we'll install the necessary libraries and configure our environment. These packages enable document processing, database connections, embedding generation, and AI model interaction. We're using Docling for document handling, PyMongo for MongoDB integration, VoyageAI for embeddings, and OpenAI client for generation capabilities.
"},{"location":"examples/rag_mongodb/#part-1-setting-up-docling","title":"Part 1: Setting up Docling\u00b6","text":"Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
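A minimal sketch of such a check, following the same pattern used in the OpenSearch recipe later in this document:

```python
import torch

# Prefer CUDA, fall back to Apple MPS, otherwise raise so the environment can be fixed.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    raise OSError(
        "No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
    )
```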
"},{"location":"examples/rag_mongodb/#single-document-rag-baseline","title":"Single-Document RAG Baseline\u00b6","text":"To begin, we will focus on a single seminal paper and treat it as the entire knowledge base. Building a Retrieval-Augmented Generation (RAG) pipeline on just one document serves as a clear, controlled baseline before scaling to multiple sources. This helps validate each stage of the workflow (parsing, chunking, embedding, retrieval, generation) without confounding factors introduced by inter-document noise.
"},{"location":"examples/rag_mongodb/#convert-source-documents-to-markdown","title":"Convert Source Documents to Markdown\u00b6","text":"Convert each source URL to Markdown with Docling, reusing any already-converted document to avoid redundant downloads/parsing. Produces a dict mapping URLs to their Markdown content.
There are other export methods that can be used as well, such as JSON or HTML, if Markdown is not the desired output format.
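A minimal sketch of the conversion-and-caching step described above, assuming a SOURCES list of URLs and an in-memory converted_docs cache (both names are illustrative, not necessarily the notebook's):

```python
from docling.document_converter import DocumentConverter

# Illustrative inputs; the notebook's actual variable names may differ.
SOURCES = ["https://arxiv.org/pdf/1706.03762"]  # "Attention Is All You Need"
converted_docs: dict[str, str] = {}

converter = DocumentConverter()
for url in SOURCES:
    if url in converted_docs:
        continue  # reuse the already-converted document to avoid redundant downloads/parsing
    result = converter.convert(url)
    converted_docs[url] = result.document.export_to_markdown()

print({url: len(md) for url, md in converted_docs.items()})
```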
"},{"location":"examples/rag_mongodb/#post-process-extracted-document-data","title":"Post-process extracted document data\u00b6","text":"We use Docling's HierarchicalChunker() to perform hierarchy-aware chunking of our list of documents. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.
With the generated embeddings prepared, we now insert them into MongoDB so they can be leveraged in the RAG pipeline.
MongoDB is a natural fit as a vector store for RAG applications: Atlas Vector Search provides similarity queries over embeddings, while the flexible document model lets us store each chunk's text, embedding, and metadata together in a single document.
The chunks with their embeddings will be stored in a MongoDB collection, allowing us to perform similarity searches when responding to user queries.
"},{"location":"examples/rag_mongodb/#creating-atlas-vector-search-index","title":"Creating Atlas Vector search index\u00b6","text":"Using pymongo we can create a vector index, that will help us search through our vectors and respond to user queries. This index is crucial for efficient similarity searches between user questions and our document chunks. MongoDB Atlas Vector Search provides fast and accurate retrieval of semantically related content, which forms the foundation of our RAG pipeline.
To perform a query on the vectorized data stored in MongoDB, we can use the $vectorSearch aggregation stage. This powerful feature of MongoDB Atlas enables semantic search capabilities by finding documents based on vector similarity.
When executing a vector search query, the user's question is first embedded with the same Voyage model used for the document chunks; the $vectorSearch stage then compares that query vector against the stored embeddings and returns the closest matches, ranked by their similarity score.
This enables us to find semantically related content rather than relying on exact keyword matches. The similarity metric we're using (dot product) is equivalent to cosine similarity for embeddings normalized to unit length, allowing us to identify content that is conceptually similar even if it uses different terminology.
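As a quick numerical illustration of why dot product and cosine similarity agree for unit-length vectors (a standalone sanity check using numpy, not part of the pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors normalized to unit length, mimicking normalized embeddings.
a = rng.normal(size=1024)
a /= np.linalg.norm(a)
b = rng.normal(size=1024)
b /= np.linalg.norm(b)

dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(abs(dot - cosine) < 1e-12)  # True: identical for unit-norm vectors
```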
For RAG applications, this vector search capability is crucial as it allows us to retrieve the most relevant context from our document collection based on the semantic meaning of a user's query, providing the foundation for generating accurate and contextually appropriate responses.
"},{"location":"examples/rag_mongodb/#part-4-perform-rag-on-parsed-articles","title":"Part 4: Perform RAG on parsed articles\u00b6","text":"We specify a prompt that includes the field we want to search through in the database (in this case it's text), a query that includes our search term, and the number of retrieved results to use in the generation.
Start building your own intelligent document retrieval system today!
"},{"location":"examples/rag_opensearch/","title":"RAG with OpenSearch","text":"In\u00a0[1]: Copied!import os\n\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\n! uv pip install -q --no-progress notebook ipywidgets docling llama-index-readers-file llama-index-readers-docling llama-index-readers-elasticsearch llama-index-node-parser-docling llama-index-vector-stores-opensearch llama-index-embeddings-huggingface llama-index-llms-ollama\nimport os os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" ! uv pip install -q --no-progress notebook ipywidgets docling llama-index-readers-file llama-index-readers-docling llama-index-readers-elasticsearch llama-index-node-parser-docling llama-index-vector-stores-opensearch llama-index-embeddings-huggingface llama-index-llms-ollama
We now import all the necessary modules for this notebook:
In\u00a0[2]: Copied!import logging\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nimport requests\nimport torch\nfrom docling_core.transforms.chunker import HierarchicalChunker\nfrom docling_core.transforms.chunker.hierarchical_chunker import (\n ChunkingDocSerializer,\n ChunkingSerializerProvider,\n)\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom docling_core.transforms.serializer.markdown import MarkdownTableSerializer\nfrom llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex\nfrom llama_index.core.data_structs import Node\nfrom llama_index.core.response_synthesizers import get_response_synthesizer\nfrom llama_index.core.schema import NodeWithScore, TransformComponent\nfrom llama_index.core.vector_stores import MetadataFilter, MetadataFilters\nfrom llama_index.core.vector_stores.types import VectorStoreQueryMode\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.ollama import Ollama\nfrom llama_index.node_parser.docling import DoclingNodeParser\nfrom llama_index.readers.docling import DoclingReader\nfrom llama_index.readers.elasticsearch import ElasticsearchReader\nfrom llama_index.vector_stores.opensearch import (\n OpensearchVectorClient,\n OpensearchVectorStore,\n)\nfrom rich.console import Console\nfrom rich.pretty import pprint\nfrom transformers import AutoTokenizer\n\nfrom docling.chunking import HybridChunker\n\nlogging.getLogger().setLevel(logging.WARNING)\nimport logging from pathlib import Path from tempfile import mkdtemp import requests import torch from docling_core.transforms.chunker import HierarchicalChunker from docling_core.transforms.chunker.hierarchical_chunker import ( ChunkingDocSerializer, ChunkingSerializerProvider, ) from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from docling_core.transforms.serializer.markdown import MarkdownTableSerializer from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex from llama_index.core.data_structs import Node from llama_index.core.response_synthesizers import get_response_synthesizer from llama_index.core.schema import NodeWithScore, TransformComponent from llama_index.core.vector_stores import MetadataFilter, MetadataFilters from llama_index.core.vector_stores.types import VectorStoreQueryMode from llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.ollama import Ollama from llama_index.node_parser.docling import DoclingNodeParser from llama_index.readers.docling import DoclingReader from llama_index.readers.elasticsearch import ElasticsearchReader from llama_index.vector_stores.opensearch import ( OpensearchVectorClient, OpensearchVectorStore, ) from rich.console import Console from rich.pretty import pprint from transformers import AutoTokenizer from docling.chunking import HybridChunker logging.getLogger().setLevel(logging.WARNING)
/Users/ceb/git/docling/.venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validate_default' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'validate_default' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.\n warnings.warn(\n
Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks if a GPU is available, either via CUDA or MPS.
In\u00a0[3]: Copied!# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) MPS GPU is enabled.\nIn\u00a0[4]: Copied!
response = requests.get(\"http://localhost:9200\")\nprint(response.text)\nresponse = requests.get(\"http://localhost:9200\") print(response.text)
{\n \"name\" : \"b20d8368e745\",\n \"cluster_name\" : \"docker-cluster\",\n \"cluster_uuid\" : \"0gEZCJQwRHabS_E-n_3i9g\",\n \"version\" : {\n \"distribution\" : \"opensearch\",\n \"number\" : \"3.0.0\",\n \"build_type\" : \"tar\",\n \"build_hash\" : \"dc4efa821904cc2d7ea7ef61c0f577d3fc0d8be9\",\n \"build_date\" : \"2025-05-03T06:23:50.311109522Z\",\n \"build_snapshot\" : false,\n \"lucene_version\" : \"10.1.0\",\n \"minimum_wire_compatibility_version\" : \"2.19.0\",\n \"minimum_index_compatibility_version\" : \"2.0.0\"\n },\n \"tagline\" : \"The OpenSearch Project: https://opensearch.org/\"\n}\n\n In\u00a0[5]: Copied! # http endpoint for your cluster\nOPENSEARCH_ENDPOINT = \"http://localhost:9200\"\n# index to store the Docling document vectors\nOPENSEARCH_INDEX = \"docling-index\"\n# the embedding model\nEMBED_MODEL = HuggingFaceEmbedding(\n model_name=\"ibm-granite/granite-embedding-30m-english\"\n)\n# maximum chunk size in tokens\nEMBED_MAX_TOKENS = 200\n# the generation model\nGEN_MODEL = Ollama(\n model=\"granite4:tiny-h\",\n request_timeout=120.0,\n # Manually set the context window to limit memory usage\n context_window=8000,\n # Set temperature to 0 for reproducibility of the results\n temperature=0.0,\n)\n# a sample document\nSOURCE = \"https://arxiv.org/pdf/2408.09869\"\n\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\nprint(f\"The embedding dimension is {embed_dim}.\")\n # http endpoint for your cluster OPENSEARCH_ENDPOINT = \"http://localhost:9200\" # index to store the Docling document vectors OPENSEARCH_INDEX = \"docling-index\" # the embedding model EMBED_MODEL = HuggingFaceEmbedding( model_name=\"ibm-granite/granite-embedding-30m-english\" ) # maximum chunk size in tokens EMBED_MAX_TOKENS = 200 # the generation model GEN_MODEL = Ollama( model=\"granite4:tiny-h\", request_timeout=120.0, # Manually set the context window to limit memory usage context_window=8000, # Set temperature to 0 for reproducibility of the results temperature=0.0, ) # a sample document SOURCE = \"https://arxiv.org/pdf/2408.09869\" embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\")) print(f\"The embedding dimension is {embed_dim}.\") The embedding dimension is 384.\n
In this recipe, we will use a single PDF file, the Docling Technical Report. We will process it using the Hybrid Chunker provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks.
In\u00a0[6]: Copied!tmp_dir_path = Path(mkdtemp())\nreq = requests.get(SOURCE)\nwith open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n out_file.write(req.content)\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\ndir_reader = SimpleDirectoryReader(\n input_dir=tmp_dir_path,\n file_extractor={\".pdf\": reader},\n)\n\n# load the PDF files\ndocuments = dir_reader.load_data()\n tmp_dir_path = Path(mkdtemp()) req = requests.get(SOURCE) with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file: out_file.write(req.content) reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) dir_reader = SimpleDirectoryReader( input_dir=tmp_dir_path, file_extractor={\".pdf\": reader}, ) # load the PDF files documents = dir_reader.load_data() In\u00a0[7]: Copied! # create the hybrid chunker\ntokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL.model_name),\n max_tokens=EMBED_MAX_TOKENS,\n)\nchunker = HybridChunker(tokenizer=tokenizer)\n\n# create a Docling node parser\nnode_parser = DoclingNodeParser(chunker=chunker)\n\n\n# create a custom transformation to avoid out-of-range integers\nclass MetadataTransform(TransformComponent):\n def __call__(self, nodes, **kwargs):\n for node in nodes:\n binary_hash = node.metadata.get(\"origin\", {}).get(\"binary_hash\", None)\n if binary_hash is not None:\n node.metadata[\"origin\"][\"binary_hash\"] = str(binary_hash)\n return nodes\n # create the hybrid chunker tokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL.model_name), max_tokens=EMBED_MAX_TOKENS, ) chunker = HybridChunker(tokenizer=tokenizer) # create a Docling node parser node_parser = DoclingNodeParser(chunker=chunker) # create a custom transformation to avoid out-of-range integers class MetadataTransform(TransformComponent): def __call__(self, nodes, **kwargs): for node in nodes: binary_hash = node.metadata.get(\"origin\", {}).get(\"binary_hash\", None) if binary_hash is not None: node.metadata[\"origin\"][\"binary_hash\"] = str(binary_hash) return nodes In\u00a0[8]: Copied! # OpensearchVectorClient stores text in this field by default\ntext_field = \"content\"\n# OpensearchVectorClient stores embeddings in this field by default\nembed_field = \"embedding\"\n\nclient = OpensearchVectorClient(\n endpoint=OPENSEARCH_ENDPOINT,\n index=OPENSEARCH_INDEX,\n dim=embed_dim,\n engine=\"faiss\",\n embedding_field=embed_field,\n text_field=text_field,\n)\n\nvector_store = OpensearchVectorStore(client)\nstorage_context = StorageContext.from_defaults(vector_store=vector_store)\n\nindex = VectorStoreIndex.from_documents(\n documents=documents,\n transformations=[node_parser, MetadataTransform()],\n storage_context=storage_context,\n embed_model=EMBED_MODEL,\n)\n# OpensearchVectorClient stores text in this field by default text_field = \"content\" # OpensearchVectorClient stores embeddings in this field by default embed_field = \"embedding\" client = OpensearchVectorClient( endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX, dim=embed_dim, engine=\"faiss\", embedding_field=embed_field, text_field=text_field, ) vector_store = OpensearchVectorStore(client) storage_context = StorageContext.from_defaults(vector_store=vector_store) index = VectorStoreIndex.from_documents( documents=documents, transformations=[node_parser, MetadataTransform()], storage_context=storage_context, embed_model=EMBED_MODEL, )
2025-10-24 15:05:49,841 - WARNING - GET http://localhost:9200/docling-index [status:404 request:0.006s]\nIn\u00a0[9]: Copied!
console = Console(width=88)\n\nQUERY = \"Which are the main AI models in Docling?\"\nquery_engine = index.as_query_engine(llm=GEN_MODEL)\nres = query_engine.query(QUERY)\n\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n console = Console(width=88) QUERY = \"Which are the main AI models in Docling?\" query_engine = index.as_query_engine(llm=GEN_MODEL) res = query_engine.query(QUERY) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: Which are the main AI models in Docling?\n\ud83e\udd16: The two main AI models used in Docling are:\n\n1. A layout analysis model, an accurate object-detector for page elements \n2. TableFormer, a state-of-the-art table structure recognition model\n\nThese models were initially released as part of the open-source Docling package to help \nwith document understanding tasks.\nIn\u00a0[10]: Copied!
QUERY = \"What is the time to solution with the native backend on Intel?\"\nquery_engine = index.as_query_engine(llm=GEN_MODEL)\nres = query_engine.query(QUERY)\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n QUERY = \"What is the time to solution with the native backend on Intel?\" query_engine = index.as_query_engine(llm=GEN_MODEL) res = query_engine.query(QUERY) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: What is the time to solution with the native backend on Intel?\n\ud83e\udd16: The time to solution (TTS) for the native backend on Intel is:\n- For Apple M3 Max (16 cores): 375 seconds \n- For Intel(R) Xeon E5-2690, native backend: 244 seconds\n\nSo the TTS with the native backend on Intel ranges from approximately 244 to 375 seconds\ndepending on the specific configuration.\n
The result above was generated with the table serialized in a triplet format. Language models may perform better on complex tables if the structure is represented in a format that is widely adopted, like markdown.
For this purpose, we can leverage a custom serializer that transforms tables into Markdown format:
In\u00a0[11]: Copied!class MDTableSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n # configuring a different table serializer\n table_serializer=MarkdownTableSerializer(),\n )\n\n\n# clear the database from the previous chunks\nclient.clear()\nvector_store.clear()\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n max_tokens=EMBED_MAX_TOKENS,\n serializer_provider=MDTableSerializerProvider(),\n)\nnode_parser = DoclingNodeParser(chunker=chunker)\nindex = VectorStoreIndex.from_documents(\n documents=documents,\n transformations=[node_parser, MetadataTransform()],\n storage_context=storage_context,\n embed_model=EMBED_MODEL,\n)\nclass MDTableSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, # configuring a different table serializer table_serializer=MarkdownTableSerializer(), ) # clear the database from the previous chunks client.clear() vector_store.clear() chunker = HybridChunker( tokenizer=tokenizer, max_tokens=EMBED_MAX_TOKENS, serializer_provider=MDTableSerializerProvider(), ) node_parser = DoclingNodeParser(chunker=chunker) index = VectorStoreIndex.from_documents( documents=documents, transformations=[node_parser, MetadataTransform()], storage_context=storage_context, embed_model=EMBED_MODEL, )
Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors\nIn\u00a0[12]: Copied!
query_engine = index.as_query_engine(llm=GEN_MODEL)\nres = query_engine.query(QUERY)\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n query_engine = index.as_query_engine(llm=GEN_MODEL) res = query_engine.query(QUERY) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: What is the time to solution with the native backend on Intel?\n\ud83e\udd16: The table shows that for the native backend on Intel systems, the time-to-solution \n(TTS) ranges from 239 seconds to 375 seconds. Specifically:\n- With 4 threads, the TTS is 239 seconds.\n- With 16 threads, the TTS is 244 seconds.\n\nSo the time to solution with the native backend on Intel varies between approximately \n239 and 375 seconds depending on the thread budget used.\n
Observe that the generated response is now more accurate. Refer to the Advanced chunking & serialization example for more details on serialization strategies.
In\u00a0[13]: Copied!def display_nodes(nodes):\n res = []\n for idx, item in enumerate(nodes):\n doc_res = {\"k\": idx + 1, \"score\": item.score, \"text\": item.text, \"items\": []}\n doc_items = item.metadata[\"doc_items\"]\n for doc in doc_items:\n doc_res[\"items\"].append({\"ref\": doc[\"self_ref\"], \"label\": doc[\"label\"]})\n res.append(doc_res)\n pprint(res, max_string=200)\n def display_nodes(nodes): res = [] for idx, item in enumerate(nodes): doc_res = {\"k\": idx + 1, \"score\": item.score, \"text\": item.text, \"items\": []} doc_items = item.metadata[\"doc_items\"] for doc in doc_items: doc_res[\"items\"].append({\"ref\": doc[\"self_ref\"], \"label\": doc[\"label\"]}) res.append(doc_res) pprint(res, max_string=200) In\u00a0[14]: Copied! retriever = index.as_retriever(similarity_top_k=1)\n\nQUERY = \"How does pypdfium perform?\"\nnodes = retriever.retrieve(QUERY)\n\nprint(QUERY)\ndisplay_nodes(nodes)\nretriever = index.as_retriever(similarity_top_k=1) QUERY = \"How does pypdfium perform?\" nodes = retriever.retrieve(QUERY) print(QUERY) display_nodes(nodes)
How does pypdfium perform?\n
[\n\u2502 {\n\u2502 \u2502 'k': 1,\n\u2502 \u2502 'score': 0.694972,\n\u2502 \u2502 'text': '- [13] B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. Doclaynet: a large humanannotated dataset for document-layout segmentation. pages 3743-3751, 2022.\\n- [14] pypdf Maintainers. pypdf: '+314,\n\u2502 \u2502 'items': [\n\u2502 \u2502 \u2502 {'ref': '#/texts/93', 'label': 'list_item'},\n\u2502 \u2502 \u2502 {'ref': '#/texts/94', 'label': 'list_item'},\n\u2502 \u2502 \u2502 {'ref': '#/texts/95', 'label': 'list_item'},\n\u2502 \u2502 \u2502 {'ref': '#/texts/96', 'label': 'list_item'}\n\u2502 \u2502 ]\n\u2502 }\n]\n We may want to restrict the retrieval to only those chunks containing tabular data, expecting to retrieve more quantitative information for our type of question:
In\u00a0[15]: Copied!filters = MetadataFilters(\n filters=[MetadataFilter(key=\"doc_items.label\", value=\"table\")]\n)\n\ntable_retriever = index.as_retriever(filters=filters, similarity_top_k=1)\nnodes = table_retriever.retrieve(QUERY)\n\nprint(QUERY)\ndisplay_nodes(nodes)\nfilters = MetadataFilters( filters=[MetadataFilter(key=\"doc_items.label\", value=\"table\")] ) table_retriever = index.as_retriever(filters=filters, similarity_top_k=1) nodes = table_retriever.retrieve(QUERY) print(QUERY) display_nodes(nodes)
How does pypdfium perform?\n
[\n\u2502 {\n\u2502 \u2502 'k': 1,\n\u2502 \u2502 'score': 0.6238112,\n\u2502 \u2502 'text': 'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TT'+515,\n\u2502 \u2502 'items': [{'ref': '#/tables/0', 'label': 'table'}, {'ref': '#/tables/0', 'label': 'table'}]\n\u2502 }\n]\n In\u00a0[16]: Copied! url = f\"{OPENSEARCH_ENDPOINT}/_search/pipeline/rrf-pipeline\"\nheaders = {\"Content-Type\": \"application/json\"}\nbody = {\n \"description\": \"Post processor for hybrid RRF search\",\n \"phase_results_processors\": [\n {\"score-ranker-processor\": {\"combination\": {\"technique\": \"rrf\"}}}\n ],\n}\n\nresponse = requests.put(url, json=body, headers=headers)\nprint(response.text)\n url = f\"{OPENSEARCH_ENDPOINT}/_search/pipeline/rrf-pipeline\" headers = {\"Content-Type\": \"application/json\"} body = { \"description\": \"Post processor for hybrid RRF search\", \"phase_results_processors\": [ {\"score-ranker-processor\": {\"combination\": {\"technique\": \"rrf\"}}} ], } response = requests.put(url, json=body, headers=headers) print(response.text) {\"acknowledged\":true}\n We can then repeat the previous steps to get a VectorStoreIndex object, leveraging the search pipeline that we just created:
client_rrf = OpensearchVectorClient(\n endpoint=OPENSEARCH_ENDPOINT,\n index=f\"{OPENSEARCH_INDEX}-rrf\",\n dim=embed_dim,\n engine=\"faiss\",\n embedding_field=embed_field,\n text_field=text_field,\n search_pipeline=\"rrf-pipeline\",\n)\n\nvector_store_rrf = OpensearchVectorStore(client_rrf)\nstorage_context_rrf = StorageContext.from_defaults(vector_store=vector_store_rrf)\nindex_hybrid = VectorStoreIndex.from_documents(\n documents=documents,\n transformations=[node_parser, MetadataTransform()],\n storage_context=storage_context_rrf,\n embed_model=EMBED_MODEL,\n)\n client_rrf = OpensearchVectorClient( endpoint=OPENSEARCH_ENDPOINT, index=f\"{OPENSEARCH_INDEX}-rrf\", dim=embed_dim, engine=\"faiss\", embedding_field=embed_field, text_field=text_field, search_pipeline=\"rrf-pipeline\", ) vector_store_rrf = OpensearchVectorStore(client_rrf) storage_context_rrf = StorageContext.from_defaults(vector_store=vector_store_rrf) index_hybrid = VectorStoreIndex.from_documents( documents=documents, transformations=[node_parser, MetadataTransform()], storage_context=storage_context_rrf, embed_model=EMBED_MODEL, ) 2025-10-24 15:06:05,175 - WARNING - GET http://localhost:9200/docling-index-rrf [status:404 request:0.001s]\n
The first retriever, which relies entirely on semantic (vector) search, fails to surface the supporting chunk for the given question in the top 1 position. Note that we highlight a few expected keywords for illustration purposes.
In\u00a0[18]: Copied!QUERY = \"Does Docling project provide a Dockerfile?\"\nretriever = index.as_retriever(similarity_top_k=3)\nnodes = retriever.retrieve(QUERY)\nexp = \"Docling also provides a Dockerfile\"\nstart = \"[bold yellow]\"\nend = \"[/]\"\nfor idx, item in enumerate(nodes):\n console.print(\n f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n )\n QUERY = \"Does Docling project provide a Dockerfile?\" retriever = index.as_retriever(similarity_top_k=3) nodes = retriever.retrieve(QUERY) exp = \"Docling also provides a Dockerfile\" start = \"[bold yellow]\" end = \"[/]\" for idx, item in enumerate(nodes): console.print( f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\" ) *** k=1 ***\nDocling is designed to allow easy extension of the model library and pipelines. In the \nfuture, we plan to extend Docling with several more models, such as a figure-classifier \nmodel, an equationrecognition model, a code-recognition model and more. This will help \nimprove the quality of conversion for specific types of content, as well as augment \nextracted document metadata with additional information. Further investment into testing\nand optimizing GPU acceleration as well as improving the Docling-native PDF backend are \non our roadmap, too.\nWe encourage everyone to propose or implement additional features and models, and will \ngladly take your inputs and contributions under review . The codebase of Docling is open\nfor use and contribution, under the MIT license agreement and in alignment with our \ncontributing guidelines included in the Docling repository. If you use Docling in your \nprojects, please consider citing this technical report.\n
*** k=2 ***\nIn the final pipeline stage, Docling assembles all prediction results produced on each \npage into a well-defined datatype that encapsulates a converted document, as defined in \nthe auxiliary package docling-core . The generated document object is passed through a \npost-processing model which leverages several algorithms to augment features, such as \ndetection of the document language, correcting the reading order, matching figures with \ncaptions and labelling metadata such as title, authors and references. The final output \ncan then be serialized to JSON or transformed into a Markdown representation at the \nusers request.\n
*** k=3 ***\n```\nsource = \"https://arxiv.org/pdf/2206.01062\" # PDF path or URL converter = \nDocumentConverter() result = converter.convert_single(source) \nprint(result.render_as_markdown()) # output: \"## DocLayNet: A Large Human -Annotated \nDataset for Document -Layout Analysis [...]\"\n```\nOptionally, you can configure custom pipeline features and runtime options, such as \nturning on or off features (e.g. OCR, table structure recognition), enforcing limits on \nthe input document size, and defining the budget of CPU threads. Advanced usage examples\nand options are documented in the README file. Docling also provides a Dockerfile to \ndemonstrate how to install and run it inside a container.\n
However, the retriever with the hybrid search pipeline effectively recognizes the key paragraph in the first position:
In\u00a0[19]: Copied!retriever_rrf = index_hybrid.as_retriever(\n vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3\n)\nnodes = retriever_rrf.retrieve(QUERY)\nfor idx, item in enumerate(nodes):\n console.print(\n f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n )\n retriever_rrf = index_hybrid.as_retriever( vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3 ) nodes = retriever_rrf.retrieve(QUERY) for idx, item in enumerate(nodes): console.print( f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\" ) *** k=1 ***\n```\nsource = \"https://arxiv.org/pdf/2206.01062\" # PDF path or URL converter = \nDocumentConverter() result = converter.convert_single(source) \nprint(result.render_as_markdown()) # output: \"## DocLayNet: A Large Human -Annotated \nDataset for Document -Layout Analysis [...]\"\n```\nOptionally, you can configure custom pipeline features and runtime options, such as \nturning on or off features (e.g. OCR, table structure recognition), enforcing limits on \nthe input document size, and defining the budget of CPU threads. Advanced usage examples\nand options are documented in the README file. Docling also provides a Dockerfile to \ndemonstrate how to install and run it inside a container.\n
*** k=2 ***\nDocling is designed to allow easy extension of the model library and pipelines. In the \nfuture, we plan to extend Docling with several more models, such as a figure-classifier \nmodel, an equationrecognition model, a code-recognition model and more. This will help \nimprove the quality of conversion for specific types of content, as well as augment \nextracted document metadata with additional information. Further investment into testing\nand optimizing GPU acceleration as well as improving the Docling-native PDF backend are \non our roadmap, too.\nWe encourage everyone to propose or implement additional features and models, and will \ngladly take your inputs and contributions under review . The codebase of Docling is open\nfor use and contribution, under the MIT license agreement and in alignment with our \ncontributing guidelines included in the Docling repository. If you use Docling in your \nprojects, please consider citing this technical report.\n
*** k=3 ***\nWe therefore decided to provide multiple backend choices, and additionally open-source a\ncustombuilt PDF parser, which is based on the low-level qpdf [4] library. It is made \navailable in a separate package named docling-parse and powers the default PDF backend \nin Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \nbe a safe backup choice in certain cases, e.g. if issues are seen with particular font \nencodings.\n
In the following example, the generated response is wrong, since the top retrieved chunks do not contain all the information that is required to answer the question.
In\u00a0[20]: Copied!QUERY = \"According to the tests with arXiv and IBM Redbooks, which backend should I use if I have limited resources and complex tables?\"\nquery_rrf = index_hybrid.as_query_engine(\n vector_store_query_mode=VectorStoreQueryMode.HYBRID,\n llm=GEN_MODEL,\n similarity_top_k=3,\n)\nres = query_rrf.query(QUERY)\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n QUERY = \"According to the tests with arXiv and IBM Redbooks, which backend should I use if I have limited resources and complex tables?\" query_rrf = index_hybrid.as_query_engine( vector_store_query_mode=VectorStoreQueryMode.HYBRID, llm=GEN_MODEL, similarity_top_k=3, ) res = query_rrf.query(QUERY) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \nhave limited resources and complex tables?\n\ud83e\udd16: According to the tests in this section using both the MacBook Pro M3 Max and \nbare-metal server running Ubuntu 20.04 LTS on an Intel Xeon E5-2690 CPU with a fixed \nthread budget of 4, Docling achieved faster processing speeds when using the \ncustom-built PDF backend based on the low-level qpdf library (docling-parse) compared to\nthe alternative PDF backend relying on pypdfium.\n\nFurthermore, the context mentions that Docling provides a separate package named \ndocling-ibm-models which includes pre-trained weights and inference code for \nTableFormer, a state-of-the-art table structure recognition model. This suggests that if\nyou have complex tables in your documents, using this specialized table recognition \nmodel could be beneficial.\n\nTherefore, based on the tests with arXiv papers and IBM Redbooks, if you have limited \nresources (likely referring to computational power) and need to process documents \ncontaining complex tables, it would be recommended to use the docling-parse PDF backend \nalong with the TableFormer AI model from docling-ibm-models. This combination should \nprovide a good balance of performance and table recognition capabilities for your \nspecific needs.\nIn\u00a0[21]: Copied!
nodes = retriever_rrf.retrieve(QUERY)\nfor idx, item in enumerate(nodes):\n console.print(\n f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n )\n nodes = retriever_rrf.retrieve(QUERY) for idx, item in enumerate(nodes): console.print( f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\" ) *** k=1 ***\nIn this section, we establish some reference numbers for the processing speed of Docling\nand the resource budget it requires. All tests in this section are run with default \noptions on our standard test set distributed with Docling, which consists of three \npapers from arXiv and two IBM Redbooks, with a total of 225 pages. Measurements were \ntaken using both available PDF backends on two different hardware systems: one MacBook \nPro M3 Max, and one bare-metal server running Ubuntu 20.04 LTS on an Intel Xeon E5-2690 \nCPU. For reproducibility, we fixed the thread budget (through setting OMP NUM THREADS \nenvironment variable ) once to 4 (Docling default) and once to 16 (equal to full core \ncount on the test hardware). All results are shown in Table 1.\n
*** k=2 ***\nWe therefore decided to provide multiple backend choices, and additionally open-source a\ncustombuilt PDF parser, which is based on the low-level qpdf [4] library. It is made \navailable in a separate package named docling-parse and powers the default PDF backend \nin Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \nbe a safe backup choice in certain cases, e.g. if issues are seen with particular font \nencodings.\n
*** k=3 ***\nAs part of Docling, we initially release two highly capable AI models to the open-source\ncommunity, which have been developed and published recently by our team. The first model\nis a layout analysis model, an accurate object-detector for page elements [13]. The \nsecond model is TableFormer [12, 9], a state-of-the-art table structure recognition \nmodel. We provide the pre-trained weights (hosted on huggingface) and a separate package\nfor the inference code as docling-ibm-models . Both models are also powering the \nopen-access deepsearch-experience, our cloud-native service for knowledge exploration \ntasks.\n
Even though the top retrieved chunks are relevant for the question, the key information lies in the paragraph after the first chunk:
If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it will come at the expense of worse quality results, especially in table structure recovery.
We next examine the fragments that immediately precede and follow the top\u2011retrieved chunk, so long as those neighbors remain within the same section, to preserve the semantic integrity of the context. The generated answer is now accurate because it has been grounded in the necessary contextual information.
\ud83d\udca1 In a production setting, it may be preferable to persist the parsed documents (i.e., DoclingDocument objects) as JSON in an object store or database and then fetch them when you need to traverse the document for context\u2011expansion scenarios. In this simplified example, however, we will query the OpenSearch index directly to obtain the required chunks.
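For the persistence option mentioned in the tip, one possible way to round-trip a DoclingDocument through JSON could look like the following sketch (the file path and variable names are illustrative; an object store or database would replace the local file in production):

```python
import json

from docling_core.types.doc import DoclingDocument

# Persist the parsed document; a local file stands in for an object store here.
doc = result.document  # a DoclingDocument from a previous Docling conversion
with open("parsed_doc.json", "w", encoding="utf-8") as f:
    json.dump(doc.export_to_dict(), f)

# Later, fetch and re-hydrate it to traverse the document for context expansion.
with open("parsed_doc.json", encoding="utf-8") as f:
    restored = DoclingDocument.model_validate(json.load(f))
print(restored.name)
```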
top_headings = nodes[0].metadata[\"headings\"]\ntop_text = nodes[0].text\n\nrdr = ElasticsearchReader(endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX)\ndocs = rdr.load_data(\n field=text_field,\n query={\n \"query\": {\n \"terms_set\": {\n \"metadata.headings.keyword\": {\n \"terms\": top_headings,\n \"minimum_should_match_script\": {\"source\": \"params.num_terms\"},\n }\n }\n }\n },\n)\next_nodes = []\nfor idx, item in enumerate(docs):\n if item.text == top_text:\n ext_nodes.append(NodeWithScore(node=Node(text=item.text), score=1.0))\n if idx > 0:\n ext_nodes.append(\n NodeWithScore(node=Node(text=docs[idx - 1].text), score=1.0)\n )\n if idx < len(docs) - 1:\n ext_nodes.append(\n NodeWithScore(node=Node(text=docs[idx + 1].text), score=1.0)\n )\n break\n\nsynthesizer = get_response_synthesizer(llm=GEN_MODEL)\nres = synthesizer.synthesize(query=QUERY, nodes=ext_nodes)\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n top_headings = nodes[0].metadata[\"headings\"] top_text = nodes[0].text rdr = ElasticsearchReader(endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX) docs = rdr.load_data( field=text_field, query={ \"query\": { \"terms_set\": { \"metadata.headings.keyword\": { \"terms\": top_headings, \"minimum_should_match_script\": {\"source\": \"params.num_terms\"}, } } } }, ) ext_nodes = [] for idx, item in enumerate(docs): if item.text == top_text: ext_nodes.append(NodeWithScore(node=Node(text=item.text), score=1.0)) if idx > 0: ext_nodes.append( NodeWithScore(node=Node(text=docs[idx - 1].text), score=1.0) ) if idx < len(docs) - 1: ext_nodes.append( NodeWithScore(node=Node(text=docs[idx + 1].text), score=1.0) ) break synthesizer = get_response_synthesizer(llm=GEN_MODEL) res = synthesizer.synthesize(query=QUERY, nodes=ext_nodes) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \nhave limited resources and complex tables?\n\ud83e\udd16: According to the tests described in the provided context, if you need to run Docling\nin a very low-resource environment and are dealing with complex tables that require \nhigh-quality table structure recovery, you should consider configuring the pypdfium \nbackend. The context mentions that while it is faster and more memory efficient than the\ndefault docling-parse backend, it may come at the expense of worse quality results, \nespecially in table structure recovery. Therefore, for limited resources and complex \ntables where quality is crucial, pypdfium would be a suitable choice despite its \npotential drawbacks compared to the default backend.\n"},{"location":"examples/rag_opensearch/#rag-with-opensearch","title":"RAG with OpenSearch\u00b6","text":"Step Tech Execution Embedding HuggingFace (IBM Granite Embedding 30M) \ud83d\udcbb Local Vector store OpenSearch 3.0.0 \ud83d\udcbb Local Gen AI Ollama (IBM Granite 4.0 Tiny) \ud83d\udcbb Local
This is a code recipe that uses OpenSearch, an open-source search and analytics tool, and the LlamaIndex framework to perform RAG over documents parsed by Docling.
In this notebook, we accomplish the following: parsing a sample PDF with Docling, chunking it with the hybrid chunker, embedding and indexing the chunks in a local OpenSearch instance, and running RAG queries (including hybrid search with RRF) through LlamaIndex with a local Ollama model.
For running this notebook on your machine, you can use applications like Jupyter Notebook or Visual Studio Code.
\ud83d\udca1 For best results, please use GPU acceleration to run this notebook.
"},{"location":"examples/rag_opensearch/#virtual-environment","title":"Virtual environment\u00b6","text":"Before installing dependencies and to avoid conflicts in your environment, it is advisable to use a virtual environment (venv). For instance, uv is a popular tool to manage virtual environments and dependencies. You can install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh\n
Then create the virtual environment and activate it:
uv venv\n source .venv/bin/activate\n
Refer to Installing uv for more details.
"},{"location":"examples/rag_opensearch/#dependencies","title":"Dependencies\u00b6","text":"To start, install the required dependencies by running the following command:
"},{"location":"examples/rag_opensearch/#gpu-checking","title":"GPU Checking\u00b6","text":""},{"location":"examples/rag_opensearch/#local-opensearch-instance","title":"Local OpenSearch instance\u00b6","text":"To run the notebook locally, we can pull an OpenSearch image and run a single node for local development. You can use a container tool like Podman or Docker. In the interest of simplicity, we disable the SSL option for this example.
\ud83d\udca1 The version of the OpenSearch instance needs to be compatible with the version of the OpenSearch Python Client library, since this library is used by the LlamaIndex framework, which we leverage in this notebook.
On your computer terminal run:
podman run \\\n -it \\\n --pull always \\\n -p 9200:9200 \\\n -p 9600:9600 \\\n -e \"discovery.type=single-node\" \\\n -e DISABLE_INSTALL_DEMO_CONFIG=true \\\n -e DISABLE_SECURITY_PLUGIN=true \\\n --name opensearch-node \\\n -d opensearchproject/opensearch:3.0.0\n
Once the instance is running, verify that you can connect to OpenSearch:
"},{"location":"examples/rag_opensearch/#language-models","title":"Language models\u00b6","text":"We will use HuggingFace and Ollama to run language models on your local computer, rather than relying on cloud services.
In this example, the following models are considered: ibm-granite/granite-embedding-30m-english (via HuggingFace) for embeddings, and granite4:tiny-h (via Ollama) for answer generation.
Once Ollama is installed on your computer, you can pull the model above from your terminal:
ollama pull granite4:tiny-h\n"},{"location":"examples/rag_opensearch/#setup","title":"Setup\u00b6","text":"
We setup the main variables for OpenSearch and the embedding and generation models.
"},{"location":"examples/rag_opensearch/#process-data-using-docling","title":"Process Data Using Docling\u00b6","text":"Docling can parse various document formats into a unified representation (DoclingDocument), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to Supported formats section of Docling's documentation.
"},{"location":"examples/rag_opensearch/#run-the-document-conversion-pipeline","title":"Run the document conversion pipeline\u00b6","text":"We will convert the original PDF file into a DoclingDocument format using a DoclingReader object. We specify the JSON export type to retain the document hierarchical structure as an input for the next step (chunking the document).
Before the actual ingestion of data, we need to define the data transformations to apply on the DoclingDocument:
DoclingNodeParser executes the document-based chunking with the hybrid chunker, which leverages the tokenizer of the embedding model to ensure that the resulting chunks fit within the model input text limit. MetadataTransform is a custom transformation to ensure that generated chunk metadata is best formatted for indexing with OpenSearch. \ud83d\udca1 For demonstration purposes, we configure the hybrid chunker to produce chunks capped at 200 tokens. The optimal limit will vary according to the specific requirements of the AI application in question. If this value is omitted, the chunker automatically derives the maximum size from the tokenizer. This safeguard guarantees that each chunk remains within the bounds supported by the underlying embedding model.
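A sketch of this transformation setup under the assumptions above (the notebook's custom MetadataTransform component is only referenced, not reimplemented, here):

from docling.chunking import HybridChunker
from llama_index.node_parser.docling import DoclingNodeParser

# Hybrid chunker bound to the embedding model's tokenizer, capped at 200 tokens for the demo
chunker = HybridChunker(
    tokenizer="ibm-granite/granite-embedding-30m-english",  # assumption: same model as EMBED_MODEL
    max_tokens=200,
)
node_parser = DoclingNodeParser(chunker=chunker)

# The notebook also appends its custom MetadataTransform, e.g.:
# transformations = [node_parser, MetadataTransform()]
transformations = [node_parser]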
"},{"location":"examples/rag_opensearch/#embed-and-insert-the-data","title":"Embed and Insert the Data\u00b6","text":"In this step, we create an OpenSearchVectorClient, which encapsulates the logic for a single OpenSearch index with vector search enabled.
We then initialize the index using our sample data (a single PDF file), the Docling node parser, and the OpenSearch client that we just created.
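A sketch of these two steps, assuming the variables defined above (the field names and the 384-dimension value, which should match the embedding size of the Granite 30M model, are assumptions):

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.opensearch import (
    OpensearchVectorClient,
    OpensearchVectorStore,
)

text_field = "content"
embedding_field = "embedding"

# Client encapsulating a single OpenSearch index with vector search enabled
osearch_client = OpensearchVectorClient(
    endpoint=OPENSEARCH_ENDPOINT,
    index=OPENSEARCH_INDEX,
    dim=384,  # assumption: output dimension of the embedding model
    embedding_field=embedding_field,
    text_field=text_field,
)
vector_store = OpensearchVectorStore(osearch_client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Chunk, embed, and index the converted document
index = VectorStoreIndex.from_documents(
    documents=documents,
    transformations=transformations,
    storage_context=storage_context,
    embed_model=EMBED_MODEL,
)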
\ud83d\udca1 You may get a warning message like:
Token indices sequence length is longer than the specified maximum sequence length for this model
This is a false alarm; you can find more background in Docling's FAQ page.
"},{"location":"examples/rag_opensearch/#build-rag","title":"Build RAG\u00b6","text":"In this section, we will see how to assemble a RAG system, execute a query, and get a generated response.
We will also describe how to leverage Docling capabilities to improve RAG results.
"},{"location":"examples/rag_opensearch/#run-a-query","title":"Run a query\u00b6","text":"With LlamaIndex's query engine, we can simply run a RAG system as follows:
"},{"location":"examples/rag_opensearch/#custom-serializers","title":"Custom serializers\u00b6","text":"Docling can extract the table content and process it for chunking, like other text elements.
In the following example, the response is generated from a retrieved chunk containing a table.
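As a brief sketch, reusing the query engine from above with a hypothetical question that targets tabular content (the question is an assumption, not the one used in the notebook):

table_query = "How do the docling-parse and pypdfium backends compare in speed and memory usage?"  # hypothetical
table_result = query_engine.query(table_query)
print(table_result.response.strip())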
"},{"location":"examples/rag_opensearch/#filter-context-query","title":"Filter-context Query\u00b6","text":"By default, the DoclingNodeParser will keep the hierarchical information of items when creating the chunks. That information will be stored as metadata in the OpenSearch index. Leveraging the document structure is a powerful feature of Docling for improving RAG systems, both for retrieval and for answer generation.
For example, we can use chunk metadata with layout information to run queries in a filter context, for high retrieval accuracy.
Using the previous setup, we can see that the most similar chunk corresponds to a paragraph without enough grounding for the question:
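One way to sketch such a filtered retrieval is with LlamaIndex metadata filters (the metadata key and the heading value below are hypothetical and only for illustration; the notebook's actual filter may differ):

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Only consider chunks whose "headings" metadata matches a given section heading
filtered_retriever = index.as_retriever(
    similarity_top_k=3,
    filters=MetadataFilters(filters=[ExactMatchFilter(key="headings", value="5 Applications")]),
)
nodes = filtered_retriever.retrieve(QUERY)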
"},{"location":"examples/rag_opensearch/#hybrid-search-retrieval-with-rrf","title":"Hybrid Search Retrieval with RRF\u00b6","text":"Hybrid search combines keyword and semantic search to improve search relevance. To avoid relying on traditional score normalization techniques, the reciprocal rank fusion (RRF) feature on hybrid search can significantly improve the relevance of the retrieved chunks in our RAG system.
First, create a search pipeline and specify RRF as technique:
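The pipeline-creation cell is not shown here; as a sketch using the opensearch-py client created earlier (the pipeline name is an assumption, and the processor body follows OpenSearch's documented RRF phase-results processor):

# Register a search pipeline that combines lexical and vector results with RRF
rrf_pipeline = {
    "description": "Hybrid search with reciprocal rank fusion",
    "phase_results_processors": [
        {"score-ranker-processor": {"combination": {"technique": "rrf"}}}
    ],
}
client.transport.perform_request(
    "PUT", "/_search/pipeline/rrf-pipeline", body=rrf_pipeline
)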
"},{"location":"examples/rag_opensearch/#context-expansion","title":"Context expansion\u00b6","text":"Using small chunks can offer several benefits: it increases retrieval precision and it keeps the answer generation tightly focused, which improves accuracy, reduces hallucination, and speeds up inferece. However, your RAG system may overlook contextual information necessary for producing a fully grounded response.
Docling's preservation of document structure enables you to employ various strategies for enriching the context available during answer generation within the RAG pipeline. For example, after identifying the most relevant chunk, you might include adjacent chunks from the same section as additional groudning material before generating the final answer.
"},{"location":"examples/rag_weaviate/","title":"RAG with Weaviate","text":"Step Tech Execution Embedding Open AI \ud83c\udf10 Remote Vector store Weavieate \ud83d\udcbb Local Gen AI Open AI \ud83c\udf10 Remote In\u00a0[\u00a0]: Copied!%%capture\n%pip install docling~=\"2.7.0\"\n%pip install -U weaviate-client~=\"4.9.4\"\n%pip install rich\n%pip install torch\n\nimport logging\nimport warnings\n\nwarnings.filterwarnings(\"ignore\")\n\n# Suppress Weaviate client logs\nlogging.getLogger(\"weaviate\").setLevel(logging.ERROR)\n%%capture %pip install docling~=\"2.7.0\" %pip install -U weaviate-client~=\"4.9.4\" %pip install rich %pip install torch import logging import warnings warnings.filterwarnings(\"ignore\") # Suppress Weaviate client logs logging.getLogger(\"weaviate\").setLevel(logging.ERROR) In\u00a0[2]: Copied!
import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n import torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) MPS GPU is enabled.\n
Here, we've collected 10 influential machine learning papers published as PDFs on arXiv. Because Docling does not yet have title extraction for PDFs, we manually add the titles in a corresponding list.
Note: Converting all 10 papers should take around 8 minutes with a T4 GPU.
In\u00a0[3]: Copied!# Influential machine learning papers\nsource_urls = [\n \"https://arxiv.org/pdf/1706.03762\",\n \"https://arxiv.org/pdf/1810.04805\",\n \"https://arxiv.org/pdf/1406.2661\",\n \"https://arxiv.org/pdf/1409.0473\",\n \"https://arxiv.org/pdf/1412.6980\",\n \"https://arxiv.org/pdf/1312.6114\",\n \"https://arxiv.org/pdf/1312.5602\",\n \"https://arxiv.org/pdf/1512.03385\",\n \"https://arxiv.org/pdf/1409.3215\",\n \"https://arxiv.org/pdf/1301.3781\",\n]\n\n# And their corresponding titles (because Docling doesn't have title extraction yet!)\nsource_titles = [\n \"Attention Is All You Need\",\n \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\",\n \"Generative Adversarial Nets\",\n \"Neural Machine Translation by Jointly Learning to Align and Translate\",\n \"Adam: A Method for Stochastic Optimization\",\n \"Auto-Encoding Variational Bayes\",\n \"Playing Atari with Deep Reinforcement Learning\",\n \"Deep Residual Learning for Image Recognition\",\n \"Sequence to Sequence Learning with Neural Networks\",\n \"A Neural Probabilistic Language Model\",\n]\n# Influential machine learning papers source_urls = [ \"https://arxiv.org/pdf/1706.03762\", \"https://arxiv.org/pdf/1810.04805\", \"https://arxiv.org/pdf/1406.2661\", \"https://arxiv.org/pdf/1409.0473\", \"https://arxiv.org/pdf/1412.6980\", \"https://arxiv.org/pdf/1312.6114\", \"https://arxiv.org/pdf/1312.5602\", \"https://arxiv.org/pdf/1512.03385\", \"https://arxiv.org/pdf/1409.3215\", \"https://arxiv.org/pdf/1301.3781\", ] # And their corresponding titles (because Docling doesn't have title extraction yet!) source_titles = [ \"Attention Is All You Need\", \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\", \"Generative Adversarial Nets\", \"Neural Machine Translation by Jointly Learning to Align and Translate\", \"Adam: A Method for Stochastic Optimization\", \"Auto-Encoding Variational Bayes\", \"Playing Atari with Deep Reinforcement Learning\", \"Deep Residual Learning for Image Recognition\", \"Sequence to Sequence Learning with Neural Networks\", \"A Neural Probabilistic Language Model\", ] In\u00a0[4]: Copied!
from docling.document_converter import DocumentConverter\n\n# Instantiate the doc converter\ndoc_converter = DocumentConverter()\n\n# Directly pass list of files or streams to `convert_all`\nconv_results_iter = doc_converter.convert_all(source_urls) # previously `convert`\n\n# Iterate over the generator to get a list of Docling documents\ndocs = [result.document for result in conv_results_iter]\nfrom docling.document_converter import DocumentConverter # Instantiate the doc converter doc_converter = DocumentConverter() # Directly pass list of files or streams to `convert_all` conv_results_iter = doc_converter.convert_all(source_urls) # previously `convert` # Iterate over the generator to get a list of Docling documents docs = [result.document for result in conv_results_iter]
Fetching 9 files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 9/9 [00:00<00:00, 84072.91it/s]\n
ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS\nIn\u00a0[5]: Copied!
from docling_core.transforms.chunker import HierarchicalChunker\n\n# Initialize lists for text, and titles\ntexts, titles = [], []\n\nchunker = HierarchicalChunker()\n\n# Process each document in the list\nfor doc, title in zip(docs, source_titles): # Pair each document with its title\n chunks = list(\n chunker.chunk(doc)\n ) # Perform hierarchical chunking and get text from chunks\n for chunk in chunks:\n texts.append(chunk.text)\n titles.append(title)\nfrom docling_core.transforms.chunker import HierarchicalChunker # Initialize lists for text, and titles texts, titles = [], [] chunker = HierarchicalChunker() # Process each document in the list for doc, title in zip(docs, source_titles): # Pair each document with its title chunks = list( chunker.chunk(doc) ) # Perform hierarchical chunking and get text from chunks for chunk in chunks: texts.append(chunk.text) titles.append(title)
Because we're splitting the documents into chunks, we'll concatenate the article title to the beginning of each chunk for additional context.
In\u00a0[6]: Copied!# Concatenate title and text\nfor i in range(len(texts)):\n    texts[i] = f\"{titles[i]} {texts[i]}\"\n # Concatenate title and text for i in range(len(texts)): texts[i] = f\"{titles[i]} {texts[i]}\" We'll be using the OpenAI API for both generating the text embeddings and for the generative model in our RAG pipeline. The code below dynamically fetches your API key based on whether you're running this notebook in Google Colab or as a regular Jupyter notebook. All you need to do is replace openai_api_key_var with the name of your environment variable or Colab secret for the API key.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
In\u00a0[7]: Copied!# OpenAI API key variable name\nopenai_api_key_var = \"OPENAI_API_KEY\" # Replace with the name of your secret/env var\n\n# Fetch OpenAI API key\ntry:\n # If running in Colab, fetch API key from Secrets\n import google.colab\n from google.colab import userdata\n\n openai_api_key = userdata.get(openai_api_key_var)\n if not openai_api_key:\n raise ValueError(f\"Secret '{openai_api_key_var}' not found in Colab secrets.\")\nexcept ImportError:\n # If not running in Colab, fetch API key from environment variable\n import os\n\n openai_api_key = os.getenv(openai_api_key_var)\n if not openai_api_key:\n raise OSError(\n f\"Environment variable '{openai_api_key_var}' is not set. \"\n \"Please define it before running this script.\"\n )\n # OpenAI API key variable name openai_api_key_var = \"OPENAI_API_KEY\" # Replace with the name of your secret/env var # Fetch OpenAI API key try: # If running in Colab, fetch API key from Secrets import google.colab from google.colab import userdata openai_api_key = userdata.get(openai_api_key_var) if not openai_api_key: raise ValueError(f\"Secret '{openai_api_key_var}' not found in Colab secrets.\") except ImportError: # If not running in Colab, fetch API key from environment variable import os openai_api_key = os.getenv(openai_api_key_var) if not openai_api_key: raise OSError( f\"Environment variable '{openai_api_key_var}' is not set. \" \"Please define it before running this script.\" ) Embedded Weaviate allows you to spin up a Weaviate instance directly from your application code, without having to use a Docker container. If you're interested in other deployment methods, like using Docker-Compose or Kubernetes, check out this page in the Weaviate docs.
In\u00a0[\u00a0]: Copied!import weaviate\n\n# Connect to Weaviate embedded\nclient = weaviate.connect_to_embedded(headers={\"X-OpenAI-Api-Key\": openai_api_key})\n import weaviate # Connect to Weaviate embedded client = weaviate.connect_to_embedded(headers={\"X-OpenAI-Api-Key\": openai_api_key}) In\u00a0[\u00a0]: Copied! import weaviate.classes.config as wc\n\n# Define the collection name\ncollection_name = \"docling\"\n\n# Delete the collection if it already exists\nif client.collections.exists(collection_name):\n client.collections.delete(collection_name)\n\n# Create the collection\ncollection = client.collections.create(\n name=collection_name,\n vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(\n model=\"text-embedding-3-large\", # Specify your embedding model here\n ),\n # Enable generative model from Cohere\n generative_config=wc.Configure.Generative.openai(\n model=\"gpt-4o\" # Specify your generative model for RAG here\n ),\n # Define properties of metadata\n properties=[\n wc.Property(name=\"text\", data_type=wc.DataType.TEXT),\n wc.Property(name=\"title\", data_type=wc.DataType.TEXT, skip_vectorization=True),\n ],\n)\nimport weaviate.classes.config as wc # Define the collection name collection_name = \"docling\" # Delete the collection if it already exists if client.collections.exists(collection_name): client.collections.delete(collection_name) # Create the collection collection = client.collections.create( name=collection_name, vectorizer_config=wc.Configure.Vectorizer.text2vec_openai( model=\"text-embedding-3-large\", # Specify your embedding model here ), # Enable generative model from Cohere generative_config=wc.Configure.Generative.openai( model=\"gpt-4o\" # Specify your generative model for RAG here ), # Define properties of metadata properties=[ wc.Property(name=\"text\", data_type=wc.DataType.TEXT), wc.Property(name=\"title\", data_type=wc.DataType.TEXT, skip_vectorization=True), ], ) In\u00a0[10]: Copied!
# Initialize the data object\ndata = []\n\n# Create a dictionary for each row by iterating through the corresponding lists\nfor text, title in zip(texts, titles):\n data_point = {\n \"text\": text,\n \"title\": title,\n }\n data.append(data_point)\n # Initialize the data object data = [] # Create a dictionary for each row by iterating through the corresponding lists for text, title in zip(texts, titles): data_point = { \"text\": text, \"title\": title, } data.append(data_point) In\u00a0[\u00a0]: Copied! # Insert text chunks and metadata into vector DB collection\nresponse = collection.data.insert_many(data)\n\nif response.has_errors:\n print(response.errors)\nelse:\n print(\"Insert complete.\")\n# Insert text chunks and metadata into vector DB collection response = collection.data.insert_many(data) if response.has_errors: print(response.errors) else: print(\"Insert complete.\") In\u00a0[12]: Copied!
from weaviate.classes.query import MetadataQuery\n\nresponse = collection.query.near_text(\n query=\"bert\",\n limit=2,\n return_metadata=MetadataQuery(distance=True),\n return_properties=[\"text\", \"title\"],\n)\n\nfor o in response.objects:\n print(o.properties)\n print(o.metadata.distance)\nfrom weaviate.classes.query import MetadataQuery response = collection.query.near_text( query=\"bert\", limit=2, return_metadata=MetadataQuery(distance=True), return_properties=[\"text\", \"title\"], ) for o in response.objects: print(o.properties) print(o.metadata.distance)
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding A distinctive feature of BERT is its unified architecture across different tasks. There is mini-', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}\n0.6578550338745117\n{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding We introduce a new language representation model called BERT , which stands for B idirectional E ncoder R epresentations from T ransformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}\n0.6696287989616394\n In\u00a0[13]: Copied! from rich.console import Console\nfrom rich.panel import Panel\n\n# Create a prompt where context from the Weaviate collection will be injected\nprompt = \"Explain how {text} works, using only the retrieved context.\"\nquery = \"bert\"\n\nresponse = collection.generate.near_text(\n query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"]\n)\n\n# Prettify the output using Rich\nconsole = Console()\n\nconsole.print(\n Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\")\n)\nconsole.print(\n Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\")\n)\n from rich.console import Console from rich.panel import Panel # Create a prompt where context from the Weaviate collection will be injected prompt = \"Explain how {text} works, using only the retrieved context.\" query = \"bert\" response = collection.generate.near_text( query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"] ) # Prettify the output using Rich console = Console() console.print( Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\") ) console.print( Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\") ) \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Explain how bert works, using only the retrieved context. 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation \u2502\n\u2502 model designed to pretrain deep bidirectional representations from unlabeled text. It conditions on both left \u2502\n\u2502 and right context in all layers, unlike traditional left-to-right or right-to-left language models. This \u2502\n\u2502 pre-training involves two unsupervised tasks. The pre-trained BERT model can then be fine-tuned with just one \u2502\n\u2502 additional output layer to create state-of-the-art models for various tasks, such as question answering and \u2502\n\u2502 language inference, without needing substantial task-specific architecture modifications. A distinctive feature \u2502\n\u2502 of BERT is its unified architecture across different tasks. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nIn\u00a0[14]: Copied!
# Create a prompt where context from the Weaviate collection will be injected\nprompt = \"Explain how {text} works, using only the retrieved context.\"\nquery = \"a generative adversarial net\"\n\nresponse = collection.generate.near_text(\n query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"]\n)\n\n# Prettify the output using Rich\nconsole = Console()\n\nconsole.print(\n Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\")\n)\nconsole.print(\n Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\")\n)\n # Create a prompt where context from the Weaviate collection will be injected prompt = \"Explain how {text} works, using only the retrieved context.\" query = \"a generative adversarial net\" response = collection.generate.near_text( query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"] ) # Prettify the output using Rich console = Console() console.print( Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\") ) console.print( Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\") ) \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Explain how a generative adversarial net works, using only the retrieved context. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Generative Adversarial Nets (GANs) operate within an adversarial framework where two models are trained \u2502\n\u2502 simultaneously: a generative model (G) and a discriminative model (D). The generative model aims to capture the \u2502\n\u2502 data distribution and generate samples that mimic real data, while the discriminative model's task is to \u2502\n\u2502 distinguish between samples from the real data and those generated by G. This setup is akin to a game where the \u2502\n\u2502 generative model acts like counterfeiters trying to produce indistinguishable fake currency, and the \u2502\n\u2502 discriminative model acts like the police trying to detect these counterfeits. \u2502\n\u2502 \u2502\n\u2502 The training process involves a minimax two-player game where G tries to maximize the probability of D making a \u2502\n\u2502 mistake, while D tries to minimize it. When both models are defined by multilayer perceptrons, they can be \u2502\n\u2502 trained using backpropagation without the need for Markov chains or approximate inference networks. The \u2502\n\u2502 ultimate goal is for G to perfectly replicate the training data distribution, making D's output equal to 1/2 \u2502\n\u2502 everywhere, indicating it cannot distinguish between real and generated data. This framework allows for \u2502\n\u2502 specific training algorithms and optimization techniques, such as backpropagation and dropout, to be \u2502\n\u2502 effectively utilized. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
We can see that our RAG pipeline performs relatively well for simple queries, especially given the small size of the dataset. Scaling this method for converting a larger sample of PDFs would require more compute (GPUs) and a more advanced deployment of Weaviate (like Docker, Kubernetes, or Weaviate Cloud). For more information on available Weaviate configurations, check out the documentation.
"},{"location":"examples/rag_weaviate/#rag-with-weaviate","title":"RAG with Weaviate\u00b6","text":""},{"location":"examples/rag_weaviate/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"This is a code recipe that uses Weaviate to perform RAG over PDF documents parsed by Docling.
In this notebook, we accomplish the following:
To run this notebook, you'll need:
Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running this notebook:
Note: If Colab prompts you to restart the session after running the cell below, click \"restart\" and proceed with running the rest of the notebook.
"},{"location":"examples/rag_weaviate/#part-1-docling","title":"\ud83d\udc25 Part 1: Docling\u00b6","text":"Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
"},{"location":"examples/rag_weaviate/#convert-pdfs-to-docling-documents","title":"Convert PDFs to Docling documents\u00b6","text":"Here we use Docling's .convert_all() to parse a batch of PDFs. The result is a list of Docling documents that we can use for text extraction.
Note: Please ignore the ERR# message.
We use Docling's HierarchicalChunker() to perform hierarchy-aware chunking of our list of documents. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.
Transform our data from lists to a list of dictionaries for insertion into our Weaviate collection.
"},{"location":"examples/rag_weaviate/#insert-data-into-weaviate-and-generate-embeddings","title":"Insert data into Weaviate and generate embeddings\u00b6","text":"Embeddings will be generated upon insertion to our Weaviate collection.
"},{"location":"examples/rag_weaviate/#query-the-data","title":"Query the data\u00b6","text":"Here, we perform a simple similarity search to return the most similar embedded chunks to our search query.
"},{"location":"examples/rag_weaviate/#perform-rag-on-parsed-articles","title":"Perform RAG on parsed articles\u00b6","text":"Weaviate's generate module allows you to perform RAG over your embedded data without having to use a separate framework.
We specify a prompt that includes the field we want to search through in the database (in this case it's text), a query that includes our search term, and the number of retrieved results to use in the generation.
Use RapidOCR with custom ONNX models to OCR a PDF page and print Markdown.
What this example does
RapidOcrOptions with explicit det/rec/cls model paths.Prerequisites
modelscope, and have network access to download models.docling and modelscope.How to run
python docs/examples/rapidocr_with_custom_models.py.Notes
source points to an arXiv PDF URL; replace with a local path if desired.~/.cache/modelscope); set a proxy or pre-download models if running in a restricted network environment.import os\n\nfrom modelscope import snapshot_download\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n\ndef main():\n # Source document to convert\n source = \"https://arxiv.org/pdf/2408.09869v4\"\n\n # Download RapidOCR models from Hugging Face\n print(\"Downloading RapidOCR models\")\n download_path = snapshot_download(repo_id=\"RapidAI/RapidOCR\")\n\n # Setup RapidOcrOptions for English detection\n det_model_path = os.path.join(\n download_path, \"onnx\", \"PP-OCRv5\", \"det\", \"ch_PP-OCRv5_server_det.onnx\"\n )\n rec_model_path = os.path.join(\n download_path, \"onnx\", \"PP-OCRv5\", \"rec\", \"ch_PP-OCRv5_rec_server_infer.onnx\"\n )\n cls_model_path = os.path.join(\n download_path, \"onnx\", \"PP-OCRv4\", \"cls\", \"ch_ppocr_mobile_v2.0_cls_infer.onnx\"\n )\n ocr_options = RapidOcrOptions(\n det_model_path=det_model_path,\n rec_model_path=rec_model_path,\n cls_model_path=cls_model_path,\n )\n\n pipeline_options = PdfPipelineOptions(\n ocr_options=ocr_options,\n )\n\n # Convert the document\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n ),\n },\n )\n\n conversion_result: ConversionResult = converter.convert(source=source)\n doc = conversion_result.document\n md = doc.export_to_markdown()\n print(md)\n\n\nif __name__ == \"__main__\":\n main()\n import os from modelscope import snapshot_download from docling.datamodel.base_models import InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions from docling.document_converter import DocumentConverter, PdfFormatOption def main(): # Source document to convert source = \"https://arxiv.org/pdf/2408.09869v4\" # Download RapidOCR models from Hugging Face print(\"Downloading RapidOCR models\") download_path = snapshot_download(repo_id=\"RapidAI/RapidOCR\") # Setup RapidOcrOptions for English detection det_model_path = os.path.join( download_path, \"onnx\", \"PP-OCRv5\", \"det\", \"ch_PP-OCRv5_server_det.onnx\" ) rec_model_path = os.path.join( download_path, \"onnx\", \"PP-OCRv5\", \"rec\", \"ch_PP-OCRv5_rec_server_infer.onnx\" ) cls_model_path = os.path.join( download_path, \"onnx\", \"PP-OCRv4\", \"cls\", \"ch_ppocr_mobile_v2.0_cls_infer.onnx\" ) ocr_options = RapidOcrOptions( det_model_path=det_model_path, rec_model_path=rec_model_path, cls_model_path=cls_model_path, ) pipeline_options = PdfPipelineOptions( ocr_options=ocr_options, ) # Convert the document converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ), }, ) conversion_result: ConversionResult = converter.convert(source=source) doc = conversion_result.document md = doc.export_to_markdown() print(md) if __name__ == \"__main__\": main()"},{"location":"examples/retrieval_qdrant/","title":"Retrieval with Qdrant","text":"Step Tech Execution Embedding FastEmbed \ud83d\udcbb Local Vector store Qdrant \ud83d\udcbb Local This example demonstrates using Docling with Qdrant to perform a hybrid search across your documents using dense and sparse vectors.
We'll chunk the documents using Docling before adding them to a Qdrant collection. By limiting the length of the chunks, we can preserve the meaning in each vector embedding.
fastembed-gpu package if you've got the hardware to support it.%pip install --no-warn-conflicts -q qdrant-client docling fastembed\n%pip install --no-warn-conflicts -q qdrant-client docling fastembed
Note: you may need to restart the kernel to use updated packages.\n
Let's import all the classes we'll be working with.
In\u00a0[2]: Copied!from qdrant_client import QdrantClient\n\nfrom docling.chunking import HybridChunker\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter\nfrom qdrant_client import QdrantClient from docling.chunking import HybridChunker from docling.datamodel.base_models import InputFormat from docling.document_converter import DocumentConverter
COLLECTION_NAME = \"docling\"\n\ndoc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML])\nclient = QdrantClient(location=\":memory:\")\n# The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI.\n# For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant\n# client = QdrantClient(location=\"http://localhost:6333\")\n\nclient.set_model(\"sentence-transformers/all-MiniLM-L6-v2\")\nclient.set_sparse_model(\"Qdrant/bm25\")\nCOLLECTION_NAME = \"docling\" doc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML]) client = QdrantClient(location=\":memory:\") # The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI. # For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant # client = QdrantClient(location=\"http://localhost:6333\") client.set_model(\"sentence-transformers/all-MiniLM-L6-v2\") client.set_sparse_model(\"Qdrant/bm25\")
/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n warnings.warn(\n
We can now download and chunk the document using Docling. For demonstration, we'll use an article about chunking strategies :)
In\u00a0[4]: Copied!result = doc_converter.convert(\n \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\"\n)\ndocuments, metadatas = [], []\nfor chunk in HybridChunker().chunk(result.document):\n documents.append(chunk.text)\n metadatas.append(chunk.meta.export_json_dict())\nresult = doc_converter.convert( \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\" ) documents, metadatas = [], [] for chunk in HybridChunker().chunk(result.document): documents.append(chunk.text) metadatas.append(chunk.meta.export_json_dict())
Let's now upload the documents to Qdrant.
add() method batches the documents and uses FastEmbed to generate vector embeddings on our machine._ = client.add(\n collection_name=COLLECTION_NAME,\n documents=documents,\n metadata=metadatas,\n batch_size=64,\n)\n_ = client.add( collection_name=COLLECTION_NAME, documents=documents, metadata=metadatas, batch_size=64, ) In\u00a0[6]: Copied!
points = client.query(\n collection_name=COLLECTION_NAME,\n query_text=\"Can I split documents?\",\n limit=10,\n)\npoints = client.query( collection_name=COLLECTION_NAME, query_text=\"Can I split documents?\", limit=10, ) In\u00a0[7]: Copied!
for i, point in enumerate(points):\n print(f\"=== {i} ===\")\n print(point.document)\n print()\n for i, point in enumerate(points): print(f\"=== {i} ===\") print(point.document) print() === 0 ===\nHave you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:\n1. We start at the top of the document, treating the first part as a chunk.\n\u00a0\u00a0\u00a02. We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.\n \u00a0\u00a0\u00a03. We keep this up until we reach the end of the document.\nThe ultimate dream? Having an agent do this for you. But slow down! This approach is still being tested and isn't quite ready for the big leagues due to the time it takes to process multiple LLM calls and the cost of those calls. There's no implementation available in public libraries just yet. However, Greg Kamradt has his version available here.\n\n=== 1 ===\nDocument Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.\nDocument Specific Chunking can handle a variety of document formats, such as:\nMarkdown\nHTML\nPython\netc\nHere we\u2019ll take Markdown as our example and use a modified version of our first sample text:\n\u200d\nThe result is the following:\nYou can see here that with a chunk size of 105, the Markdown structure of the document is taken into account, and the chunks thus preserve the semantics of the text!\n\n=== 2 ===\nAnd there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.\nTo put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.\nBy the way, if you're eager to experiment with your own examples using the chunking visualisation tool featured in this blog, feel free to give it a try! You can access it right here. Enjoy, and happy chunking! \ud83d\ude09\n\n=== 3 ===\nRetrieval Augmented Generation (RAG) has been a hot topic in understanding, interpreting, and generating text with AI for the last few months. 
It's like a wonderful union of retrieval-based and generative models, creating a playground for researchers, data scientists, and natural language processing enthusiasts, like you and me.\nTo truly control the results produced by our RAG, we need to understand chunking strategies and their role in the process of retrieving and generating text. Indeed, each chunking strategy enhances RAG's effectiveness in its unique way.\nThe goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.\nLet's explore some chunking strategies together.\nThe methods mentioned in the article you're about to read usually make use of two key parameters. First, we have [chunk_size]\u2014 which controls the size of your text chunks. Then there's [chunk_overlap], which takes care of how much text overlaps between one chunk and the next.\n\n=== 4 ===\nSemantic Chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach ensures the information's integrity during retrieval, leading to a more accurate and contextually appropriate outcome.\nSemantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.\nBy focusing on the text's meaning and context, Semantic Chunking significantly enhances the quality of retrieval. It's a top-notch choice when maintaining the semantic integrity of the text is vital.\nHowever, this method does require more effort and is notably slower than the previous ones.\nOn our example text, since it is quite short and does not expose varied subjects, this method would only generate a single chunk.\n\n=== 5 ===\nLanguage models used in the rest of your possible RAG pipeline have a token limit, which should not be exceeded. When dividing your text into chunks, it's advisable to count the number of tokens. Plenty of tokenizers are available. To ensure accuracy, use the same tokenizer for counting tokens as the one used in the language model.\nConsequently, there are also splitters available for this purpose.\nFor instance, by using the [SpacyTextSplitter] from LangChain, the following chunks are created:\n\u200d\n\n=== 6 ===\nFirst things first, we have Character Chunking. This strategy divides the text into chunks based on a fixed number of characters. Its simplicity makes it a great starting point, but it can sometimes disrupt the text's flow, breaking sentences or words in unexpected places. Despite its limitations, it's a great stepping stone towards more advanced methods.\nNow let\u2019s see that in action with an example. Imagine a text that reads:\nIf we decide to set our chunk size to 100 and no chunk overlap, we'd end up with the following chunks. 
As you can see, Character Chunking can lead to some intriguing, albeit sometimes nonsensical, results, cutting some of the sentences in their middle.\nBy choosing a smaller chunk size, \u00a0we would obtain more chunks, and by setting a bigger chunk overlap, we could obtain something like this:\n\u200d\nAlso, by default this method creates chunks character by character based on the empty character [\u2019 \u2019]. But you can specify a different one in order to chunk on something else, even a complete word! For instance, by specifying [' '] as the separator, you can avoid cutting words in their middle.\n\n=== 7 ===\nNext, let's take a look at Recursive Character Chunking. Based on the basic concept of Character Chunking, this advanced version takes it up a notch by dividing the text into chunks until a certain condition is met, such as reaching a minimum chunk size. This method ensures that the chunking process aligns with the text's structure, preserving more meaning. Its adaptability makes Recursive Character Chunking great for texts with varied structures.\nAgain, let\u2019s use the same example in order to illustrate this method. With a chunk size of 100, and the default settings for the other parameters, we obtain the following chunks:\n\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/retrieval_qdrant/#retrieval-with-qdrant","title":"Retrieval with Qdrant\u00b6","text":""},{"location":"examples/retrieval_qdrant/#overview","title":"Overview\u00b6","text":""},{"location":"examples/retrieval_qdrant/#setup","title":"Setup\u00b6","text":""},{"location":"examples/retrieval_qdrant/#retrieval","title":"Retrieval\u00b6","text":""},{"location":"examples/run_md/","title":"Run md","text":"In\u00a0[\u00a0]: Copied!
import json\nimport logging\nimport os\nfrom pathlib import Path\nimport json import logging import os from pathlib import Path In\u00a0[\u00a0]: Copied!
import yaml\nimport yaml In\u00a0[\u00a0]: Copied!
from docling.backend.md_backend import MarkdownDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import InputDocument\nfrom docling.backend.md_backend import MarkdownDocumentBackend from docling.datamodel.base_models import InputFormat from docling.datamodel.document import InputDocument In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def main():\n input_paths = [Path(\"README.md\")]\n\n for path in input_paths:\n in_doc = InputDocument(\n path_or_stream=path,\n format=InputFormat.PDF,\n backend=MarkdownDocumentBackend,\n )\n mdb = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=path)\n document = mdb.convert()\n\n out_path = Path(\"scratch\")\n print(f\"Document {path} converted.\\nSaved markdown output to: {out_path!s}\")\n\n # Export Docling document format to markdowndoc:\n fn = os.path.basename(path)\n\n with (out_path / f\"{fn}.md\").open(\"w\") as fp:\n fp.write(document.export_to_markdown())\n\n with (out_path / f\"{fn}.json\").open(\"w\") as fp:\n fp.write(json.dumps(document.export_to_dict()))\n\n with (out_path / f\"{fn}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(document.export_to_dict()))\n def main(): input_paths = [Path(\"README.md\")] for path in input_paths: in_doc = InputDocument( path_or_stream=path, format=InputFormat.PDF, backend=MarkdownDocumentBackend, ) mdb = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=path) document = mdb.convert() out_path = Path(\"scratch\") print(f\"Document {path} converted.\\nSaved markdown output to: {out_path!s}\") # Export Docling document format to markdowndoc: fn = os.path.basename(path) with (out_path / f\"{fn}.md\").open(\"w\") as fp: fp.write(document.export_to_markdown()) with (out_path / f\"{fn}.json\").open(\"w\") as fp: fp.write(json.dumps(document.export_to_dict())) with (out_path / f\"{fn}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(document.export_to_dict())) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/run_with_accelerator/","title":"Accelerator options","text":"
Run conversion with an explicit accelerator configuration (CPU/MPS/CUDA).
What this example does
How to run
python docs/examples/run_with_accelerator.py.AcceleratorOptions examples to try AUTO/MPS/CUDA.Notes
cuda:N device selection (defaults to cuda:0).settings.debug.profile_pipeline_timings = True prints profiling details.AcceleratorDevice.MPS is macOS-only; CUDA requires a compatible GPU and CUDA-enabled PyTorch build. CPU mode works everywhere.from pathlib import Path\n\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n)\nfrom docling.datamodel.settings import settings\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n\ndef main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n # Explicitly set the accelerator\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.AUTO\n # )\n accelerator_options = AcceleratorOptions(\n num_threads=8, device=AcceleratorDevice.CPU\n )\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.MPS\n # )\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.CUDA\n # )\n\n # easyocr doesnt support cuda:N allocation, defaults to cuda:0\n # accelerator_options = AcceleratorOptions(num_threads=8, device=\"cuda:1\")\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.accelerator_options = accelerator_options\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n # Enable the profiling to measure the time spent\n settings.debug.profile_pipeline_timings = True\n\n # Convert the document\n conversion_result = converter.convert(input_doc_path)\n doc = conversion_result.document\n\n # List with total time per document\n doc_conversion_secs = conversion_result.timings[\"pipeline_total\"].times\n\n md = doc.export_to_markdown()\n print(md)\n print(f\"Conversion secs: {doc_conversion_secs}\")\n\n\nif __name__ == \"__main__\":\n main()\n from pathlib import Path from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, ) from docling.datamodel.settings import settings from docling.document_converter import DocumentConverter, PdfFormatOption def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" # Explicitly set the accelerator # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.AUTO # ) accelerator_options = AcceleratorOptions( num_threads=8, device=AcceleratorDevice.CPU ) # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.MPS # ) # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.CUDA # ) # easyocr doesnt support cuda:N allocation, defaults to cuda:0 # accelerator_options = AcceleratorOptions(num_threads=8, device=\"cuda:1\") pipeline_options = PdfPipelineOptions() pipeline_options.accelerator_options = accelerator_options pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( 
pipeline_options=pipeline_options, ) } ) # Enable the profiling to measure the time spent settings.debug.profile_pipeline_timings = True # Convert the document conversion_result = converter.convert(input_doc_path) doc = conversion_result.document # List with total time per document doc_conversion_secs = conversion_result.timings[\"pipeline_total\"].times md = doc.export_to_markdown() print(md) print(f\"Conversion secs: {doc_conversion_secs}\") if __name__ == \"__main__\": main()"},{"location":"examples/run_with_formats/","title":"Multi-format conversion","text":"Run conversion across multiple input formats and customize handling per type.
What this example does
allowed_formats and override format_options per format.scratch/.Prerequisites
docling from your Python environment.PyYAML (pip install pyyaml).How to run
python docs/examples/run_with_formats.py.scratch/ next to where you run the script.scratch/ does not exist, create it before running.Customizing inputs
input_paths to include or remove files on your machine.allowed_formats).Notes
allowed_formats: explicit whitelist of formats that will be processed.format_options: per-format pipeline/backend overrides. Everything is optional; defaults exist.<stem>.md, <stem>.json, and <stem>.yaml in scratch/.import json\nimport logging\nfrom pathlib import Path\n\nimport yaml\n\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import (\n DocumentConverter,\n PdfFormatOption,\n WordFormatOption,\n)\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n input_paths = [\n Path(\"README.md\"),\n Path(\"tests/data/html/wiki_duck.html\"),\n Path(\"tests/data/docx/word_sample.docx\"),\n Path(\"tests/data/docx/lorem_ipsum.docx\"),\n Path(\"tests/data/pptx/powerpoint_sample.pptx\"),\n Path(\"tests/data/2305.03393v1-pg9-img.png\"),\n Path(\"tests/data/pdf/2206.01062.pdf\"),\n Path(\"tests/data/asciidoc/test_01.asciidoc\"),\n ]\n\n ## for defaults use:\n # doc_converter = DocumentConverter()\n\n ## to customize use:\n\n # Below we explicitly whitelist formats and override behavior for some of them.\n # You can omit this block and use the defaults (see above) for a quick start.\n doc_converter = DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[\n InputFormat.PDF,\n InputFormat.IMAGE,\n InputFormat.DOCX,\n InputFormat.HTML,\n InputFormat.PPTX,\n InputFormat.ASCIIDOC,\n InputFormat.CSV,\n InputFormat.MD,\n ], # whitelist formats, non-matching files are ignored.\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend\n ),\n InputFormat.DOCX: WordFormatOption(\n pipeline_cls=SimplePipeline # or set a backend, e.g., MsWordDocumentBackend\n # If you change the backend, remember to import it, e.g.:\n # from docling.backend.msword_backend import MsWordDocumentBackend\n ),\n },\n )\n\n conv_results = doc_converter.convert_all(input_paths)\n\n for res in conv_results:\n out_path = Path(\"scratch\") # ensure this directory exists before running\n print(\n f\"Document {res.input.file.name} converted.\"\n f\"\\nSaved markdown output to: {out_path!s}\"\n )\n _log.debug(res.document._export_to_indented_text(max_text_len=16))\n # Export Docling document to Markdown:\n with (out_path / f\"{res.input.file.stem}.md\").open(\"w\") as fp:\n fp.write(res.document.export_to_markdown())\n\n with (out_path / f\"{res.input.file.stem}.json\").open(\"w\") as fp:\n fp.write(json.dumps(res.document.export_to_dict()))\n\n with (out_path / f\"{res.input.file.stem}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(res.document.export_to_dict()))\n\n\nif __name__ == \"__main__\":\n main()\n import json import logging from pathlib import Path import yaml from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend from docling.datamodel.base_models import InputFormat from docling.document_converter import ( DocumentConverter, PdfFormatOption, WordFormatOption, ) from docling.pipeline.simple_pipeline import SimplePipeline from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline _log = logging.getLogger(__name__) def main(): input_paths = [ Path(\"README.md\"), Path(\"tests/data/html/wiki_duck.html\"), Path(\"tests/data/docx/word_sample.docx\"), Path(\"tests/data/docx/lorem_ipsum.docx\"), Path(\"tests/data/pptx/powerpoint_sample.pptx\"), 
Path(\"tests/data/2305.03393v1-pg9-img.png\"), Path(\"tests/data/pdf/2206.01062.pdf\"), Path(\"tests/data/asciidoc/test_01.asciidoc\"), ] ## for defaults use: # doc_converter = DocumentConverter() ## to customize use: # Below we explicitly whitelist formats and override behavior for some of them. # You can omit this block and use the defaults (see above) for a quick start. doc_converter = DocumentConverter( # all of the below is optional, has internal defaults. allowed_formats=[ InputFormat.PDF, InputFormat.IMAGE, InputFormat.DOCX, InputFormat.HTML, InputFormat.PPTX, InputFormat.ASCIIDOC, InputFormat.CSV, InputFormat.MD, ], # whitelist formats, non-matching files are ignored. format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend ), InputFormat.DOCX: WordFormatOption( pipeline_cls=SimplePipeline # or set a backend, e.g., MsWordDocumentBackend # If you change the backend, remember to import it, e.g.: # from docling.backend.msword_backend import MsWordDocumentBackend ), }, ) conv_results = doc_converter.convert_all(input_paths) for res in conv_results: out_path = Path(\"scratch\") # ensure this directory exists before running print( f\"Document {res.input.file.name} converted.\" f\"\\nSaved markdown output to: {out_path!s}\" ) _log.debug(res.document._export_to_indented_text(max_text_len=16)) # Export Docling document to Markdown: with (out_path / f\"{res.input.file.stem}.md\").open(\"w\") as fp: fp.write(res.document.export_to_markdown()) with (out_path / f\"{res.input.file.stem}.json\").open(\"w\") as fp: fp.write(json.dumps(res.document.export_to_dict())) with (out_path / f\"{res.input.file.stem}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(res.document.export_to_dict())) if __name__ == \"__main__\": main()"},{"location":"examples/serialization/","title":"Serialization","text":"In this notebook we showcase the usage of Docling serializers.
In\u00a0[1]: Copied!%pip install -qU pip docling docling-core~=2.29 rich\n%pip install -qU pip docling docling-core~=2.29 rich
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
DOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\"\n\n# we set some start-stop cues for defining an excerpt to print\nstart_cue = \"Copyright \u00a9 2024\"\nstop_cue = \"Application of NLP to ESG\"\nDOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\" # we set some start-stop cues for defining an excerpt to print start_cue = \"Copyright \u00a9 2024\" stop_cue = \"Application of NLP to ESG\" In\u00a0[3]: Copied!
from rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console(width=210) # for preventing Markdown table wrapped rendering\n\n\ndef print_in_console(text):\n console.print(Panel(text))\nfrom rich.console import Console from rich.panel import Panel console = Console(width=210) # for preventing Markdown table wrapped rendering def print_in_console(text): console.print(Panel(text))
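The start_cue and stop_cue defined earlier are used throughout this notebook to slice an excerpt out of the serialized text via str.find. As a small aside (not part of the original notebook), a helper like the following makes that slicing robust when one of the cues is missing, since str.find returns -1 in that case:
def excerpt(text: str, start: str = start_cue, stop: str = stop_cue) -> str:\n    # Hypothetical helper: fall back to the full text if either cue is not found.\n    i, j = text.find(start), text.find(stop)\n    if i == -1 or j == -1:\n        return text\n    return text[i:j]\n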
We first convert the document:
In\u00a0[4]: Copied!from docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\ndoc = converter.convert(source=DOC_SOURCE).document\nfrom docling.document_converter import DocumentConverter converter = DocumentConverter() doc = converter.convert(source=DOC_SOURCE).document
/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n warnings.warn(warn_msg)\n
We can now apply any BaseDocSerializer on the produced document.
\ud83d\udc49 Note that, to keep the shown output brief, we only print an excerpt.
E.g. below we apply an HTMLDocSerializer:
from docling_core.transforms.serializer.html import HTMLDocSerializer\n\nserializer = HTMLDocSerializer(doc=doc)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\n# we here only print an excerpt to keep the output brief:\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.serializer.html import HTMLDocSerializer serializer = HTMLDocSerializer(doc=doc) ser_result = serializer.serialize() ser_text = ser_result.text # we here only print an excerpt to keep the output brief: print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p> \u2502\n\u2502 <table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM \u2502\n\u2502 2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in \u2502\n\u2502 India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con- \u2502\n\u2502 sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate- \u2502\n\u2502 rials?</td><td>According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.</td></tr></tbody></table> \u2502\n\u2502 <p>Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.</p> \u2502\n\u2502 <p>ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response.</p> \u2502\n\u2502 <h2>Related Work</h2> \u2502\n\u2502 <p>The DocQA integrates multiple AI technologies, namely:</p> \u2502\n\u2502 <p>Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric \u2502\n\u2502 layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et \u2502\n\u2502 al. 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-</p> \u2502\n\u2502 <figure><figcaption>Figure 1: System architecture: Simplified sketch of document question-answering pipeline.</figcaption></figure> \u2502\n\u2502 <p>based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).</p> \u2502\n\u2502 <p> \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
In the following example, we use a MarkdownDocSerializer:
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\n\nserializer = MarkdownDocSerializer(doc=doc)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.serializer.markdown import MarkdownDocSerializer serializer = MarkdownDocSerializer(doc=doc) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 | Report | Question | Answer | \u2502\n\u2502 |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| \u2502\n\u2502 | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | \u2502\n\u2502 | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | \u2502\n\u2502 | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | \u2502\n\u2502 | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | \u2502\n\u2502 | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | \u2502\n\u2502 | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 \u2502\n\u2502 <!-- image --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
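As a side note (not part of the original notebook): with default parameters, the convenience method doc.export_to_markdown() is expected to give essentially the same result as the MarkdownDocSerializer used above, so the serializer route becomes interesting mainly once you start customizing it, as in the next section:
# Hypothetical cross-check of the convenience export against the serializer output:\nshortcut_md = doc.export_to_markdown()\nprint_in_console(shortcut_md[shortcut_md.find(start_cue) : shortcut_md.find(stop_cue)])\n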
Let's now assume we would like to reconfigure the Markdown serialization such that: tables are serialized as triplets instead of Markdown tables, and pictures use a custom placeholder text instead of the default <!-- image -->.
Check out the following configuration and notice the serialization differences in the output further below:
In\u00a0[7]: Copied!from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer\nfrom docling_core.transforms.serializer.markdown import MarkdownParams\n\nserializer = MarkdownDocSerializer(\n doc=doc,\n table_serializer=TripletTableSerializer(),\n params=MarkdownParams(\n image_placeholder=\"<!-- demo picture placeholder -->\",\n # ...\n ),\n)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer from docling_core.transforms.serializer.markdown import MarkdownParams serializer = MarkdownDocSerializer( doc=doc, table_serializer=TripletTableSerializer(), params=MarkdownParams( image_placeholder=\"\", # ... ), ) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The \u2502\n\u2502 rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the \u2502\n\u2502 Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption \u2502\n\u2502 in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were \u2502\n\u2502 made from recycled or renewable materials in FY22. \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . 
\u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 \u2502\n\u2502 <!-- demo picture placeholder --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
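Note that the table serializer and the Markdown parameters are independent knobs: a minimal variation (not from the original notebook) keeps the default Markdown table rendering and only swaps the image placeholder:
serializer = MarkdownDocSerializer(\n    doc=doc,\n    # keep the default table serializer; only customize the picture placeholder\n    params=MarkdownParams(image_placeholder=\"<!-- demo picture placeholder -->\"),\n)\nser_text = serializer.serialize().text\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\n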
In the examples above, we were able to reuse existing implementations for our desired serialization strategy. Let's now assume we want to define custom serialization logic, e.g. we would like picture serialization to include any available picture description (captioning) annotations.
To that end, we first need to revisit our conversion and include all pipeline options needed for picture description enrichment.
In\u00a0[8]: Copied!from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionVlmOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions(\n do_picture_description=True,\n picture_description_options=PictureDescriptionVlmOptions(\n repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\",\n prompt=\"Describe this picture in three to five sentences. Be precise and concise.\",\n ),\n generate_picture_images=True,\n images_scale=2,\n)\n\nconverter = DocumentConverter(\n format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\ndoc = converter.convert(source=DOC_SOURCE).document\n from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PictureDescriptionVlmOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption pipeline_options = PdfPipelineOptions( do_picture_description=True, picture_description_options=PictureDescriptionVlmOptions( repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\", prompt=\"Describe this picture in three to five sentences. Be precise and concise.\", ), generate_picture_images=True, images_scale=2, ) converter = DocumentConverter( format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)} ) doc = converter.convert(source=DOC_SOURCE).document /Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n warnings.warn(warn_msg)\n
We can then define our custom picture serializer:
In\u00a0[9]: Copied!from typing import Any, Optional\n\nfrom docling_core.transforms.serializer.base import (\n BaseDocSerializer,\n SerializationResult,\n)\nfrom docling_core.transforms.serializer.common import create_ser_result\nfrom docling_core.transforms.serializer.markdown import (\n MarkdownParams,\n MarkdownPictureSerializer,\n)\nfrom docling_core.types.doc.document import (\n DoclingDocument,\n ImageRefMode,\n PictureDescriptionData,\n PictureItem,\n)\nfrom typing_extensions import override\n\n\nclass AnnotationPictureSerializer(MarkdownPictureSerializer):\n @override\n def serialize(\n self,\n *,\n item: PictureItem,\n doc_serializer: BaseDocSerializer,\n doc: DoclingDocument,\n separator: Optional[str] = None,\n **kwargs: Any,\n ) -> SerializationResult:\n text_parts: list[str] = []\n\n # reusing the existing result:\n parent_res = super().serialize(\n item=item,\n doc_serializer=doc_serializer,\n doc=doc,\n **kwargs,\n )\n text_parts.append(parent_res.text)\n\n # appending annotations:\n for annotation in item.annotations:\n if isinstance(annotation, PictureDescriptionData):\n text_parts.append(f\"<!-- Picture description: {annotation.text} -->\")\n\n text_res = (separator or \"\\n\").join(text_parts)\n return create_ser_result(text=text_res, span_source=item)\n from typing import Any, Optional from docling_core.transforms.serializer.base import ( BaseDocSerializer, SerializationResult, ) from docling_core.transforms.serializer.common import create_ser_result from docling_core.transforms.serializer.markdown import ( MarkdownParams, MarkdownPictureSerializer, ) from docling_core.types.doc.document import ( DoclingDocument, ImageRefMode, PictureDescriptionData, PictureItem, ) from typing_extensions import override class AnnotationPictureSerializer(MarkdownPictureSerializer): @override def serialize( self, *, item: PictureItem, doc_serializer: BaseDocSerializer, doc: DoclingDocument, separator: Optional[str] = None, **kwargs: Any, ) -> SerializationResult: text_parts: list[str] = [] # reusing the existing result: parent_res = super().serialize( item=item, doc_serializer=doc_serializer, doc=doc, **kwargs, ) text_parts.append(parent_res.text) # appending annotations: for annotation in item.annotations: if isinstance(annotation, PictureDescriptionData): text_parts.append(f\"\") text_res = (separator or \"\\n\").join(text_parts) return create_ser_result(text=text_res, span_source=item) Last but not least, we define a new doc serializer which leverages our custom picture serializer.
Notice the picture description annotations in the output below:
In\u00a0[10]: Copied!serializer = MarkdownDocSerializer(\n doc=doc,\n picture_serializer=AnnotationPictureSerializer(),\n params=MarkdownParams(\n image_mode=ImageRefMode.PLACEHOLDER,\n image_placeholder=\"\",\n ),\n)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nserializer = MarkdownDocSerializer( doc=doc, picture_serializer=AnnotationPictureSerializer(), params=MarkdownParams( image_mode=ImageRefMode.PLACEHOLDER, image_placeholder=\"\", ), ) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 | Report | Question | Answer | \u2502\n\u2502 |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| \u2502\n\u2502 | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | \u2502\n\u2502 | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | \u2502\n\u2502 | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | \u2502\n\u2502 | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | \u2502\n\u2502 | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | \u2502\n\u2502 | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 <!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document \u2502\n\u2502 conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response \u2502\n\u2502 generation step involves generating a response from the information retrieval step. --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/serialization/#serialization","title":"Serialization\u00b6","text":""},{"location":"examples/serialization/#overview","title":"Overview\u00b6","text":""},{"location":"examples/serialization/#setup","title":"Setup\u00b6","text":""},{"location":"examples/serialization/#basic-usage","title":"Basic usage\u00b6","text":""},{"location":"examples/serialization/#configuring-a-serializer","title":"Configuring a serializer\u00b6","text":""},{"location":"examples/serialization/#creating-a-custom-serializer","title":"Creating a custom serializer\u00b6","text":""},{"location":"examples/suryaocr_with_custom_models/","title":"SuryaOCR with custom OCR models","text":"
Example: Integrating SuryaOCR with Docling for PDF OCR and Markdown Export
Overview: convert a PDF with Docling's PDF pipeline using the external SuryaOCR plugin as the OCR engine, then export the result to Markdown.
Prerequisites:
pip install docling-surya; verify that docling imports successfully. Execution:
Run python docs/examples/suryaocr_with_custom_models.py. Notes:
Surya model files are cached under ~/.cache/huggingface; override the location with the HF_HOME env var. The docling-surya package integrates SuryaOCR, which is licensed under the GNU General Public License (GPL). Using this integration may impose GPL obligations on your project. Review the license terms carefully.# Requires \`pip install docling-surya\`\n# See https://pypi.org/project/docling-surya/\nfrom docling_surya import SuryaOcrOptions\n# Requires \`pip install docling-surya\` # See https://pypi.org/project/docling-surya/ from docling_surya import SuryaOcrOptions In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
def main():\n source = \"https://19january2021snapshot.epa.gov/sites/static/files/2016-02/documents/epa_sample_letter_sent_to_commissioners_dated_february_29_2015.pdf\"\n\n pipeline_options = PdfPipelineOptions(\n do_ocr=True,\n ocr_model=\"suryaocr\",\n allow_external_plugins=True,\n ocr_options=SuryaOcrOptions(lang=[\"en\"]),\n )\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),\n InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),\n }\n )\n\n result = converter.convert(source)\n print(result.document.export_to_markdown())\n def main(): source = \"https://19january2021snapshot.epa.gov/sites/static/files/2016-02/documents/epa_sample_letter_sent_to_commissioners_dated_february_29_2015.pdf\" pipeline_options = PdfPipelineOptions( do_ocr=True, ocr_model=\"suryaocr\", allow_external_plugins=True, ocr_options=SuryaOcrOptions(lang=[\"en\"]), ) converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options), InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options), } ) result = converter.convert(source) print(result.document.export_to_markdown()) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/tesseract_lang_detection/","title":"Automatic OCR language detection with tesseract","text":"
Detect language automatically with Tesseract OCR and force full-page OCR.
What this example does
Configures the Tesseract OCR engine with lang=[\"auto\"] so the document language is detected automatically, and forces full-page OCR. How to run
Run python docs/examples/tesseract_lang_detection.py. Notes
TesseractOcrOptions instead of TesseractCliOcrOptions.TESSDATA_PREFIX if Tesseract cannot find language data. Using lang=[\"auto\"] requires traineddata that supports script/language detection on your system.from pathlib import Path\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n TesseractCliOcrOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n\ndef main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n # Set lang=[\"auto\"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions\n # ocr_options = TesseractOcrOptions(lang=[\"auto\"])\n ocr_options = TesseractCliOcrOptions(lang=[\"auto\"])\n\n pipeline_options = PdfPipelineOptions(\n do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options\n )\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n doc = converter.convert(input_doc_path).document\n md = doc.export_to_markdown()\n print(md)\n\n\nif __name__ == \"__main__\":\n main()\n from pathlib import Path from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, TesseractCliOcrOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" # Set lang=[\"auto\"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions # ocr_options = TesseractOcrOptions(lang=[\"auto\"]) ocr_options = TesseractCliOcrOptions(lang=[\"auto\"]) pipeline_options = PdfPipelineOptions( do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options ) converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(input_doc_path).document md = doc.export_to_markdown() print(md) if __name__ == \"__main__\": main()"},{"location":"examples/translate/","title":"Simple translation","text":"Translate extracted text content and regenerate Markdown with embedded images.
What this example does: converts a sample PDF, passes every text item and table cell through translate(), and saves Markdown with embedded images for both the original and the translated text.
Prerequisites
Implement your own translation logic in translate(). How to run
Run python docs/examples/translate.py. Outputs are written to scratch/. Notes
translate() is a placeholder; integrate your preferred translation API/client.import logging\nfrom pathlib import Path\n\nfrom docling_core.types.doc import ImageRefMode, TableItem, TextItem\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\nIMAGE_RESOLUTION_SCALE = 2.0\n\n\n# FIXME: put in your favorite translation code ....\ndef translate(text: str, src: str = \"en\", dest: str = \"de\"):\n _log.warning(\"!!! IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!\")\n # from googletrans import Translator\n\n # Initialize the translator\n # translator = Translator()\n\n # Translate text from English to German\n # text = \"Hello, how are you?\"\n # translated = translator.translate(text, src=\"en\", dest=\"de\")\n\n return text\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\") # ensure this directory exists before saving\n\n # Important: For operating with page images, we must keep them, otherwise the DocumentConverter\n # will destroy them for cleaning up memory.\n # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.\n # scale=1 correspond of a standard 72 DPI image\n # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched\n # with the image field\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n conv_res = doc_converter.convert(input_doc_path)\n conv_doc = conv_res.document\n doc_filename = conv_res.input.file.name\n\n # Save markdown with embedded pictures in original text\n # Tip: create the `scratch/` folder first or adjust `output_dir`.\n md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TextItem):\n element.orig = element.text\n element.text = translate(text=element.text)\n\n elif isinstance(element, TableItem):\n for cell in element.data.table_cells:\n cell.text = translate(text=cell.text)\n\n # Save markdown with embedded pictures in translated text\n md_filename = output_dir / f\"{doc_filename}-with-images-translated.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n\nif __name__ == \"__main__\":\n main()\n import logging from pathlib import Path from docling_core.types.doc import ImageRefMode, TableItem, TextItem from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) IMAGE_RESOLUTION_SCALE = 2.0 # FIXME: put in your favorite translation code .... def translate(text: str, src: str = \"en\", dest: str = \"de\"): _log.warning(\"!!! 
IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!\") # from googletrans import Translator # Initialize the translator # translator = Translator() # Translate text from English to German # text = \"Hello, how are you?\" # translated = translator.translate(text, src=\"en\", dest=\"de\") return text def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # ensure this directory exists before saving # Important: For operating with page images, we must keep them, otherwise the DocumentConverter # will destroy them for cleaning up memory. # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images. # scale=1 correspond of a standard 72 DPI image # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched # with the image field pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) conv_res = doc_converter.convert(input_doc_path) conv_doc = conv_res.document doc_filename = conv_res.input.file.name # Save markdown with embedded pictures in original text # Tip: create the `scratch/` folder first or adjust `output_dir`. md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) for element, _level in conv_res.document.iterate_items(): if isinstance(element, TextItem): element.orig = element.text element.text = translate(text=element.text) elif isinstance(element, TableItem): for cell in element.data.table_cells: cell.text = translate(text=cell.text) # Save markdown with embedded pictures in translated text md_filename = output_dir / f\"{doc_filename}-with-images-translated.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) if __name__ == \"__main__\": main()"},{"location":"examples/visual_grounding/","title":"Visual grounding","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote This example showcases Docling's visual grounding capabilities, which can be combined with any agentic AI / RAG framework.
In this instance, we illustrate these capabilities using the LangChain Docling integration, together with a Milvus vector store and sentence-transformers embeddings.
The generation step uses the Hugging Face Inference API; provide your access token via the environment variable HF_TOKEN. The requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain matplotlib python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain matplotlib python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom dotenv import load_dotenv\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_docling.loader import ExportType\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nSOURCES = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nQUESTION = \"Which are the main AI models in Docling?\"\nPROMPT = PromptTemplate.from_template(\n \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n)\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n import os from pathlib import Path from tempfile import mkdtemp from dotenv import load_dotenv from langchain_core.prompts import PromptTemplate from langchain_docling.loader import ExportType def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") SOURCES = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" QUESTION = \"Which are the main AI models in Docling?\" PROMPT = PromptTemplate.from_template( \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\", ) TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied! from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=PdfPipelineOptions(\n generate_page_images=True,\n images_scale=2.0,\n ),\n )\n }\n)\n from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=PdfPipelineOptions( generate_page_images=True, images_scale=2.0, ), ) } ) We set up a simple doc store for keeping converted documents, as that is needed for visual grounding further below.
In\u00a0[4]: Copied!doc_store = {}\ndoc_store_root = Path(mkdtemp())\nfor source in SOURCES:\n dl_doc = converter.convert(source=source).document\n file_path = Path(doc_store_root / f\"{dl_doc.origin.binary_hash}.json\")\n dl_doc.save_as_json(file_path)\n doc_store[dl_doc.origin.binary_hash] = file_path\n doc_store = {} doc_store_root = Path(mkdtemp()) for source in SOURCES: dl_doc = converter.convert(source=source).document file_path = Path(doc_store_root / f\"{dl_doc.origin.binary_hash}.json\") dl_doc.save_as_json(file_path) doc_store[dl_doc.origin.binary_hash] = file_path Now we can instantiate our loader and load documents.
In\u00a0[5]: Copied!from langchain_docling import DoclingLoader\n\nfrom docling.chunking import HybridChunker\n\nloader = DoclingLoader(\n file_path=SOURCES,\n converter=converter,\n export_type=ExportType.DOC_CHUNKS,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n)\n\ndocs = loader.load()\nfrom langchain_docling import DoclingLoader from docling.chunking import HybridChunker loader = DoclingLoader( file_path=SOURCES, converter=converter, export_type=ExportType.DOC_CHUNKS, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ) docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (648 > 512). Running this sequence through the model will result in indexing errors\n
\ud83d\udc49 NOTE: As you see above, using the HybridChunker can sometimes lead to a warning from the transformers library; however, this is a \"false alarm\": for details, check the Docling FAQ.
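If you want to convince yourself that the warning is indeed harmless, a rough sanity check (not part of the original notebook, and assuming the transformers tokenizer for EMBED_MODEL_ID) is to count tokens per produced chunk and compare against the model limit; note that the text actually embedded may differ slightly from page_content:
from transformers import AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)\n# Longest chunk produced by the loader, measured with the embedding model's tokenizer:\nlongest = max(len(tokenizer.tokenize(d.page_content)) for d in docs)\nprint(f\"longest chunk: {longest} tokens (model_max_length: {tokenizer.model_max_length})\")\n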
Inspecting some sample splits:
In\u00a0[6]: Copied!for d in docs[:3]:\n print(f\"- {d.page_content=}\")\nprint(\"...\")\n for d in docs[:3]: print(f\"- {d.page_content=}\") print(\"...\") - d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R\u00a8 uschlikon, Switzerland'\n- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n- d.page_content='1 Introduction\\nConverting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and printing-optimized characteristic, which discards most structural features and metadata. With the advent of LLMs and popular application patterns such as retrieval-augmented generation (RAG), leveraging the rich content embedded in PDFs has become ever more relevant. In the past decade, several powerful document understanding solutions have emerged on the market, most of which are commercial software, cloud offerings [3] and most recently, multi-modal vision-language models. As of today, only a handful of open-source tools cover PDF conversion, leaving a significant feature and quality gap to proprietary solutions.\\nWith Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.\\nHere is what Docling delivers today:\\n\u00b7 Converts PDF documents to JSON or Markdown format, stable and lightning fast\\n\u00b7 Understands detailed page layout, reading order, locates figures and recovers table structures\\n\u00b7 Extracts metadata from the document, such as title, authors, references and language\\n\u00b7 Optionally applies OCR, e.g. for scanned PDFs\\n\u00b7 Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)\\n\u00b7 Can leverage different accelerators (GPU, MPS, etc).'\n...\nIn\u00a0[7]: Copied!
import json\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom langchain_huggingface.embeddings import HuggingFaceEmbeddings\nfrom langchain_milvus import Milvus\n\nembedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n\n\nmilvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\nvectorstore = Milvus.from_documents(\n documents=docs,\n embedding=embedding,\n collection_name=\"docling_demo\",\n connection_args={\"uri\": milvus_uri},\n index_params={\"index_type\": \"FLAT\"},\n drop_old=True,\n)\n import json from pathlib import Path from tempfile import mkdtemp from langchain_huggingface.embeddings import HuggingFaceEmbeddings from langchain_milvus import Milvus embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID) milvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed vectorstore = Milvus.from_documents( documents=docs, embedding=embedding, collection_name=\"docling_demo\", connection_args={\"uri\": milvus_uri}, index_params={\"index_type\": \"FLAT\"}, drop_old=True, ) In\u00a0[8]: Copied! from langchain.chains import create_retrieval_chain\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain_huggingface import HuggingFaceEndpoint\n\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\nllm = HuggingFaceEndpoint(\n repo_id=GEN_MODEL_ID,\n huggingfacehub_api_token=HF_TOKEN,\n)\n\n\ndef clip_text(text, threshold=100):\n return f\"{text[:threshold]}...\" if len(text) > threshold else text\n from langchain.chains import create_retrieval_chain from langchain.chains.combine_documents import create_stuff_documents_chain from langchain_huggingface import HuggingFaceEndpoint retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K}) llm = HuggingFaceEndpoint( repo_id=GEN_MODEL_ID, huggingfacehub_api_token=HF_TOKEN, ) def clip_text(text, threshold=100): return f\"{text[:threshold]}...\" if len(text) > threshold else text Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\nIn\u00a0[9]: Copied!
from docling.chunking import DocMeta\nfrom docling.datamodel.document import DoclingDocument\n\nquestion_answer_chain = create_stuff_documents_chain(llm, PROMPT)\nrag_chain = create_retrieval_chain(retriever, question_answer_chain)\nresp_dict = rag_chain.invoke({\"input\": QUESTION})\n\nclipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\nprint(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\n from docling.chunking import DocMeta from docling.datamodel.document import DoclingDocument question_answer_chain = create_stuff_documents_chain(llm, PROMPT) rag_chain = create_retrieval_chain(retriever, question_answer_chain) resp_dict = rag_chain.invoke({\"input\": QUESTION}) clipped_answer = clip_text(resp_dict[\"answer\"], threshold=200) print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\") /Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_deprecation.py:131: FutureWarning: 'post' (from 'huggingface_hub.inference._client') is deprecated and will be removed from version '0.31.0'. Making direct POST requests to the inference server is not supported anymore. Please use task methods instead (e.g. `InferenceClient.chat_completion`). If your use case is not supported, please open an issue in https://github.com/huggingface/huggingface_hub.\n warnings.warn(warning_message, FutureWarning)\n
Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nThe main AI models in Docling are:\n1. A layout analysis model, an accurate object-detector for page elements.\n2. TableFormer, a state-of-the-art table structure recognition model.\nIn\u00a0[10]: Copied!
import matplotlib.pyplot as plt\nfrom PIL import ImageDraw\n\nfor i, doc in enumerate(resp_dict[\"context\"][:]):\n image_by_page = {}\n print(f\"Source {i + 1}:\")\n print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n meta = DocMeta.model_validate(doc.metadata[\"dl_meta\"])\n\n # loading the full DoclingDocument from the document store:\n dl_doc = DoclingDocument.load_from_json(doc_store.get(meta.origin.binary_hash))\n\n for doc_item in meta.doc_items:\n if doc_item.prov:\n prov = doc_item.prov[0] # here we only consider the first provenence item\n page_no = prov.page_no\n if img := image_by_page.get(page_no):\n pass\n else:\n page = dl_doc.pages[prov.page_no]\n print(f\" page: {prov.page_no}\")\n img = page.image.pil_image\n image_by_page[page_no] = img\n bbox = prov.bbox.to_top_left_origin(page_height=page.size.height)\n bbox = bbox.normalized(page.size)\n thickness = 2\n padding = thickness + 2\n bbox.l = round(bbox.l * img.width - padding)\n bbox.r = round(bbox.r * img.width + padding)\n bbox.t = round(bbox.t * img.height - padding)\n bbox.b = round(bbox.b * img.height + padding)\n draw = ImageDraw.Draw(img)\n draw.rectangle(\n xy=bbox.as_tuple(),\n outline=\"blue\",\n width=thickness,\n )\n for p in image_by_page:\n img = image_by_page[p]\n plt.figure(figsize=[15, 15])\n plt.imshow(img)\n plt.axis(\"off\")\n plt.show()\n import matplotlib.pyplot as plt from PIL import ImageDraw for i, doc in enumerate(resp_dict[\"context\"][:]): image_by_page = {} print(f\"Source {i + 1}:\") print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\") meta = DocMeta.model_validate(doc.metadata[\"dl_meta\"]) # loading the full DoclingDocument from the document store: dl_doc = DoclingDocument.load_from_json(doc_store.get(meta.origin.binary_hash)) for doc_item in meta.doc_items: if doc_item.prov: prov = doc_item.prov[0] # here we only consider the first provenence item page_no = prov.page_no if img := image_by_page.get(page_no): pass else: page = dl_doc.pages[prov.page_no] print(f\" page: {prov.page_no}\") img = page.image.pil_image image_by_page[page_no] = img bbox = prov.bbox.to_top_left_origin(page_height=page.size.height) bbox = bbox.normalized(page.size) thickness = 2 padding = thickness + 2 bbox.l = round(bbox.l * img.width - padding) bbox.r = round(bbox.r * img.width + padding) bbox.t = round(bbox.t * img.height - padding) bbox.b = round(bbox.b * img.height + padding) draw = ImageDraw.Draw(img) draw.rectangle( xy=bbox.as_tuple(), outline=\"blue\", width=thickness, ) for p in image_by_page: img = image_by_page[p] plt.figure(figsize=[15, 15]) plt.imshow(img) plt.axis(\"off\") plt.show() Source 1:\n text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n page: 3\n
Source 2:\n text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n page: 2\n
Source 3:\n text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n page: 5\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/visual_grounding/#visual-grounding","title":"Visual grounding\u00b6","text":""},{"location":"examples/visual_grounding/#setup","title":"Setup\u00b6","text":""},{"location":"examples/visual_grounding/#document-store-setup","title":"Document store setup\u00b6","text":""},{"location":"examples/visual_grounding/#document-loading","title":"Document loading\u00b6","text":"
We first define our converter, in this case including options for keeping page images (for visual grounding).
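A minimal sketch of such a converter setup (the exact options used in this notebook appear in its code cells; the scale value here is illustrative):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Keep rendered page images in the resulting DoclingDocument so that provenance
# bounding boxes can later be drawn on them for visual grounding.
pipeline_options = PdfPipelineOptions(
    generate_page_images=True,
    images_scale=2.0,  # illustrative; a higher scale gives sharper page images
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)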
"},{"location":"examples/visual_grounding/#ingestion","title":"Ingestion\u00b6","text":""},{"location":"examples/visual_grounding/#rag","title":"RAG\u00b6","text":""},{"location":"examples/visual_grounding/#visual-grounding","title":"Visual grounding\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/","title":"VLM pipeline with remote model","text":"Use the VLM pipeline with remote API models (LM Studio, Ollama, watsonx.ai).
What this example does
Shows how to configure ApiVlmOptions for different VLM providers.
Prerequisites
python-dotenv if using environment files.
requests for HTTP calls and python-dotenv if loading env vars from .env.
How to run
python docs/examples/vlm_pipeline_api_model.py.
Choosing a provider
Pick the desired pipeline_options.vlm_options = ... block below.
Set enable_remote_services=True to permit calling remote APIs.
Notes
http://localhost:1234/v1/chat/completions.http://localhost:11434/v1/chat/completions.WX_API_KEY and WX_PROJECT_ID in env/.env.import json\nimport logging\nimport os\nfrom pathlib import Path\nfrom typing import Optional\n\nimport requests\nfrom docling_core.types.doc.page import SegmentedPage\nfrom dotenv import load_dotenv\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\n### Example of ApiVlmOptions definitions\n\n#### Using LM Studio or VLLM (OpenAI-compatible APIs)\n\n\ndef openai_compatible_vlm_options(\n model: str,\n prompt: str,\n format: ResponseFormat,\n hostname_and_port,\n temperature: float = 0.7,\n max_tokens: int = 4096,\n api_key: str = \"\",\n skip_special_tokens=False,\n):\n headers = {}\n if api_key:\n headers[\"Authorization\"] = f\"Bearer {api_key}\"\n\n options = ApiVlmOptions(\n url=f\"http://{hostname_and_port}/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000\n params=dict(\n model=model,\n max_tokens=max_tokens,\n skip_special_tokens=skip_special_tokens, # needed for VLLM\n ),\n headers=headers,\n prompt=prompt,\n timeout=90,\n scale=2.0,\n temperature=temperature,\n response_format=format,\n )\n return options\n\n\n#### Using LM Studio with OlmOcr model\n\n\ndef lms_olmocr_vlm_options(model: str):\n class OlmocrVlmOptions(ApiVlmOptions):\n def build_prompt(self, page: Optional[SegmentedPage]) -> str:\n if page is None:\n return self.prompt.replace(\"#RAW_TEXT#\", \"\")\n\n anchor = [\n f\"Page dimensions: {int(page.dimension.width)}x{int(page.dimension.height)}\"\n ]\n\n for text_cell in page.textline_cells:\n if not text_cell.text.strip():\n continue\n bbox = text_cell.rect.to_bounding_box().to_bottom_left_origin(\n page.dimension.height\n )\n anchor.append(f\"[{int(bbox.l)}x{int(bbox.b)}] {text_cell.text}\")\n\n for image_cell in page.bitmap_resources:\n bbox = image_cell.rect.to_bounding_box().to_bottom_left_origin(\n page.dimension.height\n )\n anchor.append(\n f\"[Image {int(bbox.l)}x{int(bbox.b)} to {int(bbox.r)}x{int(bbox.t)}]\"\n )\n\n if len(anchor) == 1:\n anchor.append(\n f\"[Image 0x0 to {int(page.dimension.width)}x{int(page.dimension.height)}]\"\n )\n\n # Original prompt uses cells sorting. We are skipping it for simplicity.\n\n raw_text = \"\\n\".join(anchor)\n\n return self.prompt.replace(\"#RAW_TEXT#\", raw_text)\n\n def decode_response(self, text: str) -> str:\n # OlmOcr trained to generate json response with language, rotation and other info\n try:\n generated_json = json.loads(text)\n except json.decoder.JSONDecodeError:\n return \"\"\n\n return generated_json[\"natural_text\"]\n\n options = OlmocrVlmOptions(\n url=\"http://localhost:1234/v1/chat/completions\",\n params=dict(\n model=model,\n ),\n prompt=(\n \"Below is the image of one page of a document, as well as some raw textual\"\n \" content that was previously extracted for it. 
Just return the plain text\"\n \" representation of this document as if you were reading it naturally.\\n\"\n \"Do not hallucinate.\\n\"\n \"RAW_TEXT_START\\n#RAW_TEXT#\\nRAW_TEXT_END\"\n ),\n timeout=90,\n scale=1.0,\n max_size=1024, # from OlmOcr pipeline\n response_format=ResponseFormat.MARKDOWN,\n )\n return options\n\n\n#### Using Ollama\n\n\ndef ollama_vlm_options(model: str, prompt: str):\n options = ApiVlmOptions(\n url=\"http://localhost:11434/v1/chat/completions\", # the default Ollama endpoint\n params=dict(\n model=model,\n ),\n prompt=prompt,\n timeout=90,\n scale=1.0,\n response_format=ResponseFormat.MARKDOWN,\n )\n return options\n\n\n#### Using a cloud service like IBM watsonx.ai\n\n\ndef watsonx_vlm_options(model: str, prompt: str):\n load_dotenv()\n api_key = os.environ.get(\"WX_API_KEY\")\n project_id = os.environ.get(\"WX_PROJECT_ID\")\n\n def _get_iam_access_token(api_key: str) -> str:\n res = requests.post(\n url=\"https://iam.cloud.ibm.com/identity/token\",\n headers={\n \"Content-Type\": \"application/x-www-form-urlencoded\",\n },\n data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\",\n )\n res.raise_for_status()\n api_out = res.json()\n print(f\"{api_out=}\")\n return api_out[\"access_token\"]\n\n options = ApiVlmOptions(\n url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\",\n params=dict(\n model_id=model,\n project_id=project_id,\n parameters=dict(\n max_new_tokens=400,\n ),\n ),\n headers={\n \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key),\n },\n prompt=prompt,\n timeout=60,\n response_format=ResponseFormat.MARKDOWN,\n )\n return options\n\n\n### Usage and conversion\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2305.03393v1-pg9.pdf\"\n\n # Configure the VLM pipeline. Enabling remote services allows HTTP calls to\n # locally hosted APIs (LM Studio, Ollama) or cloud services.\n pipeline_options = VlmPipelineOptions(\n enable_remote_services=True # required when calling remote VLM endpoints\n )\n\n # The ApiVlmOptions() allows to interface with APIs supporting\n # the multi-modal chat interface. Here follow a few example on how to configure those.\n\n # One possibility is self-hosting the model, e.g., via LM Studio, Ollama or VLLM.\n #\n # e.g. 
with VLLM, serve granite-docling with these commands:\n # > vllm serve ibm-granite/granite-docling-258M --revision untied\n #\n # with LM Studio, serve granite-docling with these commands:\n # > lms server start\n # > lms load ibm-granite/granite-docling-258M-mlx\n\n # Example using the Granite-Docling model with LM Studio or VLLM:\n pipeline_options.vlm_options = openai_compatible_vlm_options(\n model=\"granite-docling-258m-mlx\", # For VLLM use \"ibm-granite/granite-docling-258M\"\n hostname_and_port=\"localhost:1234\", # LM studio defaults to port 1234, VLLM to 8000\n prompt=\"Convert this page to docling.\",\n format=ResponseFormat.DOCTAGS,\n api_key=\"\",\n )\n\n # Example using the OlmOcr (dynamic prompt) model with LM Studio:\n # (uncomment the following lines)\n # pipeline_options.vlm_options = lms_olmocr_vlm_options(\n # model=\"hf.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF\",\n # )\n\n # Example using the Granite Vision model with Ollama:\n # (uncomment the following lines)\n # pipeline_options.vlm_options = ollama_vlm_options(\n # model=\"granite3.2-vision:2b\",\n # prompt=\"OCR the full page to markdown.\",\n # )\n\n # Another possibility is using online services, e.g., watsonx.ai.\n # Using watsonx.ai requires setting env variables WX_API_KEY and WX_PROJECT_ID\n # (see the top-level docstring for details). You can use a .env file as well.\n # (uncomment the following lines)\n # pipeline_options.vlm_options = watsonx_vlm_options(\n # model=\"ibm/granite-vision-3-2-2b\", prompt=\"OCR the full page to markdown.\"\n # )\n\n # Create the DocumentConverter and launch the conversion.\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n pipeline_cls=VlmPipeline,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n print(result.document.export_to_markdown())\n\n\nif __name__ == \"__main__\":\n main()\n import json import logging import os from pathlib import Path from typing import Optional import requests from docling_core.types.doc.page import SegmentedPage from dotenv import load_dotenv from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline ### Example of ApiVlmOptions definitions #### Using LM Studio or VLLM (OpenAI-compatible APIs) def openai_compatible_vlm_options( model: str, prompt: str, format: ResponseFormat, hostname_and_port, temperature: float = 0.7, max_tokens: int = 4096, api_key: str = \"\", skip_special_tokens=False, ): headers = {} if api_key: headers[\"Authorization\"] = f\"Bearer {api_key}\" options = ApiVlmOptions( url=f\"http://{hostname_and_port}/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000 params=dict( model=model, max_tokens=max_tokens, skip_special_tokens=skip_special_tokens, # needed for VLLM ), headers=headers, prompt=prompt, timeout=90, scale=2.0, temperature=temperature, response_format=format, ) return options #### Using LM Studio with OlmOcr model def lms_olmocr_vlm_options(model: str): class OlmocrVlmOptions(ApiVlmOptions): def build_prompt(self, page: Optional[SegmentedPage]) -> str: if page is None: return self.prompt.replace(\"#RAW_TEXT#\", \"\") anchor = [ f\"Page dimensions: {int(page.dimension.width)}x{int(page.dimension.height)}\" ] for text_cell 
in page.textline_cells: if not text_cell.text.strip(): continue bbox = text_cell.rect.to_bounding_box().to_bottom_left_origin( page.dimension.height ) anchor.append(f\"[{int(bbox.l)}x{int(bbox.b)}] {text_cell.text}\") for image_cell in page.bitmap_resources: bbox = image_cell.rect.to_bounding_box().to_bottom_left_origin( page.dimension.height ) anchor.append( f\"[Image {int(bbox.l)}x{int(bbox.b)} to {int(bbox.r)}x{int(bbox.t)}]\" ) if len(anchor) == 1: anchor.append( f\"[Image 0x0 to {int(page.dimension.width)}x{int(page.dimension.height)}]\" ) # Original prompt uses cells sorting. We are skipping it for simplicity. raw_text = \"\\n\".join(anchor) return self.prompt.replace(\"#RAW_TEXT#\", raw_text) def decode_response(self, text: str) -> str: # OlmOcr trained to generate json response with language, rotation and other info try: generated_json = json.loads(text) except json.decoder.JSONDecodeError: return \"\" return generated_json[\"natural_text\"] options = OlmocrVlmOptions( url=\"http://localhost:1234/v1/chat/completions\", params=dict( model=model, ), prompt=( \"Below is the image of one page of a document, as well as some raw textual\" \" content that was previously extracted for it. Just return the plain text\" \" representation of this document as if you were reading it naturally.\\n\" \"Do not hallucinate.\\n\" \"RAW_TEXT_START\\n#RAW_TEXT#\\nRAW_TEXT_END\" ), timeout=90, scale=1.0, max_size=1024, # from OlmOcr pipeline response_format=ResponseFormat.MARKDOWN, ) return options #### Using Ollama def ollama_vlm_options(model: str, prompt: str): options = ApiVlmOptions( url=\"http://localhost:11434/v1/chat/completions\", # the default Ollama endpoint params=dict( model=model, ), prompt=prompt, timeout=90, scale=1.0, response_format=ResponseFormat.MARKDOWN, ) return options #### Using a cloud service like IBM watsonx.ai def watsonx_vlm_options(model: str, prompt: str): load_dotenv() api_key = os.environ.get(\"WX_API_KEY\") project_id = os.environ.get(\"WX_PROJECT_ID\") def _get_iam_access_token(api_key: str) -> str: res = requests.post( url=\"https://iam.cloud.ibm.com/identity/token\", headers={ \"Content-Type\": \"application/x-www-form-urlencoded\", }, data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\", ) res.raise_for_status() api_out = res.json() print(f\"{api_out=}\") return api_out[\"access_token\"] options = ApiVlmOptions( url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\", params=dict( model_id=model, project_id=project_id, parameters=dict( max_new_tokens=400, ), ), headers={ \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key), }, prompt=prompt, timeout=60, response_format=ResponseFormat.MARKDOWN, ) return options ### Usage and conversion def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2305.03393v1-pg9.pdf\" # Configure the VLM pipeline. Enabling remote services allows HTTP calls to # locally hosted APIs (LM Studio, Ollama) or cloud services. pipeline_options = VlmPipelineOptions( enable_remote_services=True # required when calling remote VLM endpoints ) # The ApiVlmOptions() allows to interface with APIs supporting # the multi-modal chat interface. Here follow a few example on how to configure those. # One possibility is self-hosting the model, e.g., via LM Studio, Ollama or VLLM. # # e.g. 
with VLLM, serve granite-docling with these commands: # > vllm serve ibm-granite/granite-docling-258M --revision untied # # with LM Studio, serve granite-docling with these commands: # > lms server start # > lms load ibm-granite/granite-docling-258M-mlx # Example using the Granite-Docling model with LM Studio or VLLM: pipeline_options.vlm_options = openai_compatible_vlm_options( model=\"granite-docling-258m-mlx\", # For VLLM use \"ibm-granite/granite-docling-258M\" hostname_and_port=\"localhost:1234\", # LM studio defaults to port 1234, VLLM to 8000 prompt=\"Convert this page to docling.\", format=ResponseFormat.DOCTAGS, api_key=\"\", ) # Example using the OlmOcr (dynamic prompt) model with LM Studio: # (uncomment the following lines) # pipeline_options.vlm_options = lms_olmocr_vlm_options( # model=\"hf.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF\", # ) # Example using the Granite Vision model with Ollama: # (uncomment the following lines) # pipeline_options.vlm_options = ollama_vlm_options( # model=\"granite3.2-vision:2b\", # prompt=\"OCR the full page to markdown.\", # ) # Another possibility is using online services, e.g., watsonx.ai. # Using watsonx.ai requires setting env variables WX_API_KEY and WX_PROJECT_ID # (see the top-level docstring for details). You can use a .env file as well. # (uncomment the following lines) # pipeline_options.vlm_options = watsonx_vlm_options( # model=\"ibm/granite-vision-3-2-2b\", prompt=\"OCR the full page to markdown.\" # ) # Create the DocumentConverter and launch the conversion. doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, pipeline_cls=VlmPipeline, ) } ) result = doc_converter.convert(input_doc_path) print(result.document.export_to_markdown()) if __name__ == \"__main__\": main() In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/experimental/process_table_crops/","title":"Process table crops","text":"In\u00a0[\u00a0]: Copied!
\"\"\"Run Docling on an image using the experimental TableCrops layout model.\"\"\"\n\"\"\"Run Docling on an image using the experimental TableCrops layout model.\"\"\" In\u00a0[\u00a0]: Copied!
from __future__ import annotations\nfrom __future__ import annotations In\u00a0[\u00a0]: Copied!
from pathlib import Path\nfrom pathlib import Path In\u00a0[\u00a0]: Copied!
import docling\nfrom docling.datamodel.document import InputFormat\nfrom docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, ImageFormatOption\nfrom docling.experimental.datamodel.table_crops_layout_options import (\n TableCropsLayoutOptions,\n)\nfrom docling.experimental.models.table_crops_layout_model import TableCropsLayoutModel\nfrom docling.models.factories import get_layout_factory\nimport docling from docling.datamodel.document import InputFormat from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions from docling.document_converter import DocumentConverter, ImageFormatOption from docling.experimental.datamodel.table_crops_layout_options import ( TableCropsLayoutOptions, ) from docling.experimental.models.table_crops_layout_model import TableCropsLayoutModel from docling.models.factories import get_layout_factory In\u00a0[\u00a0]: Copied!
def main() -> None:\n sample_image = \"tests/data/2305.03393v1-table_crop.png\"\n\n pipeline_options = ThreadedPdfPipelineOptions(\n layout_options=TableCropsLayoutOptions(),\n do_table_structure=True,\n generate_page_images=True,\n )\n\n converter = DocumentConverter(\n allowed_formats=[InputFormat.IMAGE],\n format_options={\n InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)\n },\n )\n\n conv_res = converter.convert(sample_image)\n\n print(conv_res.document.tables[0].export_to_markdown())\n def main() -> None: sample_image = \"tests/data/2305.03393v1-table_crop.png\" pipeline_options = ThreadedPdfPipelineOptions( layout_options=TableCropsLayoutOptions(), do_table_structure=True, generate_page_images=True, ) converter = DocumentConverter( allowed_formats=[InputFormat.IMAGE], format_options={ InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options) }, ) conv_res = converter.convert(sample_image) print(conv_res.document.tables[0].export_to_markdown()) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"faq/","title":"FAQ","text":"
This is a collection of FAQs gathered from user questions on https://github.com/docling-project/docling/discussions.
Is Python 3.14 supported? Is Python 3.13 supported? Install conflicts with numpy (python 3.13) Is macOS x86_64 supported? I get this error ImportError: libGL.so.1: cannot open shared object file: No such file or directory Are text styles (bold, underline, etc) supported? How do I run completely offline? Which model weights are needed to run Docling? SSL error downloading model weights Which OCR languages are supported? Some images are missing from MS Word and Powerpoint HybridChunker triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model' How to use flash attention?"},{"location":"faq/#is-python-314-supported","title":"Is Python 3.14 supported?","text":"Python 3.14 is supported from Docling 2.59.0.
"},{"location":"faq/#is-python-313-supported","title":"Is Python 3.13 supported?","text":"Python 3.13 is supported from Docling 2.18.0.
"},{"location":"faq/#install-conflicts-with-numpy-python-313","title":"Install conflicts with numpy (python 3.13)","text":"When using docling-ibm-models>=2.0.7 and deepsearch-glm>=0.26.2 these issues should not show up anymore. Docling supports numpy versions >=1.24.4,<3.0.0 which should match all usages.
For older versions
This has been observed when installing docling and langchain via Poetry.
...\nThus, docling (>=2.7.0,<3.0.0) requires numpy (>=1.26.4,<2.0.0).\nSo, because ... depends on both numpy (>=2.0.2,<3.0.0) and docling (^2.7.0), version solving failed.\n NumPy only added Python 3.13 support starting with a 2.x.y version. To prepare for 3.13, Docling depends on a numpy 2.x.y version on Python 3.13 and on a 1.x.y version otherwise. If you allow Python 3.13 in your pyproject.toml, Poetry will try to reconcile Docling's numpy requirement for 3.13 (some 2.x.y) with LangChain's requirement (some 1.x.y), leading to the error above.
Check if Python 3.13 is among the Python versions allowed by your pyproject.toml and if so, remove it and try again. E.g., if you have python = \"^3.10\", use python = \">=3.10,<3.13\" instead.
If you want to retain compatibility with Python 3.9-3.13, you can also use a version selector in pyproject.toml similar to the following:
numpy = [\n { version = \"^2.1.0\", markers = 'python_version >= \"3.13\"' },\n { version = \"^1.24.4\", markers = 'python_version < \"3.13\"' },\n]\n Source: Issue #283
"},{"location":"faq/#is-macos-x86_64-supported","title":"Is macOS x86_64 supported?","text":"Yes, Docling (still) supports running the standard pipeline on macOS x86_64.
However, users might run into a combination of incompatible dependencies on a fresh install. Docling depends on PyTorch, which dropped support for macOS x86_64 after the 2.2.2 release, and that older PyTorch version works only with NumPy 1.x; users must therefore ensure the correct NumPy version is installed.
pip install docling \"numpy<2.0.0\"\n Source: Issue #1694.
"},{"location":"faq/#i-get-this-error-importerror-libglso1-cannot-open-shared-object-file-no-such-file-or-directory","title":"I get this error ImportError: libGL.so.1: cannot open shared object file: No such file or directory","text":"This error orginates from conflicting OpenCV distribution in some Docling third-party dependencies. opencv-python and opencv-python-headless both define the same python package cv2 and, if installed together, this often creates conflicts. Moreover, the opencv-python package (which is more common) depends on the OpenGL UI framework, which is usually not included for headless environments like Docker containers or remote VMs.
When you encounter the error above, you have two possible solutions.
Solution 1: Force the headless OpenCV (preferred)
pip uninstall -y opencv-python opencv-python-headless\npip install --no-cache-dir opencv-python-headless\n Solution 2: Install the libGL system dependency.
Debian-based: apt-get install libgl1\nRHEL / Fedora: dnf install mesa-libGL\n"},{"location":"faq/#are-text-styles-bold-underline-etc-supported","title":"Are text styles (bold, underline, etc) supported?","text":"Text styles are supported in the DoclingDocument format. Currently only the declarative backends (i.e. the ones used for docx, pptx, markdown, html, etc) are able to set the correct text styles. Support for PDF is not yet available.
Docling does not use any remote services, hence it can run in completely isolated, air-gapped environments.
The only requirement is pointing the Docling runtime to the location where the model artifacts have been stored.
For example
pipeline_options = PdfPipelineOptions(artifacts_path=\"your location\")\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n Source: Issue #326
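If you also need to fetch the weights ahead of time from a machine with internet access, recent Docling versions ship a helper command for prefetching them; verify its availability and options with docling-tools --help in your installation:
# Pre-download the default model weights on a connected machine, then copy the
# resulting folder to the offline environment and point artifacts_path at it.
docling-tools models download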
"},{"location":"faq/#which-model-weights-are-needed-to-run-docling","title":"Which model weights are needed to run Docling?","text":"Model weights are needed for the AI models used in the PDF pipeline. Other document types (docx, pptx, etc) do not have any such requirement.
For processing PDF documents, Docling requires the model weights from https://huggingface.co/ds4sd/docling-models.
When OCR is enabled, some engines also require model artifacts. One example is EasyOCR, for which Docling provides dedicated pipeline options to control the runtime behavior.
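As a sketch of such options (assuming the EasyOcrOptions fields shown below, which are present in recent Docling versions; the directory is a placeholder):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Point EasyOCR at locally stored models and disable downloads at runtime.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(
    model_storage_directory="/path/to/easyocr-models",  # placeholder location
    download_enabled=False,
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)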
"},{"location":"faq/#ssl-error-downloading-model-weights","title":"SSL error downloading model weights","text":"URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)>\n Similar SSL download errors have been observed by some users. This happens when model weights are fetched from Hugging Face. The error could happen when the python environment doesn't have an up-to-date list of trusted certificates.
Possible solutions are:
pip install --upgrade certifi\nPoint SSL_CERT_FILE and REQUESTS_CA_BUNDLE to the value of python -m certifi: CERT_PATH=$(python -m certifi)\nexport SSL_CERT_FILE=${CERT_PATH}\nexport REQUESTS_CA_BUNDLE=${CERT_PATH}\nDocling supports multiple OCR engines, each one with its own list of supported languages. Here is a collection of links to the original OCR engines' documentation listing the supported languages.
Setting the OCR language in Docling is done via the OCR pipeline options:
from docling.datamodel.pipeline_options import PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.ocr_options.lang = [\"fr\", \"de\", \"es\", \"en\"] # example of languages for EasyOCR\n"},{"location":"faq/#some-images-are-missing-from-ms-word-and-powerpoint","title":"Some images are missing from MS Word and Powerpoint","text":"The image processing library used by Docling is able to handle embedded WMF images only on Windows platform. If you are on other operating systems, these images will be ignored.
"},{"location":"faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model","title":"HybridChunker triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model'","text":"TLDR: In the context of the HybridChunker, this is a known & ancitipated \"false alarm\".
Details:
Using the HybridChunker often triggers a warning like this:
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
This warning is emitted by transformers and means that actually running this sequence through the model would result in indexing errors, i.e. the problematic case arises only if one indeed passes the particular sequence through the (embedding) model.
In our case though, this occurs as a \"false alarm\", since what happens is the following: during chunking, the tokenizer is only used to count tokens on intermediate texts, which can be longer than the model limit; these long texts are never passed to the embedding model themselves.
What is important is the actual token length of the produced chunks. The snippet below can be used for getting the actual maximum chunk size (for users wanting to confirm that this does not exceed the model limit):
chunk_max_len = 0\nfor i, chunk in enumerate(chunks):\n ser_txt = chunker.serialize(chunk=chunk)\n ser_tokens = len(tokenizer.tokenize(ser_txt))\n if ser_tokens > chunk_max_len:\n chunk_max_len = ser_tokens\n print(f\"{i}\\t{ser_tokens}\\t{repr(ser_txt[:100])}...\")\nprint(f\"Longest chunk yielded: {chunk_max_len} tokens\")\nprint(f\"Model max length: {tokenizer.model_max_length}\")\n Also see docling#725.
Source: Issue docling-core#119
"},{"location":"faq/#how-to-use-flash-attention","title":"How to use flash attention?","text":"When running models in Docling on CUDA devices, you can enable the usage of the Flash Attention2 library.
Using environment variables:
DOCLING_CUDA_USE_FLASH_ATTENTION2=1\n Using code:
from docling.datamodel.accelerator_options import (\n AcceleratorOptions,\n)\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\n\npipeline_options = VlmPipelineOptions(\n accelerator_options=AcceleratorOptions(cuda_use_flash_attention2=True)\n)\n This requires having the flash-attn package installed. Below are two alternative ways to install it:
# Building from sources (requires the CUDA dev environment)\npip install flash-attn\n\n# Using pre-built wheels (not available in all possible setups)\nFLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn\n"},{"location":"getting_started/installation/","title":"Installation","text":"To use Docling, simply install docling from your Python package manager, e.g. pip:
pip install docling\n Works on macOS, Linux, and Windows, with support for both x86_64 and arm64 architectures.
Alternative PyTorch distributions The Docling models depend on the PyTorch library. Depending on your architecture, you might want to use a different distribution of torch. For example, you might want support for a different accelerator or for a cpu-only version. All the different ways of installing torch are listed on their website https://pytorch.org/.
One common situation is installing on Linux systems with cpu-only support. In this case, we suggest installing Docling with the following options:
# Example for installing the Linux cpu-only version\npip install docling --extra-index-url https://download.pytorch.org/whl/cpu\n Installation on macOS Intel (x86_64) When installing Docling on macOS with Intel processors, you might encounter errors with PyTorch compatibility. This happens because newer PyTorch versions (2.6.0+) no longer provide wheels for Intel-based Macs.
If you're using an Intel Mac, install Docling with a compatible PyTorch version. Note: PyTorch 2.2.2 requires Python 3.12 or lower. Make sure you're not using Python 3.13+.
# For uv users\nuv add torch==2.2.2 torchvision==0.17.2 docling\n\n# For pip users\npip install \"docling[mac_intel]\"\n\n# For Poetry users\npoetry add docling\n"},{"location":"getting_started/installation/#available-extras","title":"Available extras","text":"The docling package is designed to offer a working solution for the Docling default options. Some Docling functionalities require additional third-party packages and are therefore installed only if selected as extras (or installed independently).
The following table summarizes the extras available in the docling package. They can be activated with: pip install \"docling[NAME1,NAME2]\"
asr Installs dependencies for running the ASR pipeline. vlm Installs dependencies for running the VLM pipeline. easyocr Installs the EasyOCR OCR engine. tesserocr Installs the Tesseract binding for using it as OCR engine. ocrmac Installs the OcrMac OCR engine. rapidocr Installs the RapidOCR OCR engine with onnxruntime backend."},{"location":"getting_started/installation/#ocr-engines","title":"OCR engines","text":"Docling supports multiple OCR engines for processing scanned documents. The current version provides the following engines.
Engine Installation Usage EasyOCReasyocr extra or via pip install easyocr. EasyOcrOptions Tesseract System dependency. See description for Tesseract and Tesserocr below. TesseractOcrOptions Tesseract CLI System dependency. See description below. TesseractCliOcrOptions OcrMac System dependency. See description below. OcrMacOptions RapidOCR rapidocr extra can or via pip install rapidocr onnxruntime RapidOcrOptions OnnxTR Can be installed via the plugin system pip install \"docling-ocr-onnxtr[cpu]\". Please take a look at docling-OCR-OnnxTR. OnnxtrOcrOptions The Docling DocumentConverter allows to choose the OCR engine with the ocr_options settings. For example
from docling.datamodel.base_models import ConversionStatus, PipelineOptions\nfrom docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions\nfrom docling.document_converter import DocumentConverter\n\npipeline_options = PipelineOptions()\npipeline_options.do_ocr = True\npipeline_options.ocr_options = TesseractOcrOptions() # Use Tesseract\n\ndoc_converter = DocumentConverter(\n pipeline_options=pipeline_options,\n)\n Tesseract installation Tesseract is a popular OCR engine which is available on most operating systems. For using this engine with Docling, Tesseract must be installed on your system, using the packaging tool of your choice. Below we provide example commands. After installing Tesseract you are expected to provide the path to its language files using the TESSDATA_PREFIX environment variable (note that it must terminate with a slash /).
brew install tesseract leptonica pkg-config\nTESSDATA_PREFIX=/opt/homebrew/share/tessdata/\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config\nTESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n dnf install tesseract tesseract-devel tesseract-langpack-eng tesseract-osd leptonica-devel\nTESSDATA_PREFIX=/usr/share/tesseract/tessdata/\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n Linking to Tesseract The most efficient usage of the Tesseract library is via linking. Docling is using the Tesserocr package for this.
If you get into installation issues of Tesserocr, we suggest using the following installation options:
pip uninstall tesserocr\npip install --no-binary :all: tesserocr\n"},{"location":"getting_started/installation/#development-setup","title":"Development setup","text":"To develop Docling features, bugfixes etc., install as follows from your local clone's root dir:
uv sync --all-extras\n"},{"location":"getting_started/quickstart/","title":"Quickstart","text":""},{"location":"getting_started/quickstart/#basic-usage","title":"Basic usage","text":""},{"location":"getting_started/quickstart/#python","title":"Python","text":"In Docling, working with documents is as simple as:
For example, the snippet below shows conversion with export to Markdown:
from docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\" # file path or URL\nconverter = DocumentConverter()\ndoc = converter.convert(source).document\n\nprint(doc.export_to_markdown()) # output: \"### Docling Technical Report[...]\"\n Docling supports a wide array of file formats and, as outlined in the architecture guide, provides a versatile document model along with a full suite of supported operations.
"},{"location":"getting_started/quickstart/#cli","title":"CLI","text":"You can additionally use Docling directly from your terminal, for instance:
docling https://arxiv.org/pdf/2206.01062\n The CLI provides various options, such as \ud83e\udd5aGraniteDocling (incl. MLX acceleration) & other VLMs:
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062\n For all available options, run docling --help or check the CLI reference.
Check out the Usage subpages (navigation menu on the left) as well as our featured examples for additional usage workflows, including conversion customization, RAG, framework integrations, chunking, serialization, enrichments, and much more!
"},{"location":"integrations/","title":"Integrations","text":"In this space, you can explore various Docling integrations with leading frameworks and tools!
Here some of our picks to get you started:
\ud83d\udc48 ... and there is much more: explore all integrations using the navigation menu on the side
A glimpse into Docling's ecosystem"},{"location":"integrations/apify/","title":"Apify","text":"You can run Docling in the cloud without installation using the Docling Actor on Apify platform. Simply provide a document URL and get the processed result:
apify call vancura/docling -i '{\n \"options\": {\n \"to_formats\": [\"md\", \"json\", \"html\", \"text\", \"doctags\"]\n },\n \"http_sources\": [\n {\"url\": \"https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf\"},\n {\"url\": \"https://arxiv.org/pdf/2408.09869\"}\n ]\n}'\n The Actor stores results in:
OUTPUT_RESULT)DOCLING_LOG)Read more about the Docling Actor, including how to use it via the Apify API and CLI.
Docling is available as a Java integration in Arconia.
Docling is available as an extraction backend in the Bee framework.
Docling is available in Cloudera through the RAG Studio Accelerator for Machine Learning Projects (AMP).
Docling is available in CrewAI as the CrewDoclingSource knowledge source.
Docling is used by the Data Prep Kit open-source toolkit for preparing unstructured data for LLM application development ranging from laptop scale to datacenter scale.
"},{"location":"integrations/data_prep_kit/#components","title":"Components","text":""},{"location":"integrations/data_prep_kit/#pdf-ingestion-to-parquet","title":"PDF ingestion to Parquet","text":"Docling is available as a file conversion method in DocETL:
Docling is available as a converter in Haystack:
Docling is available in Hector as an MCP-based document parser for RAG systems and document stores.
Hector is a production-grade A2A-native agent platform that integrates with Docling via the MCP server for advanced document parsing capabilities.
Docling is powering document processing in InstructLab, enabling users to unlock the knowledge hidden in documents and present it to InstructLab's fine-tuning for aligning AI models to the user's specific data.
More details can be found in this blog post.
Docling is available in Kotaemon as the DoclingReader loader:
Docling is available as an official LangChain extension.
To get started, check out the step-by-step guide in LangChain.
Docling is available on the Langflow visual low-code platform.
Docling is available as an official LlamaIndex extension.
To get started, check out the step-by-step guide in LlamaIndex.
"},{"location":"integrations/llamaindex/#components","title":"Components","text":""},{"location":"integrations/llamaindex/#docling-reader","title":"Docling Reader","text":"Reads document files and uses Docling to populate LlamaIndex Document objects \u2014 either serializing Docling's data model (losslessly, e.g. as JSON) or exporting to a simplified format (lossily, e.g. as Markdown).
Reads LlamaIndex Document objects populated in Docling's format by Docling Reader and, using its knowledge of the Docling format, parses them to LlamaIndex Node objects for downstream usage in LlamaIndex applications, e.g. as chunks for embedding.
Docling is powering the NVIDIA PDF to Podcast agentic AI blueprint:
Docling is available an ingestion engine for OpenContracts, allowing you to use Docling's OCR engine(s), chunker(s), labels, etc. and load them into a platform supporting bulk data extraction, text annotating, and question-answering:
Docling is available as a plugin for Open WebUI.
Docling is available in Prodigy as a Prodigy-PDF plugin recipe.
More details can be found in this blog post.
Docling is available as a Quarkus extension! See the extension documentation for more information.
Docling is powering document processing in Red Hat Enterprise Linux AI (RHEL AI), enabling users to unlock the knowledge hidden in documents and present it to InstructLab's fine-tuning for aligning AI models to the user's specific data.
Docling is available in spaCy as the spaCy Layout plugin.
More details can be found in this blog post.
Docling is available as a text extraction backend for txtai.
Docling is available as a document parser in Vectara.
This page provides documentation for our command line tools.
"},{"location":"reference/cli/#docling","title":"docling","text":"Usage:
docling [OPTIONS] source\n Options:
Name Type Description Default--from choice (docx | pptx | html | image | pdf | asciidoc | md | csv | xlsx | xml_uspto | xml_jats | mets_gbs | json_docling | audio | vtt) Specify input formats to convert from. Defaults to all formats. None --to choice (md | json | html | html_split_page | text | doctags) Specify output formats. Defaults to Markdown. None --show-layout / --no-show-layout boolean If enabled, the page images will show the bounding-boxes of the items. False --headers text Specify http request headers used when fetching url input sources in the form of a JSON string None --image-export-mode choice (placeholder | embedded | referenced) Image export mode for the document (only in case of JSON, Markdown or HTML). With placeholder, only the position of the image is marked in the output. In embedded mode, the image is embedded as base64 encoded string. In referenced mode, the image is exported in PNG format and referenced from the main exported document. ImageRefMode.EMBEDDED --pipeline choice (legacy | standard | vlm | asr) Choose the pipeline to process PDF or image files. ProcessingPipeline.STANDARD --vlm-model choice (smoldocling | smoldocling_vllm | granite_vision | granite_vision_vllm | granite_vision_ollama | got_ocr_2 | granite_docling | granite_docling_vllm) Choose the VLM model to use with PDF or image files. VlmModelType.GRANITEDOCLING --asr-model choice (whisper_tiny | whisper_small | whisper_medium | whisper_base | whisper_large | whisper_turbo | whisper_tiny_mlx | whisper_small_mlx | whisper_medium_mlx | whisper_base_mlx | whisper_large_mlx | whisper_turbo_mlx | whisper_tiny_native | whisper_small_native | whisper_medium_native | whisper_base_native | whisper_large_native | whisper_turbo_native) Choose the ASR model to use with audio/video files. AsrModelType.WHISPER_TINY --ocr / --no-ocr boolean If enabled, the bitmap content will be processed using OCR. True --force-ocr / --no-force-ocr boolean Replace any existing text with OCR generated text over the full content. False --tables / --no-tables boolean If enabled, the table structure model will be used to extract table information. True --ocr-engine text The OCR engine to use. When --allow-external-plugins is not set, the available values are: auto, easyocr, ocrmac, rapidocr, tesserocr, tesseract. Use the option --show-external-plugins to see the options allowed with external plugins. auto --ocr-lang text Provide a comma-separated list of languages used by the OCR engine. Note that each OCR engine has different values for the language names. None --psm integer Page Segmentation Mode for the OCR engine (0-13). None --pdf-backend choice (pypdfium2 | dlparse_v1 | dlparse_v2 | dlparse_v4) The PDF backend to use. PdfBackend.DLPARSE_V4 --pdf-password text Password for protected PDF documents None --table-mode choice (fast | accurate) The mode to use in the table structure model. TableFormerMode.ACCURATE --enrich-code / --no-enrich-code boolean Enable the code enrichment model in the pipeline. False --enrich-formula / --no-enrich-formula boolean Enable the formula enrichment model in the pipeline. False --enrich-picture-classes / --no-enrich-picture-classes boolean Enable the picture classification enrichment model in the pipeline. False --enrich-picture-description / --no-enrich-picture-description boolean Enable the picture description model in the pipeline. False --artifacts-path path If provided, the location of the model artifacts. 
None --enable-remote-services / --no-enable-remote-services boolean Must be enabled when using models connecting to remote services. False --allow-external-plugins / --no-allow-external-plugins boolean Must be enabled for loading modules from third-party plugins. False --show-external-plugins / --no-show-external-plugins boolean List the third-party plugins which are available when the option --allow-external-plugins is set. False --abort-on-error / --no-abort-on-error boolean If enabled, the processing will be aborted when the first error is encountered. False --output path Output directory where results are saved. . --verbose, -v integer Set the verbosity level. -v for info logging, -vv for debug logging. 0 --debug-visualize-cells / --no-debug-visualize-cells boolean Enable debug output which visualizes the PDF cells False --debug-visualize-ocr / --no-debug-visualize-ocr boolean Enable debug output which visualizes the OCR cells False --debug-visualize-layout / --no-debug-visualize-layout boolean Enable debug output which visualizes the layour clusters False --debug-visualize-tables / --no-debug-visualize-tables boolean Enable debug output which visualizes the table cells False --version boolean Show version information. None --document-timeout float The timeout for processing each document, in seconds. None --num-threads integer Number of threads 4 --device choice (auto | cpu | cuda | mps) Accelerator device AcceleratorDevice.AUTO --logo boolean Docling logo None --page-batch-size integer Number of pages processed in one batch. Default: 4 4 --help boolean Show this message and exit. False"},{"location":"reference/docling_document/","title":"Docling Document","text":"This is an automatic generated API reference of the DoclingDocument type.
"},{"location":"reference/docling_document/#docling_core.types.doc","title":"doc","text":"Package for models defined by the Document type.
Classes:
DoclingDocument \u2013 DoclingDocument.
DocumentOrigin \u2013 FileSource.
DocItem \u2013 DocItem.
DocItemLabel \u2013 DocItemLabel.
ProvenanceItem \u2013 ProvenanceItem.
GroupItem \u2013 GroupItem.
GroupLabel \u2013 GroupLabel.
NodeItem \u2013 NodeItem.
PageItem \u2013 PageItem.
FloatingItem \u2013 FloatingItem.
TextItem \u2013 TextItem.
TableItem \u2013 TableItem.
TableCell \u2013 TableCell.
TableData \u2013 BaseTableData.
TableCellLabel \u2013 TableCellLabel.
KeyValueItem \u2013 KeyValueItem.
SectionHeaderItem \u2013 SectionItem.
PictureItem \u2013 PictureItem.
ImageRef \u2013 ImageRef.
PictureClassificationClass \u2013 PictureClassificationData.
PictureClassificationData \u2013 PictureClassificationData.
RefItem \u2013 RefItem.
BoundingBox \u2013 BoundingBox.
CoordOrigin \u2013 CoordOrigin.
ImageRefMode \u2013 ImageRefMode.
Size \u2013 Size.
Bases: BaseModel
DoclingDocument.
Methods:
add_code \u2013 add_code.
add_document \u2013 Adds the content from the body of a DoclingDocument to this document under a specific parent.
add_form \u2013 add_form.
add_formula \u2013 add_formula.
add_group \u2013 add_group.
add_heading \u2013 add_heading.
add_inline_group \u2013 add_inline_group.
add_key_values \u2013 add_key_values.
add_list_group \u2013 add_list_group.
add_list_item \u2013 add_list_item.
add_node_items \u2013 Adds multiple NodeItems and their children under a parent in this document.
add_ordered_list \u2013 add_ordered_list.
add_page \u2013 add_page.
add_picture \u2013 add_picture.
add_table \u2013 add_table.
add_table_cell \u2013 Add a table cell to the table.
add_text \u2013 add_text.
add_title \u2013 add_title.
add_unordered_list \u2013 add_unordered_list.
append_child_item \u2013 Adds an item.
check_version_is_compatible \u2013 Check if this document version is compatible with SDK schema version.
concatenate \u2013 Concatenate multiple documents into a single document.
delete_items \u2013 Deletes an item, given its instance or ref, and any children it has.
delete_items_range \u2013 Deletes all NodeItems and their children in the range from the start NodeItem to the end NodeItem.
export_to_dict \u2013 Export to dict.
export_to_doctags \u2013 Exports the document content to a DocumentToken format.
export_to_document_tokens \u2013 Export to DocTags format.
export_to_element_tree \u2013 Export_to_element_tree.
export_to_html \u2013 Serialize to HTML.
export_to_markdown \u2013 Serialize to Markdown.
export_to_text \u2013 export_to_text.
extract_items_range \u2013 Extracts NodeItems and children in the range from the start NodeItem to the end as a new DoclingDocument.
filter \u2013 Create a new document based on the provided filter parameters.
get_visualization \u2013 Get visualization of the document as images by page.
insert_code \u2013 Creates a new CodeItem item and inserts it into the document.
insert_document \u2013 Inserts the content from the body of a DoclingDocument into this document at a specific position.
insert_form \u2013 Creates a new FormItem item and inserts it into the document.
insert_formula \u2013 Creates a new FormulaItem item and inserts it into the document.
insert_group \u2013 Creates a new GroupItem item and inserts it into the document.
insert_heading \u2013 Creates a new SectionHeaderItem item and inserts it into the document.
insert_inline_group \u2013 Creates a new InlineGroup item and inserts it into the document.
insert_item_after_sibling \u2013 Inserts an item, given its node_item instance, after other as a sibling.
insert_item_before_sibling \u2013 Inserts an item, given its node_item instance, before other as a sibling.
insert_key_values \u2013 Creates a new KeyValueItem item and inserts it into the document.
insert_list_group \u2013 Creates a new ListGroup item and inserts it into the document.
insert_list_item \u2013 Creates a new ListItem item and inserts it into the document.
insert_node_items \u2013 Insert multiple NodeItems and their children at a specific position in the document.
insert_picture \u2013 Creates a new PictureItem item and inserts it into the document.
insert_table \u2013 Creates a new TableItem item and inserts it into the document.
insert_text \u2013 Creates a new TextItem item and inserts it into the document.
insert_title \u2013 Creates a new TitleItem item and inserts it into the document.
iterate_items \u2013 Iterate elements with level.
load_from_doctags \u2013 Load Docling document from lists of DocTags and Images.
load_from_json \u2013 load_from_json.
load_from_yaml \u2013 load_from_yaml.
num_pages \u2013 num_pages.
print_element_tree \u2013 Print_element_tree.
replace_item \u2013 Replace item with new item.
save_as_doctags \u2013 Save the document content to DocTags format.
save_as_document_tokens \u2013 Save the document content to a DocumentToken format.
save_as_html \u2013 Save to HTML.
save_as_json \u2013 Save as json.
save_as_markdown \u2013 Save to markdown.
save_as_yaml \u2013 Save as yaml.
transform_to_content_layer \u2013 transform_to_content_layer.
validate_document \u2013 validate_document.
validate_misplaced_list_items \u2013 validate_misplaced_list_items.
validate_tree \u2013 validate_tree.
Attributes:
body (GroupItem) \u2013 form_items (List[FormItem]) \u2013 furniture (Annotated[GroupItem, Field(deprecated=True)]) \u2013 groups (List[Union[ListGroup, InlineGroup, GroupItem]]) \u2013 key_value_items (List[KeyValueItem]) \u2013 name (str) \u2013 origin (Optional[DocumentOrigin]) \u2013 pages (Dict[int, PageItem]) \u2013 pictures (List[PictureItem]) \u2013 schema_name (Literal['DoclingDocument']) \u2013 tables (List[TableItem]) \u2013 texts (List[Union[TitleItem, SectionHeaderItem, ListItem, CodeItem, FormulaItem, TextItem]]) \u2013 version (Annotated[str, StringConstraints(pattern=VERSION_PATTERN, strict=True)]) \u2013 body: GroupItem = GroupItem(name='_root_', self_ref='#/body')\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.form_items","title":"form_items","text":"form_items: List[FormItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.furniture","title":"furniture","text":"furniture: Annotated[GroupItem, Field(deprecated=True)] = GroupItem(name='_root_', self_ref='#/furniture', content_layer=FURNITURE)\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.groups","title":"groups","text":"groups: List[Union[ListGroup, InlineGroup, GroupItem]] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.key_value_items","title":"key_value_items","text":"key_value_items: List[KeyValueItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.name","title":"name","text":"name: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.origin","title":"origin","text":"origin: Optional[DocumentOrigin] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.pages","title":"pages","text":"pages: Dict[int, PageItem] = {}\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.pictures","title":"pictures","text":"pictures: List[PictureItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.schema_name","title":"schema_name","text":"schema_name: Literal['DoclingDocument'] = 'DoclingDocument'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.tables","title":"tables","text":"tables: List[TableItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.texts","title":"texts","text":"texts: List[Union[TitleItem, SectionHeaderItem, ListItem, CodeItem, FormulaItem, TextItem]] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.version","title":"version","text":"version: Annotated[str, StringConstraints(pattern=VERSION_PATTERN, strict=True)] = CURRENT_VERSION\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_code","title":"add_code","text":"add_code(text: str, code_language: Optional[CodeLanguageLabel] = None, orig: Optional[str] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_code.
Parameters:
text (str) \u2013 str:
code_language (Optional[CodeLanguageLabel], default: None ) \u2013 Optional[CodeLanguageLabel]: (Default value = None)
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
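For illustration, a minimal sketch of add_code on a document built from scratch (the document name and the snippet text are arbitrary examples, not part of the API):
from docling_core.types.doc import DoclingDocument

# Create an empty document and append a code block under its body.
doc = DoclingDocument(name="snippets")
doc.add_code(text="print('hello, docling')")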
add_document(doc: DoclingDocument, parent: Optional[NodeItem] = None) -> None\n Adds the content from the body of a DoclingDocument to this document under a specific parent.
Parameters:
doc (DoclingDocument) \u2013 DoclingDocument: The document whose content will be added
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: The parent NodeItem under which new items are added (Default value = None)
Returns:
None \u2013 None
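A hedged sketch of merging one document into another with add_document (both documents and their contents are made up for illustration):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc_a = DoclingDocument(name="a")
doc_b = DoclingDocument(name="b")
doc_b.add_text(label=DocItemLabel.TEXT, text="A paragraph that originates in doc_b.")

# With parent=None, doc_b's body content is appended under doc_a.body.
doc_a.add_document(doc_b)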
add_form(graph: GraphData, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None)\n add_form.
Parameters:
graph (GraphData) \u2013 GraphData:
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_formula(text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_formula.
Parameters:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_group(label: Optional[GroupLabel] = None, name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n add_group.
Parameters:
label (Optional[GroupLabel], default: None ) \u2013 Optional[GroupLabel]: (Default value = None)
name (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_heading(text: str, orig: Optional[str] = None, level: LevelNumber = 1, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_heading.
Parameters:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
level (LevelNumber, default: 1 ) \u2013 LevelNumber: (Default value = 1)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
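A short sketch combining add_title and add_heading to build a simple outline (names and texts are illustrative only):
from docling_core.types.doc import DoclingDocument

doc = DoclingDocument(name="report")
doc.add_title(text="Quarterly report")           # document title
doc.add_heading(text="Introduction", level=1)    # top-level section header
doc.add_heading(text="Data sources", level=2)    # nested section header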
add_inline_group(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> InlineGroup\n add_inline_group.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_key_values","title":"add_key_values","text":"add_key_values(graph: GraphData, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None)\n add_key_values.
Parameters:
graph (GraphData) \u2013 GraphData:
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_list_group(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> ListGroup\n add_list_group.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_list_item","title":"add_list_item","text":"add_list_item(text: str, enumerated: bool = False, marker: Optional[str] = None, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_list_item.
Parameters:
enumerated (bool, default: False ) \u2013 bool: (Default value = False)
marker (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
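A minimal sketch of building a list: create a ListGroup with add_list_group, then attach items to it via the parent argument (the list name and item texts are arbitrary):
from docling_core.types.doc import DoclingDocument

doc = DoclingDocument(name="lists")
group = doc.add_list_group(name="groceries")   # container for the list items
doc.add_list_item(text="Milk", parent=group)
doc.add_list_item(text="Bread", parent=group)
# enumerated=True and marker="1." can be passed for numbered lists.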
add_node_items(node_items: List[NodeItem], doc: DoclingDocument, parent: Optional[NodeItem] = None) -> None\n Adds multiple NodeItems and their children under a parent in this document.
Parameters:
node_items (List[NodeItem]) \u2013 list[NodeItem]: The NodeItems to be added
doc (DoclingDocument) \u2013 DoclingDocument: The document to which the NodeItems and their children belong
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: The parent NodeItem under which new items are added (Default value = None)
Returns:
None \u2013 None
add_ordered_list(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n add_ordered_list.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_page","title":"add_page","text":"add_page(page_no: int, size: Size, image: Optional[ImageRef] = None) -> PageItem\n add_page.
Parameters:
page_no (int) \u2013 int:
size (Size) \u2013 Size:
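A hedged sketch of registering a page: Size is assumed to be importable from docling_core.types.doc, and the page dimensions below are arbitrary (roughly US Letter in points):
from docling_core.types.doc import DoclingDocument, Size  # Size assumed to be exported here

doc = DoclingDocument(name="paged")
doc.add_page(page_no=1, size=Size(width=612.0, height=792.0))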
add_picture(annotations: Optional[List[PictureDataType]] = None, image: Optional[ImageRef] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None)\n add_picture.
Parameters:
annotations (Optional[List[PictureDataType]], default: None ) \u2013 Optional[List[PictureDataType]]: (Default value = None)
image (Optional[ImageRef], default: None ) \u2013 Optional[ImageRef]: (Default value = None)
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_table(data: TableData, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, label: DocItemLabel = TABLE, content_layer: Optional[ContentLayer] = None, annotations: Optional[list[TableAnnotationType]] = None)\n add_table.
Parameters:
data (TableData) \u2013 TableData:
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
label (DocItemLabel, default: TABLE ) \u2013 DocItemLabel: (Default value = DocItemLabel.TABLE)
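A heavily hedged sketch of add_table: the TableData and TableCell field names used below (num_rows, num_cols, table_cells, the start/end row and column offsets, column_header) are assumptions not spelled out on this page, and both classes are assumed to be importable from docling_core.types.doc:
from docling_core.types.doc import DoclingDocument, TableData, TableCell  # assumed exports

doc = DoclingDocument(name="tables")
# Assumed TableCell layout: text plus start/end row and column offsets (spans default to 1).
cells = [
    TableCell(text="Metric", start_row_offset_idx=0, end_row_offset_idx=1,
              start_col_offset_idx=0, end_col_offset_idx=1, column_header=True),
    TableCell(text="Value", start_row_offset_idx=0, end_row_offset_idx=1,
              start_col_offset_idx=1, end_col_offset_idx=2, column_header=True),
    TableCell(text="Accuracy", start_row_offset_idx=1, end_row_offset_idx=2,
              start_col_offset_idx=0, end_col_offset_idx=1),
    TableCell(text="0.97", start_row_offset_idx=1, end_row_offset_idx=2,
              start_col_offset_idx=1, end_col_offset_idx=2),
]
doc.add_table(data=TableData(num_rows=2, num_cols=2, table_cells=cells))
# add_table_cell (documented next) can then grow the table incrementally, e.g. on doc.tables[-1].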
add_table_cell(table_item: TableItem, cell: TableCell) -> None\n Add a table cell to the table.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_text","title":"add_text","text":"add_text(label: DocItemLabel, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_text.
Parameters:
label (DocItemLabel) \u2013 DocItemLabel:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
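A minimal sketch of add_text, which creates plain text items with an explicit DocItemLabel (the texts are illustrative):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="text-demo")
doc.add_text(label=DocItemLabel.TEXT, text="A body paragraph.")
doc.add_text(label=DocItemLabel.FOOTNOTE, text="A footnote attached to the body.")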
add_title(text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_title.
Parameters:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_unordered_list(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n add_unordered_list.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.append_child_item","title":"append_child_item","text":"append_child_item(*, child: NodeItem, parent: Optional[NodeItem] = None) -> None\n Adds an item.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.check_version_is_compatible","title":"check_version_is_compatible","text":"check_version_is_compatible(v: str) -> str\n Check if this document version is compatible with SDK schema version.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.concatenate","title":"concatenate","text":"concatenate(docs: Sequence[DoclingDocument]) -> DoclingDocument\n Concatenate multiple documents into a single document.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.delete_items","title":"delete_items","text":"delete_items(*, node_items: List[NodeItem]) -> None\n Deletes an item, given its instance or ref, and any children it has.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.delete_items_range","title":"delete_items_range","text":"delete_items_range(*, start: NodeItem, end: NodeItem, start_inclusive: bool = True, end_inclusive: bool = True) -> None\n Deletes all NodeItems and their children in the range from the start NodeItem to the end NodeItem.
Parameters:
start (NodeItem) \u2013 NodeItem: The starting NodeItem of the range
end (NodeItem) \u2013 NodeItem: The ending NodeItem of the range
start_inclusive (bool, default: True ) \u2013 bool: (Default value = True): If True, the start NodeItem will also be deleted
end_inclusive (bool, default: True ) \u2013 bool: (Default value = True): If True, the end NodeItem will also be deleted
Returns:
None \u2013 None
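A hedged sketch of delete_items_range on a small synthetic document; the start and end items here are direct children of the body, and both ends are removed because the inclusive flags default to True:
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
for i in range(5):
    doc.add_text(label=DocItemLabel.TEXT, text=f"Paragraph {i}")

# Remove the second through fourth paragraphs.
doc.delete_items_range(start=doc.texts[1], end=doc.texts[3])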
export_to_dict(mode: str = 'json', by_alias: bool = True, exclude_none: bool = True, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None) -> Dict[str, Any]\n Export to dict.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(delim: str = '', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True, add_page_index: bool = True, add_table_cell_location: bool = False, add_table_cell_text: bool = True, minified: bool = False, pages: Optional[set[int]] = None) -> str\n Exports the document content to a DocumentToken format.
Operates on a slice of the document's body as defined through arguments from_element and to_element; defaulting to the whole main_text.
Parameters:
delim (str, default: '' ) \u2013 str: (Default value = \"\") Deprecated
from_element (int, default: 0 ) \u2013 int: (Default value = 0)
to_element (int, default: maxsize ) \u2013 int: (Default value = maxsize)
labels (Optional[set[DocItemLabel]], default: None ) \u2013 Optional[set[DocItemLabel]]: (Default value = None)
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
add_page_index (bool, default: True ) \u2013 bool: (Default value = True)
add_table_cell_location (bool, default: False ) \u2013 bool: (Default value = False)
add_table_cell_text (bool, default: True ) \u2013 bool: (Default value = True)
minified (bool, default: False ) \u2013 bool: (Default value = False)
pages (Optional[set[int]], default: None ) \u2013 set[int]: (Default value = None)
Returns:
str \u2013 The content of the document formatted as a DocTags string.
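A short sketch of export_to_doctags on a small in-memory document (the content is illustrative; the argument values simply exercise two of the flags above):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
doc.add_heading(text="Results", level=1)
doc.add_text(label=DocItemLabel.TEXT, text="Accuracy improved noticeably.")

doctags = doc.export_to_doctags(add_page_index=False, minified=True)
print(doctags)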
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_element_tree","title":"export_to_element_tree","text":"export_to_element_tree() -> str\n Export_to_element_tree.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_html","title":"export_to_html","text":"export_to_html(from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, enable_chart_tables: bool = True, image_mode: ImageRefMode = PLACEHOLDER, formula_to_mathml: bool = True, page_no: Optional[int] = None, html_lang: str = 'en', html_head: str = 'null', included_content_layers: Optional[set[ContentLayer]] = None, split_page_view: bool = False, include_annotations: bool = True) -> str\n Serialize to HTML.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(delim: str = '\\n\\n', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, strict_text: bool = False, escape_html: bool = True, escape_underscores: bool = True, image_placeholder: str = '<!-- image -->', enable_chart_tables: bool = True, image_mode: ImageRefMode = PLACEHOLDER, indent: int = 4, text_width: int = -1, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, page_break_placeholder: Optional[str] = None, include_annotations: bool = True, mark_annotations: bool = False, *, use_legacy_annotations: Optional[bool] = None, allowed_meta_names: Optional[set[str]] = None, blocked_meta_names: Optional[set[str]] = None, mark_meta: bool = False) -> str\n Serialize to Markdown.
Operates on a slice of the document's body as defined through arguments from_element and to_element; defaulting to the whole document.
Parameters:
delim (str, default: '\\n\\n' ) \u2013 Deprecated.
from_element (int, default: 0 ) \u2013 Body slicing start index (inclusive). (Default value = 0).
to_element (int, default: maxsize ) \u2013 Body slicing stop index (exclusive). (Default value = maxint).
labels (Optional[set[DocItemLabel]], default: None ) \u2013 The set of document labels to include in the export. None falls back to the system-defined default.
strict_text (bool, default: False ) \u2013 Deprecated.
escape_html (bool, default: True ) \u2013 bool: Whether to escape HTML reserved characters in the text content of the document. (Default value = True).
escape_underscores (bool, default: True ) \u2013 bool: Whether to escape underscores in the text content of the document. (Default value = True).
image_placeholder (str, default: '<!-- image -->' ) \u2013 The placeholder to include to position images in the markdown. (Default value = "<!-- image -->").
image_mode (ImageRefMode, default: PLACEHOLDER ) \u2013 The mode to use for including images in the markdown. (Default value = ImageRefMode.PLACEHOLDER).
indent (int, default: 4 ) \u2013 The indent in spaces of the nested lists. (Default value = 4).
included_content_layers (Optional[set[ContentLayer]], default: None ) \u2013 The set of content layers to include in the export. None falls back to the system-defined default.
page_break_placeholder (Optional[str], default: None ) \u2013 The placeholder to include for marking page breaks. None means no page break placeholder will be used.
include_annotations (bool, default: True ) \u2013 bool: Whether to include annotations in the export; only considered if item does not have meta. (Default value = True).
mark_annotations (bool, default: False ) \u2013 bool: Whether to mark annotations in the export; only considered if item does not have meta. (Default value = False).
use_legacy_annotations (Optional[bool], default: None ) \u2013 bool: Deprecated; legacy annotations considered only when meta not present.
mark_meta (bool, default: False ) \u2013 bool: Whether to mark meta in the export
allowed_meta_names (Optional[set[str]], default: None ) \u2013 Optional[set[str]]: Meta names to allow; None means all meta names are allowed.
blocked_meta_names (Optional[set[str]], default: None ) \u2013 Optional[set[str]]: Meta names to block; takes precedence over allowed_meta_names.
Returns:
str \u2013 The exported Markdown representation.
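A sketch of export_to_markdown using a few of the options above; the placeholder strings are arbitrary examples:
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
doc.add_title(text="Demo")
doc.add_text(label=DocItemLabel.TEXT, text="Some text with under_scores.")

md = doc.export_to_markdown(
    escape_underscores=False,
    image_placeholder="",                          # drop image placeholders entirely
    page_break_placeholder="<!-- page break -->",
)
print(md)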
export_to_text(delim: str = '\\n\\n', from_element: int = 0, to_element: int = 1000000, labels: Optional[set[DocItemLabel]] = None) -> str\n export_to_text.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.extract_items_range","title":"extract_items_range","text":"extract_items_range(*, start: NodeItem, end: NodeItem, start_inclusive: bool = True, end_inclusive: bool = True, delete: bool = False) -> DoclingDocument\n Extracts NodeItems and children in the range from the start NodeItem to the end as a new DoclingDocument.
Parameters:
start (NodeItem) \u2013 NodeItem: The starting NodeItem of the range (must be a direct child of the document body)
end (NodeItem) \u2013 NodeItem: The ending NodeItem of the range (must be a direct child of the document body)
start_inclusive (bool, default: True ) \u2013 bool: (Default value = True): If True, the start NodeItem will also be extracted
end_inclusive (bool, default: True ) \u2013 bool: (Default value = True): If True, the end NodeItem will also be extracted
delete (bool, default: False ) \u2013 bool: (Default value = False): If True, extracted items are deleted in the original document
Returns:
DoclingDocument \u2013 DoclingDocument: A new document containing the extracted NodeItems and their children
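A hedged sketch of extract_items_range; as stated above, start and end must be direct children of the document body, which holds for the paragraphs created here:
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
for i in range(5):
    doc.add_text(label=DocItemLabel.TEXT, text=f"Paragraph {i}")

# Copy the second through fourth paragraphs into a new document; delete=False keeps the original intact.
excerpt = doc.extract_items_range(start=doc.texts[1], end=doc.texts[3], delete=False)
print(excerpt.export_to_markdown())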
filter(page_nrs: Optional[set[int]] = None) -> DoclingDocument\n Create a new document based on the provided filter parameters.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.get_visualization","title":"get_visualization","text":"get_visualization(show_label: bool = True, show_branch_numbering: bool = False, viz_mode: Literal['reading_order', 'key_value'] = 'reading_order', show_cell_id: bool = False) -> dict[Optional[int], Image]\n Get visualization of the document as images by page.
Parameters:
show_label (bool, default: True ) \u2013 Show labels on elements (applies to all visualizers).
show_branch_numbering (bool, default: False ) \u2013 Show branch numbering (reading order visualizer only).
viz_mode (Literal['reading_order', 'key_value'], default: 'reading_order' ) \u2013 Which visualizer to use. One of 'reading_order' (default), 'key_value'.
show_cell_id (bool, default: False ) \u2013 Show cell IDs (key value visualizer only).
Returns:
dict[Optional[int], PILImage.Image] \u2013 Dictionary mapping page numbers to PIL images.
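A hedged sketch of get_visualization; it assumes doc is an existing DoclingDocument whose pages carry images (for example, one produced by a PDF conversion with page images enabled), and the output filenames are hypothetical:
# `doc` is assumed to be a DoclingDocument with page images available.
images = doc.get_visualization(show_label=True, viz_mode="reading_order")
for page_no, image in images.items():
    image.save(f"viz_page_{page_no}.png")  # hypothetical output path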
insert_code(sibling: NodeItem, text: str, code_language: Optional[CodeLanguageLabel] = None, orig: Optional[str] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> CodeItem\n Creates a new CodeItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
code_language (Optional[CodeLanguageLabel], default: None ) \u2013 Optional[CodeLanguageLabel]: (Default value = None)
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
CodeItem \u2013 CodeItem: The newly created CodeItem item.
insert_document(doc: DoclingDocument, sibling: NodeItem, after: bool = True) -> None\n Inserts the content from the body of a DoclingDocument into this document at a specific position.
Parameters:
doc (DoclingDocument) \u2013 DoclingDocument: The document whose content will be inserted
sibling (NodeItem) \u2013 NodeItem: The NodeItem after/before which the new items will be inserted
after (bool, default: True ) \u2013 bool: If True, insert after the sibling; if False, insert before (Default value = True)
Returns:
None \u2013 None
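A sketch contrasting insert_document with add_document: instead of appending at the end of the body, the content is spliced in next to an existing sibling (documents and texts are illustrative):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc_a = DoclingDocument(name="a")
doc_a.add_text(label=DocItemLabel.TEXT, text="First paragraph of A.")
doc_a.add_text(label=DocItemLabel.TEXT, text="Second paragraph of A.")

doc_b = DoclingDocument(name="b")
doc_b.add_text(label=DocItemLabel.TEXT, text="Interlude taken from B.")

# Insert B's body content right after the first paragraph of A.
doc_a.insert_document(doc_b, sibling=doc_a.texts[0], after=True)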
insert_form(sibling: NodeItem, graph: GraphData, prov: Optional[ProvenanceItem] = None, after: bool = True) -> FormItem\n Creates a new FormItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
graph (GraphData) \u2013 GraphData:
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
FormItem \u2013 FormItem: The newly created FormItem item.
insert_formula(sibling: NodeItem, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> FormulaItem\n Creates a new FormulaItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
FormulaItem \u2013 FormulaItem: The newly created FormulaItem item.
insert_group(sibling: NodeItem, label: Optional[GroupLabel] = None, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> GroupItem\n Creates a new GroupItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
label (Optional[GroupLabel], default: None ) \u2013 Optional[GroupLabel]: (Default value = None)
name (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
GroupItem \u2013 GroupItem: The newly created GroupItem.
insert_heading(sibling: NodeItem, text: str, orig: Optional[str] = None, level: LevelNumber = 1, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> SectionHeaderItem\n Creates a new SectionHeaderItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
level (LevelNumber, default: 1 ) \u2013 LevelNumber: (Default value = 1)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
SectionHeaderItem \u2013 SectionHeaderItem: The newly created SectionHeaderItem item.
insert_inline_group(sibling: NodeItem, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> InlineGroup\n Creates a new InlineGroup item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
name (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
InlineGroup \u2013 InlineGroup: The newly created InlineGroup item.
insert_item_after_sibling(*, new_item: NodeItem, sibling: NodeItem) -> None\n Inserts an item, given its node_item instance, after other as a sibling.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_item_before_sibling","title":"insert_item_before_sibling","text":"insert_item_before_sibling(*, new_item: NodeItem, sibling: NodeItem) -> None\n Inserts an item, given its node_item instance, before other as a sibling.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_key_values","title":"insert_key_values","text":"insert_key_values(sibling: NodeItem, graph: GraphData, prov: Optional[ProvenanceItem] = None, after: bool = True) -> KeyValueItem\n Creates a new KeyValueItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
graph (GraphData) \u2013 GraphData:
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
KeyValueItem \u2013 KeyValueItem: The newly created KeyValueItem item.
insert_list_group(sibling: NodeItem, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> ListGroup\n Creates a new ListGroup item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
name (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
ListGroup \u2013 ListGroup: The newly created ListGroup item.
insert_list_item(sibling: NodeItem, text: str, enumerated: bool = False, marker: Optional[str] = None, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> ListItem\n Creates a new ListItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
enumerated (bool, default: False ) \u2013 bool: (Default value = False)
marker (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
ListItem \u2013 ListItem: The newly created ListItem item.
insert_node_items(sibling: NodeItem, node_items: List[NodeItem], doc: DoclingDocument, after: bool = True) -> None\n Insert multiple NodeItems and their children at a specific position in the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem: The NodeItem after/before which the new items will be inserted
node_items (List[NodeItem]) \u2013 list[NodeItem]: The NodeItems to be inserted
doc (DoclingDocument) \u2013 DoclingDocument: The document to which the NodeItems and their children belong
after (bool, default: True ) \u2013 bool: If True, insert after the sibling; if False, insert before (Default value = True)
Returns:
None \u2013 None
insert_picture(sibling: NodeItem, annotations: Optional[List[PictureDataType]] = None, image: Optional[ImageRef] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> PictureItem\n Creates a new PictureItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
annotations (Optional[List[PictureDataType]], default: None ) \u2013 Optional[List[PictureDataType]]: (Default value = None)
image (Optional[ImageRef], default: None ) \u2013 Optional[ImageRef]: (Default value = None)
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
PictureItem \u2013 PictureItem: The newly created PictureItem item.
insert_table(sibling: NodeItem, data: TableData, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, label: DocItemLabel = TABLE, content_layer: Optional[ContentLayer] = None, annotations: Optional[list[TableAnnotationType]] = None, after: bool = True) -> TableItem\n Creates a new TableItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
data (TableData) \u2013 TableData:
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
label (DocItemLabel, default: TABLE ) \u2013 DocItemLabel: (Default value = DocItemLabel.TABLE)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
annotations (Optional[list[TableAnnotationType]], default: None ) \u2013 Optional[List[TableAnnotationType]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
TableItem \u2013 TableItem: The newly created TableItem item.
insert_text(sibling: NodeItem, label: DocItemLabel, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> TextItem\n Creates a new TextItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
label (DocItemLabel) \u2013 DocItemLabel:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
TextItem \u2013 TextItem: The newly created TextItem item.
insert_title(sibling: NodeItem, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> TitleItem\n Creates a new TitleItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
TitleItem \u2013 TitleItem: The newly created TitleItem item.
iterate_items(root: Optional[NodeItem] = None, with_groups: bool = False, traverse_pictures: bool = False, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, _level: int = 0) -> Iterable[Tuple[NodeItem, int]]\n Iterate elements with level.
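A minimal sketch of iterate_items, walking the document tree in reading order and printing each item with its nesting level (the small document is built inline for illustration):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
doc.add_heading(text="Section", level=1)
doc.add_text(label=DocItemLabel.TEXT, text="A paragraph in the section.")

for item, level in doc.iterate_items(with_groups=True):
    print("  " * level, type(item).__name__)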
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_doctags","title":"load_from_doctags","text":"load_from_doctags(doctag_document: DocTagsDocument, document_name: str = 'Document') -> DoclingDocument\n Load Docling document from lists of DocTags and Images.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_json","title":"load_from_json","text":"load_from_json(filename: Union[str, Path]) -> DoclingDocument\n load_from_json.
Parameters:
filename (Union[str, Path]) \u2013 The filename to load a saved DoclingDocument from a .json.
Returns:
DoclingDocument \u2013 The loaded DoclingDocument.
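A round-trip sketch pairing save_as_json (documented further below) with load_from_json; the file name is a hypothetical example, and load_from_json is called here as a classmethod:
from docling_core.types.doc import DoclingDocument

doc = DoclingDocument(name="demo")
doc.add_title(text="Round trip")

doc.save_as_json("roundtrip.json")                           # hypothetical output path
restored = DoclingDocument.load_from_json("roundtrip.json")
assert restored.name == "demo"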
load_from_yaml(filename: Union[str, Path]) -> DoclingDocument\n load_from_yaml.
Parameters:
filename (Union[str, Path]) \u2013 The filename to load a YAML-serialized DoclingDocument from.
Returns:
DoclingDocument \u2013 The loaded DoclingDocument.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.num_pages","title":"num_pages","text":"num_pages()\n num_pages.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.print_element_tree","title":"print_element_tree","text":"print_element_tree()\n Print_element_tree.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.replace_item","title":"replace_item","text":"replace_item(*, new_item: NodeItem, old_item: NodeItem) -> None\n Replace item with new item.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_doctags","title":"save_as_doctags","text":"save_as_doctags(filename: Union[str, Path], delim: str = '', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True, add_page_index: bool = True, add_table_cell_location: bool = False, add_table_cell_text: bool = True, minified: bool = False)\n Save the document content to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_document_tokens","title":"save_as_document_tokens","text":"save_as_document_tokens(*args, **kwargs)\n Save the document content to a DocumentToken format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_html","title":"save_as_html","text":"save_as_html(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, image_mode: ImageRefMode = PLACEHOLDER, formula_to_mathml: bool = True, page_no: Optional[int] = None, html_lang: str = 'en', html_head: str = 'null', included_content_layers: Optional[set[ContentLayer]] = None, split_page_view: bool = False, include_annotations: bool = True)\n Save to HTML.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_json","title":"save_as_json","text":"save_as_json(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, image_mode: ImageRefMode = EMBEDDED, indent: int = 2, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None)\n Save as json.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_markdown","title":"save_as_markdown","text":"save_as_markdown(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, delim: str = '\\n\\n', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, strict_text: bool = False, escape_html: bool = True, escaping_underscores: bool = True, image_placeholder: str = '<!-- image -->', image_mode: ImageRefMode = PLACEHOLDER, indent: int = 4, text_width: int = -1, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, page_break_placeholder: Optional[str] = None, include_annotations: bool = True, *, mark_meta: bool = False, use_legacy_annotations: Optional[bool] = None)\n Save to markdown.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_yaml","title":"save_as_yaml","text":"save_as_yaml(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, image_mode: ImageRefMode = EMBEDDED, default_flow_style: bool = False, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None)\n Save as yaml.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.transform_to_content_layer","title":"transform_to_content_layer","text":"transform_to_content_layer(data: Any) -> Any\n transform_to_content_layer.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_document","title":"validate_document","text":"validate_document() -> Self\n validate_document.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_misplaced_list_items","title":"validate_misplaced_list_items","text":"validate_misplaced_list_items()\n validate_misplaced_list_items.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_tree","title":"validate_tree","text":"validate_tree(root: NodeItem) -> bool\n validate_tree.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin","title":"DocumentOrigin","text":" Bases: BaseModel
FileSource.
Methods:
parse_hex_string \u2013 parse_hex_string.
validate_mimetype \u2013 validate_mimetype.
Attributes:
binary_hash (Uint64) \u2013 filename (str) \u2013 mimetype (str) \u2013 uri (Optional[AnyUrl]) \u2013 binary_hash: Uint64\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.filename","title":"filename","text":"filename: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.mimetype","title":"mimetype","text":"mimetype: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.uri","title":"uri","text":"uri: Optional[AnyUrl] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.parse_hex_string","title":"parse_hex_string","text":"parse_hex_string(value)\n parse_hex_string.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.validate_mimetype","title":"validate_mimetype","text":"validate_mimetype(v)\n validate_mimetype.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem","title":"DocItem","text":" Bases: NodeItem
DocItem.
Methods:
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image of this DocItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 label (DocItemLabel) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 self_ref (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.label","title":"label","text":"label: DocItemLabel\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel","title":"DocItemLabel","text":" Bases: str, Enum
DocItemLabel.
Methods:
get_color \u2013 Return the RGB color associated with a given label.
Attributes:
CAPTION \u2013 CHART \u2013 CHECKBOX_SELECTED \u2013 CHECKBOX_UNSELECTED \u2013 CODE \u2013 DOCUMENT_INDEX \u2013 EMPTY_VALUE \u2013 FOOTNOTE \u2013 FORM \u2013 FORMULA \u2013 GRADING_SCALE \u2013 HANDWRITTEN_TEXT \u2013 KEY_VALUE_REGION \u2013 LIST_ITEM \u2013 PAGE_FOOTER \u2013 PAGE_HEADER \u2013 PARAGRAPH \u2013 PICTURE \u2013 REFERENCE \u2013 SECTION_HEADER \u2013 TABLE \u2013 TEXT \u2013 TITLE \u2013 CAPTION = 'caption'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHART","title":"CHART","text":"CHART = 'chart'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHECKBOX_SELECTED","title":"CHECKBOX_SELECTED","text":"CHECKBOX_SELECTED = 'checkbox_selected'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHECKBOX_UNSELECTED","title":"CHECKBOX_UNSELECTED","text":"CHECKBOX_UNSELECTED = 'checkbox_unselected'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CODE","title":"CODE","text":"CODE = 'code'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.DOCUMENT_INDEX","title":"DOCUMENT_INDEX","text":"DOCUMENT_INDEX = 'document_index'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.EMPTY_VALUE","title":"EMPTY_VALUE","text":"EMPTY_VALUE = 'empty_value'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FOOTNOTE","title":"FOOTNOTE","text":"FOOTNOTE = 'footnote'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FORM","title":"FORM","text":"FORM = 'form'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FORMULA","title":"FORMULA","text":"FORMULA = 'formula'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.GRADING_SCALE","title":"GRADING_SCALE","text":"GRADING_SCALE = 'grading_scale'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.HANDWRITTEN_TEXT","title":"HANDWRITTEN_TEXT","text":"HANDWRITTEN_TEXT = 'handwritten_text'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.KEY_VALUE_REGION","title":"KEY_VALUE_REGION","text":"KEY_VALUE_REGION = 'key_value_region'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.LIST_ITEM","title":"LIST_ITEM","text":"LIST_ITEM = 'list_item'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PAGE_FOOTER","title":"PAGE_FOOTER","text":"PAGE_FOOTER = 'page_footer'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PAGE_HEADER","title":"PAGE_HEADER","text":"PAGE_HEADER = 'page_header'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PARAGRAPH","title":"PARAGRAPH","text":"PARAGRAPH = 'paragraph'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PICTURE","title":"PICTURE","text":"PICTURE = 'picture'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.REFERENCE","title":"REFERENCE","text":"REFERENCE = 'reference'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.SECTION_HEADER","title":"SECTION_HEADER","text":"SECTION_HEADER = 'section_header'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TABLE","title":"TABLE","text":"TABLE = 'table'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TEXT","title":"TEXT","text":"TEXT = 
'text'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TITLE","title":"TITLE","text":"TITLE = 'title'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.get_color","title":"get_color","text":"get_color(label: DocItemLabel) -> Tuple[int, int, int]\n Return the RGB color associated with a given label.
"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem","title":"ProvenanceItem","text":" Bases: BaseModel
ProvenanceItem.
Attributes:
bbox (BoundingBox) \u2013 charspan (Tuple[int, int]) \u2013 page_no (int) \u2013 bbox: BoundingBox\n"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.charspan","title":"charspan","text":"charspan: Tuple[int, int]\n"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.page_no","title":"page_no","text":"page_no: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem","title":"GroupItem","text":" Bases: NodeItem
GroupItem.
Methods:
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 label (GroupLabel) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 name (str) \u2013 parent (Optional[RefItem]) \u2013 self_ref (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.label","title":"label","text":"label: GroupLabel = UNSPECIFIED\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.name","title":"name","text":"name: str = 'group'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel","title":"GroupLabel","text":" Bases: str, Enum
GroupLabel.
Attributes:
CHAPTER \u2013 COMMENT_SECTION \u2013 FORM_AREA \u2013 INLINE \u2013 KEY_VALUE_AREA \u2013 LIST \u2013 ORDERED_LIST \u2013 PICTURE_AREA \u2013 SECTION \u2013 SHEET \u2013 SLIDE \u2013 UNSPECIFIED \u2013 CHAPTER = 'chapter'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.COMMENT_SECTION","title":"COMMENT_SECTION","text":"COMMENT_SECTION = 'comment_section'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.FORM_AREA","title":"FORM_AREA","text":"FORM_AREA = 'form_area'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.INLINE","title":"INLINE","text":"INLINE = 'inline'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.KEY_VALUE_AREA","title":"KEY_VALUE_AREA","text":"KEY_VALUE_AREA = 'key_value_area'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.LIST","title":"LIST","text":"LIST = 'list'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.ORDERED_LIST","title":"ORDERED_LIST","text":"ORDERED_LIST = 'ordered_list'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.PICTURE_AREA","title":"PICTURE_AREA","text":"PICTURE_AREA = 'picture_area'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SECTION","title":"SECTION","text":"SECTION = 'section'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SHEET","title":"SHEET","text":"SHEET = 'sheet'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SLIDE","title":"SLIDE","text":"SLIDE = 'slide'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.UNSPECIFIED","title":"UNSPECIFIED","text":"UNSPECIFIED = 'unspecified'\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem","title":"NodeItem","text":" Bases: BaseModel
NodeItem.
Methods:
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 self_ref (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem","title":"PageItem","text":" Bases: BaseModel
PageItem.
Attributes:
image (Optional[ImageRef]) \u2013 page_no (int) \u2013 size (Size) \u2013 image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.page_no","title":"page_no","text":"page_no: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.size","title":"size","text":"size: Size\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem","title":"FloatingItem","text":" Bases: DocItem
FloatingItem.
Methods:
caption_text \u2013 Computes the caption as a single text.
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
captions (List[RefItem]) \u2013 children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 footnotes (List[RefItem]) \u2013 image (Optional[ImageRef]) \u2013 label (DocItemLabel) \u2013 meta (Optional[FloatingMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 references (List[RefItem]) \u2013 self_ref (str) \u2013 captions: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.children","title":"children","text":"children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.image","title":"image","text":"image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.label","title":"label","text":"label: DocItemLabel\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.meta","title":"meta","text":"meta: Optional[FloatingMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.references","title":"references","text":"references: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem","title":"TextItem","text":" Bases: DocItem
TextItem.
Methods:
export_to_doctags \u2013 Export text element to document tokens format.
export_to_document_tokens \u2013 Export to DocTags format.
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image of this DocItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 formatting (Optional[Formatting]) \u2013 hyperlink (Optional[Union[AnyUrl, Path]]) \u2013 label (Literal[CAPTION, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FOOTNOTE, PAGE_FOOTER, PAGE_HEADER, PARAGRAPH, REFERENCE, TEXT, EMPTY_VALUE]) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 orig (str) \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 self_ref (str) \u2013 text (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.formatting","title":"formatting","text":"formatting: Optional[Formatting] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.hyperlink","title":"hyperlink","text":"hyperlink: Optional[Union[AnyUrl, Path]] = Field(union_mode='left_to_right', default=None)\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.label","title":"label","text":"label: Literal[CAPTION, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FOOTNOTE, PAGE_FOOTER, PAGE_HEADER, PARAGRAPH, REFERENCE, TEXT, EMPTY_VALUE]\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.orig","title":"orig","text":"orig: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.text","title":"text","text":"text: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n Export text element to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
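A hedged sketch of this call, where doc is a DoclingDocument and item is one of its text items; the return value is assumed here to be the DocTags string for the element:

doctags = item.export_to_doctags(
    doc,
    xsize=500,          # width of the location token grid (default)
    ysize=500,          # height of the location token grid (default)
    add_location=True,  # include location tokens
    add_content=True,   # include the text content
)
print(doctags)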
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem","title":"TableItem","text":" Bases: FloatingItem
TableItem.
Methods:
add_annotation \u2013 Add an annotation to the table.
caption_text \u2013 Computes the caption as a single text.
export_to_dataframe \u2013 Export the table as a Pandas DataFrame.
export_to_doctags \u2013 Export table to document tokens format.
export_to_document_tokens \u2013 Export to DocTags format.
export_to_html \u2013 Export the table as html.
export_to_markdown \u2013 Export the table as markdown.
export_to_otsl \u2013 Export the table as OTSL.
get_annotations \u2013 Get the annotations of this TableItem.
get_image \u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
annotations (Annotated[List[TableAnnotationType], deprecated('Field `annotations` is deprecated; use `meta` instead.')]) \u2013 captions (List[RefItem]) \u2013 children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 data (TableData) \u2013 footnotes (List[RefItem]) \u2013 image (Optional[ImageRef]) \u2013 label (Literal[DOCUMENT_INDEX, TABLE]) \u2013 meta (Optional[FloatingMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 references (List[RefItem]) \u2013 self_ref (str) \u2013 annotations: Annotated[List[TableAnnotationType], deprecated('Field `annotations` is deprecated; use `meta` instead.')] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.captions","title":"captions","text":"captions: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.children","title":"children","text":"children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.data","title":"data","text":"data: TableData\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.image","title":"image","text":"image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.label","title":"label","text":"label: Literal[DOCUMENT_INDEX, TABLE] = TABLE\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.meta","title":"meta","text":"meta: Optional[FloatingMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.references","title":"references","text":"references: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.add_annotation","title":"add_annotation","text":"add_annotation(annotation: TableAnnotationType) -> None\n Add an annotation to the table.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_dataframe","title":"export_to_dataframe","text":"export_to_dataframe(doc: Optional[DoclingDocument] = None) -> DataFrame\n Export the table as a Pandas DataFrame.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_cell_location: bool = True, add_cell_text: bool = True, add_caption: bool = True)\n Export table to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_cell_location (bool, default: True ) \u2013 bool: (Default value = True)
add_cell_text (bool, default: True ) \u2013 bool: (Default value = True)
add_caption (bool, default: True ) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_html","title":"export_to_html","text":"export_to_html(doc: Optional[DoclingDocument] = None, add_caption: bool = True) -> str\n Export the table as html.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(doc: Optional[DoclingDocument] = None) -> str\n Export the table as markdown.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_otsl","title":"export_to_otsl","text":"export_to_otsl(doc: DoclingDocument, add_cell_location: bool = True, add_cell_text: bool = True, xsize: int = 500, ysize: int = 500, **kwargs: Any) -> str\n Export the table as OTSL.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this TableItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell","title":"TableCell","text":" Bases: BaseModel
TableCell.
Methods:
from_dict_format \u2013 from_dict_format.
Attributes:
bbox (Optional[BoundingBox]) \u2013 col_span (int) \u2013 column_header (bool) \u2013 end_col_offset_idx (int) \u2013 end_row_offset_idx (int) \u2013 fillable (bool) \u2013 row_header (bool) \u2013 row_section (bool) \u2013 row_span (int) \u2013 start_col_offset_idx (int) \u2013 start_row_offset_idx (int) \u2013 text (str) \u2013 bbox: Optional[BoundingBox] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.col_span","title":"col_span","text":"col_span: int = 1\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.column_header","title":"column_header","text":"column_header: bool = False\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.end_col_offset_idx","title":"end_col_offset_idx","text":"end_col_offset_idx: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.end_row_offset_idx","title":"end_row_offset_idx","text":"end_row_offset_idx: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.fillable","title":"fillable","text":"fillable: bool = False\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_header","title":"row_header","text":"row_header: bool = False\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_section","title":"row_section","text":"row_section: bool = False\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_span","title":"row_span","text":"row_span: int = 1\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.start_col_offset_idx","title":"start_col_offset_idx","text":"start_col_offset_idx: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.start_row_offset_idx","title":"start_row_offset_idx","text":"start_row_offset_idx: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.text","title":"text","text":"text: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.from_dict_format","title":"from_dict_format","text":"from_dict_format(data: Any) -> Any\n from_dict_format.
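A small construction sketch, assuming TableCell is imported from docling_core.types.doc; the values are illustrative only:

from docling_core.types.doc import TableCell

cell = TableCell(
    text="Revenue",
    row_span=1,
    col_span=2,                 # header cell spanning two columns
    start_row_offset_idx=0,
    end_row_offset_idx=1,
    start_col_offset_idx=0,
    end_col_offset_idx=2,
    column_header=True,
)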
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData","title":"TableData","text":" Bases: BaseModel
TableData.
Methods:
add_row \u2013 Add a new row to the table from a list of strings.
add_rows \u2013 Add multiple new rows to the table from a list of lists of strings.
get_column_bounding_boxes \u2013 Get the minimal bounding box for each column in the table.
get_row_bounding_boxes \u2013 Get the minimal bounding box for each row in the table.
insert_row \u2013 Insert a new row from a list of strings before/after a specific index in the table.
insert_rows \u2013 Insert multiple new rows from a list of lists of strings before/after a specific index in the table.
pop_row \u2013 Remove and return the last row from the table.
remove_row \u2013 Remove a row from the table by its index.
remove_rows \u2013 Remove rows from the table by their indices.
Attributes:
grid (List[List[TableCell]]) \u2013 grid.
num_cols (int) \u2013 num_rows (int) \u2013 table_cells (List[AnyTableCell]) \u2013 grid: List[List[TableCell]]\n grid.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.num_cols","title":"num_cols","text":"num_cols: int = 0\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.num_rows","title":"num_rows","text":"num_rows: int = 0\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.table_cells","title":"table_cells","text":"table_cells: List[AnyTableCell] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.add_row","title":"add_row","text":"add_row(row: List[str]) -> None\n Add a new row to the table from a list of strings.
Parameters:
row (List[str]) \u2013 List[str]: A list of strings representing the content of the new row.
Returns:
None \u2013 None
add_rows(rows: List[List[str]]) -> None\n Add multiple new rows to the table from a list of lists of strings.
Parameters:
rows (List[List[str]]) \u2013 List[List[str]]: A list of lists, where each inner list represents the content of a new row.
Returns:
None \u2013 None
get_column_bounding_boxes() -> dict[int, BoundingBox]\n Get the minimal bounding box for each column in the table.
Returns: dict[int, BoundingBox]: A mapping from column index to the minimal bounding box that encompasses all cells in that column that carry bounding boxes.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.get_row_bounding_boxes","title":"get_row_bounding_boxes","text":"get_row_bounding_boxes() -> dict[int, BoundingBox]\n Get the minimal bounding box for each row in the table.
Returns: dict[int, BoundingBox]: A mapping from row index to the minimal bounding box that encompasses all cells in that row that carry bounding boxes.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.insert_row","title":"insert_row","text":"insert_row(row_index: int, row: List[str], after: bool = False) -> None\n Insert a new row from a list of strings before/after a specific index in the table.
Parameters:
row_index (int) \u2013 int: The index at which to insert the new row. (Starting from 0)
row (List[str]) \u2013 List[str]: A list of strings representing the content of the new row.
after (bool, default: False ) \u2013 bool: If True, insert the row after the specified index, otherwise before it. (Default is False)
Returns:
None \u2013 None
insert_rows(row_index: int, rows: List[List[str]], after: bool = False) -> None\n Insert multiple new rows from a list of lists of strings before/after a specific index in the table.
Parameters:
row_index (int) \u2013 int: The index at which to insert the new rows. (Starting from 0)
rows (List[List[str]]) \u2013 List[List[str]]: A list of lists, where each inner list represents the content of a new row.
after (bool, default: False ) \u2013 bool: If True, insert the rows after the specified index, otherwise before it. (Default is False)
Returns:
None \u2013 None
pop_row(doc: Optional[DoclingDocument] = None) -> List[TableCell]\n Remove and return the last row from the table.
Returns:
List[TableCell] \u2013 List[TableCell]: A list of TableCell objects representing the popped row.
remove_row(row_index: int, doc: Optional[DoclingDocument] = None) -> List[TableCell]\n Remove a row from the table by its index.
Parameters:
row_index (int) \u2013 int: The index of the row to remove. (Starting from 0)
Returns:
List[TableCell] \u2013 List[TableCell]: A list of TableCell objects representing the removed row.
remove_rows(indices: List[int], doc: Optional[DoclingDocument] = None) -> List[List[TableCell]]\n Remove rows from the table by their indices.
Parameters:
indices (List[int]) \u2013 List[int]: A list of indices of the rows to remove. (Starting from 0)
Returns:
List[List[TableCell]] \u2013 List[List[TableCell]]: A list representation of the removed rows as lists of TableCell objects.
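A hedged sketch of the row helpers above, assuming an empty TableData can be created and then populated through them:

from docling_core.types.doc import TableData

data = TableData(num_rows=0, num_cols=2)
data.add_row(["Name", "Score"])                       # single row from strings
data.add_rows([["alpha", "0.9"], ["beta", "0.7"]])    # several rows at once
data.insert_row(1, ["gamma", "0.8"], after=False)     # insert before index 1
removed = data.remove_row(2)                          # returns the removed cells
last = data.pop_row()                                 # removes and returns the last row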
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel","title":"TableCellLabel","text":" Bases: str, Enum
TableCellLabel.
Methods:
get_color \u2013 Return the RGB color associated with a given label.
Attributes:
BODY \u2013 COLUMN_HEADER \u2013 ROW_HEADER \u2013 ROW_SECTION \u2013 BODY = 'body'\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.COLUMN_HEADER","title":"COLUMN_HEADER","text":"COLUMN_HEADER = 'col_header'\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.ROW_HEADER","title":"ROW_HEADER","text":"ROW_HEADER = 'row_header'\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.ROW_SECTION","title":"ROW_SECTION","text":"ROW_SECTION = 'row_section'\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.get_color","title":"get_color","text":"get_color(label: TableCellLabel) -> Tuple[int, int, int]\n Return the RGB color associated with a given label.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem","title":"KeyValueItem","text":" Bases: FloatingItem
KeyValueItem.
Methods:
caption_text \u2013 Computes the caption as a single text.
export_to_document_tokens \u2013 Export key value item to document tokens format.
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
captions (List[RefItem]) \u2013 children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 footnotes (List[RefItem]) \u2013 graph (GraphData) \u2013 image (Optional[ImageRef]) \u2013 label (Literal[KEY_VALUE_REGION]) \u2013 meta (Optional[FloatingMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 references (List[RefItem]) \u2013 self_ref (str) \u2013 captions: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.children","title":"children","text":"children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.graph","title":"graph","text":"graph: GraphData\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.image","title":"image","text":"image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.label","title":"label","text":"label: Literal[KEY_VALUE_REGION] = KEY_VALUE_REGION\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.meta","title":"meta","text":"meta: Optional[FloatingMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.references","title":"references","text":"references: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.export_to_document_tokens","title":"export_to_document_tokens","text":"export_to_document_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n Export key value item to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem","title":"SectionHeaderItem","text":" Bases: TextItem
SectionHeaderItem.
Methods:
export_to_doctags \u2013 Export text element to document tokens format.
export_to_document_tokens \u2013 Export to DocTags format.
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image of this DocItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 formatting (Optional[Formatting]) \u2013 hyperlink (Optional[Union[AnyUrl, Path]]) \u2013 label (Literal[SECTION_HEADER]) \u2013 level (LevelNumber) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 orig (str) \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 self_ref (str) \u2013 text (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.formatting","title":"formatting","text":"formatting: Optional[Formatting] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.hyperlink","title":"hyperlink","text":"hyperlink: Optional[Union[AnyUrl, Path]] = Field(union_mode='left_to_right', default=None)\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.label","title":"label","text":"label: Literal[SECTION_HEADER] = SECTION_HEADER\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.level","title":"level","text":"level: LevelNumber = 1\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.orig","title":"orig","text":"orig: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.text","title":"text","text":"text: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n Export text element to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem","title":"PictureItem","text":" Bases: FloatingItem
PictureItem.
Methods:
caption_text \u2013 Computes the caption as a single text.
export_to_doctags \u2013 Export picture to document tokens format.
export_to_document_tokens \u2013 Export to DocTags format.
export_to_html \u2013 Export picture to HTML format.
export_to_markdown \u2013 Export picture to Markdown format.
get_annotations \u2013 Get the annotations of this PictureItem.
get_image \u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
annotations (Annotated[List[PictureDataType], deprecated('Field `annotations` is deprecated; use `meta` instead.')]) \u2013 captions (List[RefItem]) \u2013 children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 footnotes (List[RefItem]) \u2013 image (Optional[ImageRef]) \u2013 label (Literal[PICTURE, CHART]) \u2013 meta (Optional[PictureMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 references (List[RefItem]) \u2013 self_ref (str) \u2013 annotations: Annotated[List[PictureDataType], deprecated('Field `annotations` is deprecated; use `meta` instead.')] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.captions","title":"captions","text":"captions: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.children","title":"children","text":"children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.image","title":"image","text":"image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.label","title":"label","text":"label: Literal[PICTURE, CHART] = PICTURE\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.meta","title":"meta","text":"meta: Optional[PictureMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.references","title":"references","text":"references: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_caption: bool = True, add_content: bool = True)\n Export picture to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_caption (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_html","title":"export_to_html","text":"export_to_html(doc: DoclingDocument, add_caption: bool = True, image_mode: ImageRefMode = PLACEHOLDER) -> str\n Export picture to HTML format.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(doc: DoclingDocument, add_caption: bool = True, image_mode: ImageRefMode = EMBEDDED, image_placeholder: str = '<!-- image -->') -> str\n Export picture to Markdown format.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this PictureItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef","title":"ImageRef","text":" Bases: BaseModel
ImageRef.
Methods:
from_pil \u2013 Construct ImageRef from a PIL Image.
validate_mimetype \u2013 validate_mimetype.
Attributes:
dpi (int) \u2013 mimetype (str) \u2013 pil_image (Optional[Image]) \u2013 Return the PIL Image.
size (Size) \u2013 uri (Union[AnyUrl, Path]) \u2013 dpi: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.mimetype","title":"mimetype","text":"mimetype: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.pil_image","title":"pil_image","text":"pil_image: Optional[Image]\n Return the PIL Image.
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.size","title":"size","text":"size: Size\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.uri","title":"uri","text":"uri: Union[AnyUrl, Path] = Field(union_mode='left_to_right')\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.from_pil","title":"from_pil","text":"from_pil(image: Image, dpi: int) -> Self\n Construct ImageRef from a PIL Image.
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.validate_mimetype","title":"validate_mimetype","text":"validate_mimetype(v)\n validate_mimetype.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass","title":"PictureClassificationClass","text":" Bases: BaseModel
PictureClassificationClass.
Attributes:
class_name (str) \u2013 confidence (float) \u2013 class_name: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass.confidence","title":"confidence","text":"confidence: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData","title":"PictureClassificationData","text":" Bases: BaseAnnotation
PictureClassificationData.
Attributes:
kind (Literal['classification']) \u2013 predicted_classes (List[PictureClassificationClass]) \u2013 provenance (str) \u2013 kind: Literal['classification'] = 'classification'\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.predicted_classes","title":"predicted_classes","text":"predicted_classes: List[PictureClassificationClass]\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.provenance","title":"provenance","text":"provenance: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem","title":"RefItem","text":" Bases: BaseModel
RefItem.
Methods:
get_ref \u2013 get_ref.
resolve \u2013 Resolve the path in the document.
Attributes:
cref (str) \u2013 model_config \u2013 cref: str = Field(alias='$ref', pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.model_config","title":"model_config","text":"model_config = ConfigDict(populate_by_name=True)\n"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.get_ref","title":"get_ref","text":"get_ref()\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.resolve","title":"resolve","text":"resolve(doc: DoclingDocument)\n Resolve the path in the document.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox","title":"BoundingBox","text":" Bases: BaseModel
BoundingBox.
Methods:
area \u2013 area.
as_tuple \u2013 as_tuple.
enclosing_bbox \u2013 Create a bounding box that covers all of the given boxes.
expand_by_scale \u2013 Expand the bounding box by the given x and y scale factors.
from_tuple \u2013 from_tuple.
intersection_area_with \u2013 Calculate the intersection area with another bounding box.
intersection_over_self \u2013 intersection_over_self.
intersection_over_union \u2013 intersection_over_union.
is_above \u2013 is_above.
is_horizontally_connected \u2013 is_horizontally_connected.
is_left_of \u2013 is_left_of.
is_strictly_above \u2013 is_strictly_above.
is_strictly_left_of \u2013 is_strictly_left_of.
normalized \u2013 normalized.
overlaps \u2013 overlaps.
overlaps_horizontally \u2013 Check if two bounding boxes overlap horizontally.
overlaps_vertically \u2013 Check if two bounding boxes overlap vertically.
overlaps_vertically_with_iou \u2013 Check if two bounding boxes overlap vertically with an IoU above the given threshold.
resize_by_scale \u2013 resize_by_scale.
scale_to_size \u2013 scale_to_size.
scaled \u2013 scaled.
to_bottom_left_origin \u2013 to_bottom_left_origin.
to_top_left_origin \u2013 to_top_left_origin.
union_area_with \u2013 Calculates the union area with another bounding box.
x_overlap_with \u2013 Calculates the horizontal overlap with another bounding box.
x_union_with \u2013 Calculates the horizontal union dimension with another bounding box.
y_overlap_with \u2013 Calculates the vertical overlap with another bounding box, respecting coordinate origin.
y_union_with \u2013 Calculates the vertical union dimension with another bounding box, respecting coordinate origin.
Attributes:
b (float) \u2013 coord_origin (CoordOrigin) \u2013 height \u2013 height.
l (float) \u2013 r (float) \u2013 t (float) \u2013 width \u2013 width.
b: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.coord_origin","title":"coord_origin","text":"coord_origin: CoordOrigin = TOPLEFT\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.height","title":"height","text":"height\n height.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.l","title":"l","text":"l: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.r","title":"r","text":"r: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.t","title":"t","text":"t: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.width","title":"width","text":"width\n width.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.area","title":"area","text":"area() -> float\n area.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.as_tuple","title":"as_tuple","text":"as_tuple() -> Tuple[float, float, float, float]\n as_tuple.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.enclosing_bbox","title":"enclosing_bbox","text":"enclosing_bbox(boxes: List[BoundingBox]) -> BoundingBox\n Create a bounding box that covers all of the given boxes.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.expand_by_scale","title":"expand_by_scale","text":"expand_by_scale(x_scale: float, y_scale: float) -> BoundingBox\n expand_to_size.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.from_tuple","title":"from_tuple","text":"from_tuple(coord: Tuple[float, ...], origin: CoordOrigin)\n from_tuple.
Parameters:
coord (Tuple[float, ...]) \u2013 Tuple[float:
...] \u2013 origin (CoordOrigin) \u2013 CoordOrigin:
intersection_area_with(other: BoundingBox) -> float\n Calculate the intersection area with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_over_self","title":"intersection_over_self","text":"intersection_over_self(other: BoundingBox, eps: float = 1e-06) -> float\n intersection_over_self.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_over_union","title":"intersection_over_union","text":"intersection_over_union(other: BoundingBox, eps: float = 1e-06) -> float\n intersection_over_union.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_above","title":"is_above","text":"is_above(other: BoundingBox) -> bool\n is_above.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_horizontally_connected","title":"is_horizontally_connected","text":"is_horizontally_connected(elem_i: BoundingBox, elem_j: BoundingBox) -> bool\n is_horizontally_connected.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_left_of","title":"is_left_of","text":"is_left_of(other: BoundingBox) -> bool\n is_left_of.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_strictly_above","title":"is_strictly_above","text":"is_strictly_above(other: BoundingBox, eps: float = 0.001) -> bool\n is_strictly_above.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_strictly_left_of","title":"is_strictly_left_of","text":"is_strictly_left_of(other: BoundingBox, eps: float = 0.001) -> bool\n is_strictly_left_of.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.normalized","title":"normalized","text":"normalized(page_size: Size)\n normalized.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps","title":"overlaps","text":"overlaps(other: BoundingBox) -> bool\n overlaps.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_horizontally","title":"overlaps_horizontally","text":"overlaps_horizontally(other: BoundingBox) -> bool\n Check if two bounding boxes overlap horizontally.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_vertically","title":"overlaps_vertically","text":"overlaps_vertically(other: BoundingBox) -> bool\n Check if two bounding boxes overlap vertically.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_vertically_with_iou","title":"overlaps_vertically_with_iou","text":"overlaps_vertically_with_iou(other: BoundingBox, iou: float) -> bool\n overlaps_y_with_iou.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.resize_by_scale","title":"resize_by_scale","text":"resize_by_scale(x_scale: float, y_scale: float)\n resize_by_scale.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.scale_to_size","title":"scale_to_size","text":"scale_to_size(old_size: Size, new_size: Size)\n scale_to_size.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.scaled","title":"scaled","text":"scaled(scale: float)\n scaled.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.to_bottom_left_origin","title":"to_bottom_left_origin","text":"to_bottom_left_origin(page_height: float) -> BoundingBox\n to_bottom_left_origin.
Parameters:
page_height (float) \u2013 to_top_left_origin(page_height: float) -> BoundingBox\n to_top_left_origin.
Parameters:
page_height (float) \u2013 union_area_with(other: BoundingBox) -> float\n Calculates the union area with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.x_overlap_with","title":"x_overlap_with","text":"x_overlap_with(other: BoundingBox) -> float\n Calculates the horizontal overlap with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.x_union_with","title":"x_union_with","text":"x_union_with(other: BoundingBox) -> float\n Calculates the horizontal union dimension with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.y_overlap_with","title":"y_overlap_with","text":"y_overlap_with(other: BoundingBox) -> float\n Calculates the vertical overlap with another bounding box, respecting coordinate origin.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.y_union_with","title":"y_union_with","text":"y_union_with(other: BoundingBox) -> float\n Calculates the vertical union dimension with another bounding box, respecting coordinate origin.
"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin","title":"CoordOrigin","text":" Bases: str, Enum
CoordOrigin.
Attributes:
BOTTOMLEFT \u2013 TOPLEFT \u2013 BOTTOMLEFT = 'BOTTOMLEFT'\n"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin.TOPLEFT","title":"TOPLEFT","text":"TOPLEFT = 'TOPLEFT'\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode","title":"ImageRefMode","text":" Bases: str, Enum
ImageRefMode.
Attributes:
EMBEDDED \u2013 PLACEHOLDER \u2013 REFERENCED \u2013 EMBEDDED = 'embedded'\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.PLACEHOLDER","title":"PLACEHOLDER","text":"PLACEHOLDER = 'placeholder'\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.REFERENCED","title":"REFERENCED","text":"REFERENCED = 'referenced'\n"},{"location":"reference/docling_document/#docling_core.types.doc.Size","title":"Size","text":" Bases: BaseModel
Size.
Methods:
as_tuple \u2013 as_tuple.
Attributes:
height (float) \u2013 width (float) \u2013 height: float = 0.0\n"},{"location":"reference/docling_document/#docling_core.types.doc.Size.width","title":"width","text":"width: float = 0.0\n"},{"location":"reference/docling_document/#docling_core.types.doc.Size.as_tuple","title":"as_tuple","text":"as_tuple()\n as_tuple.
"},{"location":"reference/document_converter/","title":"Document converter","text":"This is an automatic generated API reference of the main components of Docling.
"},{"location":"reference/document_converter/#docling.document_converter","title":"document_converter","text":"Classes:
DocumentConverter \u2013 ConversionResult \u2013 ConversionStatus \u2013 FormatOption \u2013 InputFormat \u2013 A document format supported by document backend parsers.
PdfFormatOption \u2013 ImageFormatOption \u2013 StandardPdfPipeline \u2013 High-performance PDF pipeline with multi-threaded stages.
WordFormatOption \u2013 PowerpointFormatOption \u2013 MarkdownFormatOption \u2013 AsciiDocFormatOption \u2013 HTMLFormatOption \u2013 SimplePipeline \u2013 SimpleModelPipeline.
DocumentConverter(allowed_formats: Optional[list[InputFormat]] = None, format_options: Optional[dict[InputFormat, FormatOption]] = None)\n Methods:
convert \u2013 convert_all \u2013 convert_string \u2013 initialize_pipeline \u2013 Initialize the conversion pipeline for the selected format.
Attributes:
allowed_formats \u2013 format_to_options (dict[InputFormat, FormatOption]) \u2013 initialized_pipelines (dict[tuple[Type[BasePipeline], str], BasePipeline]) \u2013 instance-attribute","text":"allowed_formats = allowed_formats if allowed_formats is not None else list(InputFormat)\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.format_to_options","title":"format_to_options instance-attribute","text":"format_to_options: dict[InputFormat, FormatOption] = {format: (_get_default_option(format=format) if (custom_option := (get(format))) is None else custom_option) for format in (allowed_formats)}\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.initialized_pipelines","title":"initialized_pipelines instance-attribute","text":"initialized_pipelines: dict[tuple[Type[BasePipeline], str], BasePipeline] = {}\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert","title":"convert","text":"convert(source: Union[Path, str, DocumentStream], headers: Optional[dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> ConversionResult\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert_all","title":"convert_all","text":"convert_all(source: Iterable[Union[Path, str, DocumentStream]], headers: Optional[dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> Iterator[ConversionResult]\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert_string","title":"convert_string","text":"convert_string(content: str, format: InputFormat, name: Optional[str] = None) -> ConversionResult\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.initialize_pipeline","title":"initialize_pipeline","text":"initialize_pipeline(format: InputFormat)\n Initialize the conversion pipeline for the selected format.
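A hedged sketch of the conversion entry points; the file names and HTML string are placeholders:

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter()

# single document, restricted to the first ten pages
result = converter.convert("report.pdf", page_range=(1, 10))
print(result.status)
print(result.document.export_to_markdown())

# lazy batch conversion; per-document errors are reported instead of raised
for res in converter.convert_all(["report.pdf", "notes.docx"], raises_on_error=False):
    print(res.status, len(res.errors))

# convert in-memory content of a known format
res = converter.convert_string("<h1>Hello</h1>", format=InputFormat.HTML, name="hello.html")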
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult","title":"ConversionResult","text":" Bases: ConversionAssets
Methods:
load \u2013 Load a ConversionAssets.
save \u2013 Serialize the full ConversionAssets to JSON.
Attributes:
assembled (AssembledUnit) \u2013 confidence (ConfidenceReport) \u2013 document (DoclingDocument) \u2013 errors (list[ErrorItem]) \u2013 input (InputDocument) \u2013 legacy_document \u2013 pages (list[Page]) \u2013 status (ConversionStatus) \u2013 timestamp (Optional[str]) \u2013 timings (dict[str, ProfilingItem]) \u2013 version (DoclingVersion) \u2013 class-attribute instance-attribute","text":"assembled: AssembledUnit = AssembledUnit()\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.confidence","title":"confidence class-attribute instance-attribute","text":"confidence: ConfidenceReport = Field(default_factory=ConfidenceReport)\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.document","title":"document class-attribute instance-attribute","text":"document: DoclingDocument = _EMPTY_DOCLING_DOC\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.errors","title":"errors class-attribute instance-attribute","text":"errors: list[ErrorItem] = []\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.input","title":"input instance-attribute","text":"input: InputDocument\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.legacy_document","title":"legacy_document property","text":"legacy_document\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.pages","title":"pages class-attribute instance-attribute","text":"pages: list[Page] = []\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.status","title":"status class-attribute instance-attribute","text":"status: ConversionStatus = PENDING\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.timestamp","title":"timestamp class-attribute instance-attribute","text":"timestamp: Optional[str] = None\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.timings","title":"timings class-attribute instance-attribute","text":"timings: dict[str, ProfilingItem] = {}\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.version","title":"version class-attribute instance-attribute","text":"version: DoclingVersion = DoclingVersion()\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.load","title":"load classmethod","text":"load(filename: Union[str, Path]) -> ConversionAssets\n Load a ConversionAssets.
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.save","title":"save","text":"save(*, filename: Union[str, Path], indent: Optional[int] = 2)\n Serialize the full ConversionAssets to JSON.
"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus","title":"ConversionStatus","text":" Bases: str, Enum
Attributes:
FAILURE \u2013 PARTIAL_SUCCESS \u2013 PENDING \u2013 SKIPPED \u2013 STARTED \u2013 SUCCESS \u2013 class-attribute instance-attribute","text":"FAILURE = 'failure'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.PARTIAL_SUCCESS","title":"PARTIAL_SUCCESS class-attribute instance-attribute","text":"PARTIAL_SUCCESS = 'partial_success'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.PENDING","title":"PENDING class-attribute instance-attribute","text":"PENDING = 'pending'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.SKIPPED","title":"SKIPPED class-attribute instance-attribute","text":"SKIPPED = 'skipped'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.STARTED","title":"STARTED class-attribute instance-attribute","text":"STARTED = 'started'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.SUCCESS","title":"SUCCESS class-attribute instance-attribute","text":"SUCCESS = 'success'\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption","title":"FormatOption","text":" Bases: BaseFormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type[BasePipeline]) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 instance-attribute","text":"backend: Type[AbstractDocumentBackend]\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.pipeline_cls","title":"pipeline_cls instance-attribute","text":"pipeline_cls: Type[BasePipeline]\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat","title":"InputFormat","text":" Bases: str, Enum
A document format supported by document backend parsers.
Attributes:
ASCIIDOC \u2013 AUDIO \u2013 CSV \u2013 DOCX \u2013 HTML \u2013 IMAGE \u2013 JSON_DOCLING \u2013 MD \u2013 METS_GBS \u2013 PDF \u2013 PPTX \u2013 VTT \u2013 XLSX \u2013 XML_JATS \u2013 XML_USPTO \u2013 class-attribute instance-attribute","text":"ASCIIDOC = 'asciidoc'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.AUDIO","title":"AUDIO class-attribute instance-attribute","text":"AUDIO = 'audio'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.CSV","title":"CSV class-attribute instance-attribute","text":"CSV = 'csv'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.DOCX","title":"DOCX class-attribute instance-attribute","text":"DOCX = 'docx'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.HTML","title":"HTML class-attribute instance-attribute","text":"HTML = 'html'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.IMAGE","title":"IMAGE class-attribute instance-attribute","text":"IMAGE = 'image'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.JSON_DOCLING","title":"JSON_DOCLING class-attribute instance-attribute","text":"JSON_DOCLING = 'json_docling'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.MD","title":"MD class-attribute instance-attribute","text":"MD = 'md'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.METS_GBS","title":"METS_GBS class-attribute instance-attribute","text":"METS_GBS = 'mets_gbs'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.PDF","title":"PDF class-attribute instance-attribute","text":"PDF = 'pdf'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.PPTX","title":"PPTX class-attribute instance-attribute","text":"PPTX = 'pptx'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.VTT","title":"VTT class-attribute instance-attribute","text":"VTT = 'vtt'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XLSX","title":"XLSX class-attribute instance-attribute","text":"XLSX = 'xlsx'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XML_JATS","title":"XML_JATS class-attribute instance-attribute","text":"XML_JATS = 'xml_jats'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XML_USPTO","title":"XML_USPTO class-attribute instance-attribute","text":"XML_USPTO = 'xml_uspto'\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption","title":"PdfFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[PdfBackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[PdfBackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = StandardPdfPipeline\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption","title":"ImageFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = ImageDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = StandardPdfPipeline\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline","title":"StandardPdfPipeline","text":"StandardPdfPipeline(pipeline_options: ThreadedPdfPipelineOptions)\n Bases: ConvertPipeline
High-performance PDF pipeline with multi-threaded stages.
Methods:
execute \u2013 get_default_options \u2013 is_backend_supported \u2013 Attributes:
artifacts_path (Optional[Path]) \u2013 build_pipe (List[Callable]) \u2013 enrichment_pipe \u2013 keep_images \u2013 pipeline_options (ThreadedPdfPipelineOptions) \u2013 instance-attribute","text":"artifacts_path: Optional[Path] = None\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.build_pipe","title":"build_pipe instance-attribute","text":"build_pipe: List[Callable] = []\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.enrichment_pipe","title":"enrichment_pipe instance-attribute","text":"enrichment_pipe = [DocumentPictureClassifier(enabled=do_picture_classification, artifacts_path=artifacts_path, options=DocumentPictureClassifierOptions(), accelerator_options=accelerator_options), picture_description_model]\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.keep_images","title":"keep_images instance-attribute","text":"keep_images = False\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.pipeline_options","title":"pipeline_options instance-attribute","text":"pipeline_options: ThreadedPdfPipelineOptions = pipeline_options\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.execute","title":"execute","text":"execute(in_doc: InputDocument, raises_on_error: bool) -> ConversionResult\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.get_default_options","title":"get_default_options classmethod","text":"get_default_options() -> ThreadedPdfPipelineOptions\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.is_backend_supported","title":"is_backend_supported classmethod","text":"is_backend_supported(backend: AbstractDocumentBackend) -> bool\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption","title":"WordFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = MsWordDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption","title":"PowerpointFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = MsPowerpointDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption","title":"MarkdownFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[MarkdownBackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = MarkdownDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[MarkdownBackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption","title":"AsciiDocFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = AsciiDocBackend\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption","title":"HTMLFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[HTMLBackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = HTMLDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[HTMLBackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline","title":"SimplePipeline","text":"SimplePipeline(pipeline_options: ConvertPipelineOptions)\n Bases: ConvertPipeline
SimpleModelPipeline.
This class is currently used for formats and backends that produce DoclingDocument output directly.
Methods:
execute \u2013 get_default_options \u2013 is_backend_supported \u2013 Attributes:
artifacts_path (Optional[Path]) \u2013 build_pipe (List[Callable]) \u2013 enrichment_pipe \u2013 keep_images \u2013 pipeline_options (ConvertPipelineOptions) \u2013 instance-attribute","text":"artifacts_path: Optional[Path] = None\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.build_pipe","title":"build_pipe instance-attribute","text":"build_pipe: List[Callable] = []\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.enrichment_pipe","title":"enrichment_pipe instance-attribute","text":"enrichment_pipe = [DocumentPictureClassifier(enabled=do_picture_classification, artifacts_path=artifacts_path, options=DocumentPictureClassifierOptions(), accelerator_options=accelerator_options), picture_description_model]\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.keep_images","title":"keep_images instance-attribute","text":"keep_images = False\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.pipeline_options","title":"pipeline_options instance-attribute","text":"pipeline_options: ConvertPipelineOptions\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.execute","title":"execute","text":"execute(in_doc: InputDocument, raises_on_error: bool) -> ConversionResult\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.get_default_options","title":"get_default_options classmethod","text":"get_default_options() -> ConvertPipelineOptions\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.is_backend_supported","title":"is_backend_supported classmethod","text":"is_backend_supported(backend: AbstractDocumentBackend)\n"},{"location":"reference/pipeline_options/","title":"Pipeline options","text":"Pipeline options allow to customize the execution of the models during the conversion pipeline. This includes options for the OCR engines, the table model as well as enrichment options which can be enabled with do_xyz = True.
This is an automatically generated API reference of all the pipeline options available in Docling.
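For example, the do_xyz enrichment switches mentioned above are plain boolean fields on the options object. A short sketch, with an illustrative input file, wiring the options into the converter:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

opts = PdfPipelineOptions()
opts.do_code_enrichment = True         # enable code-block understanding
opts.do_formula_enrichment = True      # enable formula understanding
opts.do_picture_classification = True  # classify pictures found in the document

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("report.pdf")  # illustrative input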
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options","title":"pipeline_options","text":"Classes:
AsrPipelineOptions \u2013 BaseLayoutOptions \u2013 Base options for layout models.
BaseOptions \u2013 Base class for options.
BaseTableStructureOptions \u2013 Base options for table structure models.
ConvertPipelineOptions \u2013 Base convert pipeline options.
EasyOcrOptions \u2013 Options for the EasyOCR engine.
LayoutOptions \u2013 Options for layout processing.
OcrAutoOptions \u2013 Options for picking the OCR engine automatically.
OcrEngine \u2013 Enum of valid OCR engines.
OcrMacOptions \u2013 Options for the Mac OCR engine.
OcrOptions \u2013 OCR options.
PaginatedPipelineOptions \u2013 PdfBackend \u2013 Enum of valid PDF backends.
PdfPipelineOptions \u2013 Options for the PDF pipeline.
PictureDescriptionApiOptions \u2013 PictureDescriptionBaseOptions \u2013 PictureDescriptionVlmOptions \u2013 PipelineOptions \u2013 Base pipeline options.
ProcessingPipeline \u2013 RapidOcrOptions \u2013 Options for the RapidOCR engine.
TableFormerMode \u2013 Modes for the TableFormer model.
TableStructureOptions \u2013 Options for the table structure.
TesseractCliOcrOptions \u2013 Options for the TesseractCli engine.
TesseractOcrOptions \u2013 Options for the Tesseract engine.
ThreadedPdfPipelineOptions \u2013 Pipeline options for the threaded PDF pipeline with batching and backpressure control.
VlmExtractionPipelineOptions \u2013 Options for the extraction pipeline.
VlmPipelineOptions \u2013 Attributes:
granite_picture_description \u2013 smolvlm_picture_description \u2013 module-attribute","text":"granite_picture_description = PictureDescriptionVlmOptions(repo_id='ibm-granite/granite-vision-3.3-2b', prompt='What is shown in this image?')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.smolvlm_picture_description","title":"smolvlm_picture_description module-attribute","text":"smolvlm_picture_description = PictureDescriptionVlmOptions(repo_id='HuggingFaceTB/SmolVLM-256M-Instruct')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions","title":"AsrPipelineOptions","text":" Bases: PipelineOptions
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 asr_options (Union[InlineAsrOptions]) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 kind (str) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.asr_options","title":"asr_options class-attribute instance-attribute","text":"asr_options: Union[InlineAsrOptions] = WHISPER_TINY\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseLayoutOptions","title":"BaseLayoutOptions","text":" Bases: BaseOptions
Base options for layout models.
Attributes:
keep_empty_clusters (bool) \u2013 kind (str) \u2013 skip_cell_assignment (bool) \u2013 class-attribute instance-attribute","text":"keep_empty_clusters: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseLayoutOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseLayoutOptions.skip_cell_assignment","title":"skip_cell_assignment class-attribute instance-attribute","text":"skip_cell_assignment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseOptions","title":"BaseOptions","text":" Bases: BaseModel
Base class for options.
Attributes:
kind (str) \u2013 class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseTableStructureOptions","title":"BaseTableStructureOptions","text":" Bases: BaseOptions
Base options for table structure models.
Attributes:
kind (str) \u2013 class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions","title":"ConvertPipelineOptions","text":" Bases: PipelineOptions
Base convert pipeline options.
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 kind (str) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions","title":"EasyOcrOptions","text":" Bases: OcrOptions
Options for the EasyOCR engine.
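A short sketch of selecting EasyOCR for the PDF pipeline, using only the fields listed below; the language codes are EasyOCR's own:
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions

ocr = EasyOcrOptions(
    lang=["en", "de"],          # EasyOCR language codes
    force_full_page_ocr=False,  # keep the documented default (no full-page OCR)
    confidence_threshold=0.5,
)
opts = PdfPipelineOptions(do_ocr=True, ocr_options=ocr)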
Attributes:
bitmap_area_threshold (float) \u2013 confidence_threshold (float) \u2013 download_enabled (bool) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['easyocr']) \u2013 lang (List[str]) \u2013 model_config \u2013 model_storage_directory (Optional[str]) \u2013 recog_network (Optional[str]) \u2013 suppress_mps_warnings (bool) \u2013 use_gpu (Optional[bool]) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.confidence_threshold","title":"confidence_threshold class-attribute instance-attribute","text":"confidence_threshold: float = 0.5\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.download_enabled","title":"download_enabled class-attribute instance-attribute","text":"download_enabled: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.kind","title":"kind class-attribute","text":"kind: Literal['easyocr'] = 'easyocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['fr', 'de', 'es', 'en']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid', protected_namespaces=())\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.model_storage_directory","title":"model_storage_directory class-attribute instance-attribute","text":"model_storage_directory: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.recog_network","title":"recog_network class-attribute instance-attribute","text":"recog_network: Optional[str] = 'standard'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.suppress_mps_warnings","title":"suppress_mps_warnings class-attribute instance-attribute","text":"suppress_mps_warnings: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.use_gpu","title":"use_gpu class-attribute instance-attribute","text":"use_gpu: Optional[bool] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions","title":"LayoutOptions","text":" Bases: BaseLayoutOptions
Options for layout processing.
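The layout stage can be tuned with the fields below; a minimal sketch that flips one flag from its documented default:
from docling.datamodel.pipeline_options import LayoutOptions, PdfPipelineOptions

layout = LayoutOptions(keep_empty_clusters=True)  # documented default is False
opts = PdfPipelineOptions(layout_options=layout)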
Attributes:
create_orphan_clusters (bool) \u2013 keep_empty_clusters (bool) \u2013 kind (str) \u2013 model_spec (LayoutModelConfig) \u2013 skip_cell_assignment (bool) \u2013 class-attribute instance-attribute","text":"create_orphan_clusters: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.keep_empty_clusters","title":"keep_empty_clusters class-attribute instance-attribute","text":"keep_empty_clusters: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.kind","title":"kind class-attribute","text":"kind: str = 'docling_layout_default'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.model_spec","title":"model_spec class-attribute instance-attribute","text":"model_spec: LayoutModelConfig = DOCLING_LAYOUT_HERON\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.skip_cell_assignment","title":"skip_cell_assignment class-attribute instance-attribute","text":"skip_cell_assignment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrAutoOptions","title":"OcrAutoOptions","text":" Bases: OcrOptions
Options for picking the OCR engine automatically.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['auto']) \u2013 lang (List[str]) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrAutoOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrAutoOptions.kind","title":"kind class-attribute","text":"kind: Literal['auto'] = 'auto'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrAutoOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = []\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine","title":"OcrEngine","text":" Bases: str, Enum
Enum of valid OCR engines.
Attributes:
AUTO \u2013 EASYOCR \u2013 OCRMAC \u2013 RAPIDOCR \u2013 TESSERACT \u2013 TESSERACT_CLI \u2013 class-attribute instance-attribute","text":"AUTO = 'auto'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.EASYOCR","title":"EASYOCR class-attribute instance-attribute","text":"EASYOCR = 'easyocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.OCRMAC","title":"OCRMAC class-attribute instance-attribute","text":"OCRMAC = 'ocrmac'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.RAPIDOCR","title":"RAPIDOCR class-attribute instance-attribute","text":"RAPIDOCR = 'rapidocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.TESSERACT","title":"TESSERACT class-attribute instance-attribute","text":"TESSERACT = 'tesseract'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.TESSERACT_CLI","title":"TESSERACT_CLI class-attribute instance-attribute","text":"TESSERACT_CLI = 'tesseract_cli'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions","title":"OcrMacOptions","text":" Bases: OcrOptions
Options for the Mac OCR engine.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 framework (str) \u2013 kind (Literal['ocrmac']) \u2013 lang (List[str]) \u2013 model_config \u2013 recognition (str) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.framework","title":"framework class-attribute instance-attribute","text":"framework: str = 'vision'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.kind","title":"kind class-attribute","text":"kind: Literal['ocrmac'] = 'ocrmac'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['fr-FR', 'de-DE', 'es-ES', 'en-US']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.recognition","title":"recognition class-attribute instance-attribute","text":"recognition: str = 'accurate'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions","title":"OcrOptions","text":" Bases: BaseOptions
OCR options.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 kind (str) \u2013 lang (List[str]) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.lang","title":"lang instance-attribute","text":"lang: List[str]\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions","title":"PaginatedPipelineOptions","text":" Bases: ConvertPipelineOptions
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 generate_page_images (bool) \u2013 generate_picture_images (bool) \u2013 images_scale (float) \u2013 kind (str) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.generate_page_images","title":"generate_page_images class-attribute instance-attribute","text":"generate_page_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute instance-attribute","text":"generate_picture_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.images_scale","title":"images_scale class-attribute instance-attribute","text":"images_scale: float = 1.0\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend","title":"PdfBackend","text":" Bases: str, Enum
Enum of valid PDF backends.
Attributes:
DLPARSE_V1 \u2013 DLPARSE_V2 \u2013 DLPARSE_V4 \u2013 PYPDFIUM2 \u2013 class-attribute instance-attribute","text":"DLPARSE_V1 = 'dlparse_v1'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V2","title":"DLPARSE_V2 class-attribute instance-attribute","text":"DLPARSE_V2 = 'dlparse_v2'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V4","title":"DLPARSE_V4 class-attribute instance-attribute","text":"DLPARSE_V4 = 'dlparse_v4'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.PYPDFIUM2","title":"PYPDFIUM2 class-attribute instance-attribute","text":"PYPDFIUM2 = 'pypdfium2'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions","title":"PdfPipelineOptions","text":" Bases: PaginatedPipelineOptions
Options for the PDF pipeline.
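A sketch of a typical configuration touching a few of the fields listed below; values are illustrative:
from docling.datamodel.pipeline_options import PdfPipelineOptions

opts = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    generate_page_images=True,  # keep rendered page images on the result
    images_scale=2.0,           # render images at 2x scale
    document_timeout=300.0,     # per-document timeout (assumed to be seconds)
)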
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 batch_polling_interval_seconds (float) \u2013 do_code_enrichment (bool) \u2013 do_formula_enrichment (bool) \u2013 do_ocr (bool) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 do_table_structure (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 force_backend_text (bool) \u2013 generate_page_images (bool) \u2013 generate_parsed_pages (bool) \u2013 generate_picture_images (bool) \u2013 generate_table_images (bool) \u2013 images_scale (float) \u2013 kind (str) \u2013 layout_batch_size (int) \u2013 layout_options (BaseLayoutOptions) \u2013 ocr_batch_size (int) \u2013 ocr_options (OcrOptions) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 queue_max_size (int) \u2013 table_batch_size (int) \u2013 table_structure_options (BaseTableStructureOptions) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.batch_polling_interval_seconds","title":"batch_polling_interval_seconds class-attribute instance-attribute","text":"batch_polling_interval_seconds: float = 0.5\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_code_enrichment","title":"do_code_enrichment class-attribute instance-attribute","text":"do_code_enrichment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_formula_enrichment","title":"do_formula_enrichment class-attribute instance-attribute","text":"do_formula_enrichment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_ocr","title":"do_ocr class-attribute instance-attribute","text":"do_ocr: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_table_structure","title":"do_table_structure class-attribute instance-attribute","text":"do_table_structure: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = 
None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.force_backend_text","title":"force_backend_text class-attribute instance-attribute","text":"force_backend_text: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_page_images","title":"generate_page_images class-attribute instance-attribute","text":"generate_page_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_parsed_pages","title":"generate_parsed_pages class-attribute instance-attribute","text":"generate_parsed_pages: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute instance-attribute","text":"generate_picture_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_table_images","title":"generate_table_images class-attribute instance-attribute","text":"generate_table_images: bool = Field(default=False, deprecated='Field `generate_table_images` is deprecated. To obtain table images, set `PdfPipelineOptions.generate_page_images = True` before conversion and then use the `TableItem.get_image` function.')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.images_scale","title":"images_scale class-attribute instance-attribute","text":"images_scale: float = 1.0\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.layout_batch_size","title":"layout_batch_size class-attribute instance-attribute","text":"layout_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.layout_options","title":"layout_options class-attribute instance-attribute","text":"layout_options: BaseLayoutOptions = LayoutOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.ocr_batch_size","title":"ocr_batch_size class-attribute instance-attribute","text":"ocr_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.ocr_options","title":"ocr_options class-attribute instance-attribute","text":"ocr_options: OcrOptions = OcrAutoOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.queue_max_size","title":"queue_max_size class-attribute instance-attribute","text":"queue_max_size: int = 100\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.table_batch_size","title":"table_batch_size class-attribute 
instance-attribute","text":"table_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.table_structure_options","title":"table_structure_options class-attribute instance-attribute","text":"table_structure_options: BaseTableStructureOptions = TableStructureOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions","title":"PictureDescriptionApiOptions","text":" Bases: PictureDescriptionBaseOptions
Attributes:
batch_size (int) \u2013 concurrency (int) \u2013 headers (Dict[str, str]) \u2013 kind (Literal['api']) \u2013 params (Dict[str, Any]) \u2013 picture_area_threshold (float) \u2013 prompt (str) \u2013 provenance (str) \u2013 scale (float) \u2013 timeout (float) \u2013 url (AnyUrl) \u2013 class-attribute instance-attribute","text":"batch_size: int = 8\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.concurrency","title":"concurrency class-attribute instance-attribute","text":"concurrency: int = 1\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.headers","title":"headers class-attribute instance-attribute","text":"headers: Dict[str, str] = {}\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.kind","title":"kind class-attribute","text":"kind: Literal['api'] = 'api'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.params","title":"params class-attribute instance-attribute","text":"params: Dict[str, Any] = {}\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.picture_area_threshold","title":"picture_area_threshold class-attribute instance-attribute","text":"picture_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.prompt","title":"prompt class-attribute instance-attribute","text":"prompt: str = 'Describe this image in a few sentences.'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.provenance","title":"provenance class-attribute instance-attribute","text":"provenance: str = ''\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.scale","title":"scale class-attribute instance-attribute","text":"scale: float = 2\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.timeout","title":"timeout class-attribute instance-attribute","text":"timeout: float = 20\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.url","title":"url class-attribute instance-attribute","text":"url: AnyUrl = AnyUrl('http://localhost:8000/v1/chat/completions')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions","title":"PictureDescriptionBaseOptions","text":" Bases: BaseOptions
Attributes:
batch_size (int) \u2013 kind (str) \u2013 picture_area_threshold (float) \u2013 scale (float) \u2013 class-attribute instance-attribute","text":"batch_size: int = 8\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.picture_area_threshold","title":"picture_area_threshold class-attribute instance-attribute","text":"picture_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.scale","title":"scale class-attribute instance-attribute","text":"scale: float = 2\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions","title":"PictureDescriptionVlmOptions","text":" Bases: PictureDescriptionBaseOptions
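A sketch enabling picture descriptions with a local VLM; the repo_id is the one used by the smolvlm_picture_description default above, and the prompt matches the documented default:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmOptions,
)

opts = PdfPipelineOptions(
    do_picture_description=True,
    picture_description_options=PictureDescriptionVlmOptions(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
        prompt="Describe this image in a few sentences.",
    ),
)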
Attributes:
batch_size (int) \u2013 generation_config (Dict[str, Any]) \u2013 kind (Literal['vlm']) \u2013 picture_area_threshold (float) \u2013 prompt (str) \u2013 repo_cache_folder (str) \u2013 repo_id (str) \u2013 scale (float) \u2013 class-attribute instance-attribute","text":"batch_size: int = 8\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.generation_config","title":"generation_config class-attribute instance-attribute","text":"generation_config: Dict[str, Any] = dict(max_new_tokens=200, do_sample=False)\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.kind","title":"kind class-attribute","text":"kind: Literal['vlm'] = 'vlm'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.picture_area_threshold","title":"picture_area_threshold class-attribute instance-attribute","text":"picture_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.prompt","title":"prompt class-attribute instance-attribute","text":"prompt: str = 'Describe this image in a few sentences.'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.repo_cache_folder","title":"repo_cache_folder property","text":"repo_cache_folder: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.repo_id","title":"repo_id instance-attribute","text":"repo_id: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.scale","title":"scale class-attribute instance-attribute","text":"scale: float = 2\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions","title":"PipelineOptions","text":" Bases: BaseOptions
Base pipeline options.
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 kind (str) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline","title":"ProcessingPipeline","text":" Bases: str, Enum
Attributes:
ASR \u2013 LEGACY \u2013 STANDARD \u2013 VLM \u2013 class-attribute instance-attribute","text":"ASR = 'asr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.LEGACY","title":"LEGACY class-attribute instance-attribute","text":"LEGACY = 'legacy'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.STANDARD","title":"STANDARD class-attribute instance-attribute","text":"STANDARD = 'standard'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.VLM","title":"VLM class-attribute instance-attribute","text":"VLM = 'vlm'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions","title":"RapidOcrOptions","text":" Bases: OcrOptions
Options for the RapidOCR engine.
Attributes:
backend (Literal['onnxruntime', 'openvino', 'paddle', 'torch']) \u2013 bitmap_area_threshold (float) \u2013 cls_model_path (Optional[str]) \u2013 det_model_path (Optional[str]) \u2013 font_path (Optional[str]) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['rapidocr']) \u2013 lang (List[str]) \u2013 model_config \u2013 print_verbose (bool) \u2013 rapidocr_params (Dict[str, Any]) \u2013 rec_font_path (Optional[str]) \u2013 rec_keys_path (Optional[str]) \u2013 rec_model_path (Optional[str]) \u2013 text_score (float) \u2013 use_cls (Optional[bool]) \u2013 use_det (Optional[bool]) \u2013 use_rec (Optional[bool]) \u2013 class-attribute instance-attribute","text":"backend: Literal['onnxruntime', 'openvino', 'paddle', 'torch'] = 'onnxruntime'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.bitmap_area_threshold","title":"bitmap_area_threshold class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.cls_model_path","title":"cls_model_path class-attribute instance-attribute","text":"cls_model_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.det_model_path","title":"det_model_path class-attribute instance-attribute","text":"det_model_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.font_path","title":"font_path class-attribute instance-attribute","text":"font_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.kind","title":"kind class-attribute","text":"kind: Literal['rapidocr'] = 'rapidocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['english', 'chinese']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.print_verbose","title":"print_verbose class-attribute instance-attribute","text":"print_verbose: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rapidocr_params","title":"rapidocr_params class-attribute instance-attribute","text":"rapidocr_params: Dict[str, Any] = Field(default_factory=dict)\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_font_path","title":"rec_font_path class-attribute instance-attribute","text":"rec_font_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_keys_path","title":"rec_keys_path class-attribute instance-attribute","text":"rec_keys_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_model_path","title":"rec_model_path class-attribute instance-attribute","text":"rec_model_path: Optional[str] = 
None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.text_score","title":"text_score class-attribute instance-attribute","text":"text_score: float = 0.5\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_cls","title":"use_cls class-attribute instance-attribute","text":"use_cls: Optional[bool] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_det","title":"use_det class-attribute instance-attribute","text":"use_det: Optional[bool] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_rec","title":"use_rec class-attribute instance-attribute","text":"use_rec: Optional[bool] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode","title":"TableFormerMode","text":" Bases: str, Enum
Modes for the TableFormer model.
Attributes:
ACCURATE \u2013 FAST \u2013 class-attribute instance-attribute","text":"ACCURATE = 'accurate'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode.FAST","title":"FAST class-attribute instance-attribute","text":"FAST = 'fast'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions","title":"TableStructureOptions","text":" Bases: BaseTableStructureOptions
Options for the table structure.
Attributes:
do_cell_matching (bool) \u2013 kind (str) \u2013 mode (TableFormerMode) \u2013 class-attribute instance-attribute","text":"do_cell_matching: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions.kind","title":"kind class-attribute","text":"kind: str = 'docling_tableformer'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions.mode","title":"mode class-attribute instance-attribute","text":"mode: TableFormerMode = ACCURATE\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions","title":"TesseractCliOcrOptions","text":" Bases: OcrOptions
Options for the TesseractCli engine.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['tesseract']) \u2013 lang (List[str]) \u2013 model_config \u2013 path (Optional[str]) \u2013 psm (Optional[int]) \u2013 tesseract_cmd (str) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.kind","title":"kind class-attribute","text":"kind: Literal['tesseract'] = 'tesseract'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['fra', 'deu', 'spa', 'eng']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.path","title":"path class-attribute instance-attribute","text":"path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.psm","title":"psm class-attribute instance-attribute","text":"psm: Optional[int] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.tesseract_cmd","title":"tesseract_cmd class-attribute instance-attribute","text":"tesseract_cmd: str = 'tesseract'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions","title":"TesseractOcrOptions","text":" Bases: OcrOptions
Options for the Tesseract engine.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['tesserocr']) \u2013 lang (List[str]) \u2013 model_config \u2013 path (Optional[str]) \u2013 psm (Optional[int]) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.kind","title":"kind class-attribute","text":"kind: Literal['tesserocr'] = 'tesserocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['fra', 'deu', 'spa', 'eng']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.path","title":"path class-attribute instance-attribute","text":"path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.psm","title":"psm class-attribute instance-attribute","text":"psm: Optional[int] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions","title":"ThreadedPdfPipelineOptions","text":" Bases: PdfPipelineOptions
Pipeline options for the threaded PDF pipeline with batching and backpressure control.
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 batch_polling_interval_seconds (float) \u2013 do_code_enrichment (bool) \u2013 do_formula_enrichment (bool) \u2013 do_ocr (bool) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 do_table_structure (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 force_backend_text (bool) \u2013 generate_page_images (bool) \u2013 generate_parsed_pages (bool) \u2013 generate_picture_images (bool) \u2013 generate_table_images (bool) \u2013 images_scale (float) \u2013 kind (str) \u2013 layout_batch_size (int) \u2013 layout_options (BaseLayoutOptions) \u2013 ocr_batch_size (int) \u2013 ocr_options (OcrOptions) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 queue_max_size (int) \u2013 table_batch_size (int) \u2013 table_structure_options (BaseTableStructureOptions) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.batch_polling_interval_seconds","title":"batch_polling_interval_seconds class-attribute instance-attribute","text":"batch_polling_interval_seconds: float = 0.5\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_code_enrichment","title":"do_code_enrichment class-attribute instance-attribute","text":"do_code_enrichment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_formula_enrichment","title":"do_formula_enrichment class-attribute instance-attribute","text":"do_formula_enrichment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_ocr","title":"do_ocr class-attribute instance-attribute","text":"do_ocr: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_table_structure","title":"do_table_structure class-attribute instance-attribute","text":"do_table_structure: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = 
None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.force_backend_text","title":"force_backend_text class-attribute instance-attribute","text":"force_backend_text: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.generate_page_images","title":"generate_page_images class-attribute instance-attribute","text":"generate_page_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.generate_parsed_pages","title":"generate_parsed_pages class-attribute instance-attribute","text":"generate_parsed_pages: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute instance-attribute","text":"generate_picture_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.generate_table_images","title":"generate_table_images class-attribute instance-attribute","text":"generate_table_images: bool = Field(default=False, deprecated='Field `generate_table_images` is deprecated. To obtain table images, set `PdfPipelineOptions.generate_page_images = True` before conversion and then use the `TableItem.get_image` function.')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.images_scale","title":"images_scale class-attribute instance-attribute","text":"images_scale: float = 1.0\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.layout_batch_size","title":"layout_batch_size class-attribute instance-attribute","text":"layout_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.layout_options","title":"layout_options class-attribute instance-attribute","text":"layout_options: BaseLayoutOptions = LayoutOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.ocr_batch_size","title":"ocr_batch_size class-attribute instance-attribute","text":"ocr_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.ocr_options","title":"ocr_options class-attribute instance-attribute","text":"ocr_options: OcrOptions = OcrAutoOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.queue_max_size","title":"queue_max_size class-attribute instance-attribute","text":"queue_max_size: int = 
100\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.table_batch_size","title":"table_batch_size class-attribute instance-attribute","text":"table_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.table_structure_options","title":"table_structure_options class-attribute instance-attribute","text":"table_structure_options: BaseTableStructureOptions = TableStructureOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions","title":"VlmExtractionPipelineOptions","text":" Bases: PipelineOptions
Options for the extraction pipeline.
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 kind (str) \u2013 vlm_options (Union[InlineVlmOptions]) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.vlm_options","title":"vlm_options class-attribute instance-attribute","text":"vlm_options: Union[InlineVlmOptions] = NU_EXTRACT_2B_TRANSFORMERS\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions","title":"VlmPipelineOptions","text":" Bases: PaginatedPipelineOptions
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 force_backend_text (bool) \u2013 generate_page_images (bool) \u2013 generate_picture_images (bool) \u2013 images_scale (float) \u2013 kind (str) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 vlm_options (Union[InlineVlmOptions, ApiVlmOptions]) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.force_backend_text","title":"force_backend_text class-attribute instance-attribute","text":"force_backend_text: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.generate_page_images","title":"generate_page_images class-attribute instance-attribute","text":"generate_page_images: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute instance-attribute","text":"generate_picture_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.images_scale","title":"images_scale class-attribute instance-attribute","text":"images_scale: float = 1.0\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = 
smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.vlm_options","title":"vlm_options class-attribute instance-attribute","text":"vlm_options: Union[InlineVlmOptions, ApiVlmOptions] = GRANITEDOCLING_TRANSFORMERS\n"},{"location":"usage/","title":"Index","text":""},{"location":"usage/#basic-usage","title":"Basic usage","text":""},{"location":"usage/#python","title":"Python","text":"In Docling, working with documents is as simple as:
For example, the snippet below shows conversion with export to Markdown:
from docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\" # file path or URL\nconverter = DocumentConverter()\ndoc = converter.convert(source).document\n\nprint(doc.export_to_markdown()) # output: \"### Docling Technical Report[...]\"\n Docling supports a wide array of file formats and, as outlined in the architecture guide, provides a versatile document model along with a full suite of supported operations.
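Beyond Markdown, other exports can be produced from the same doc object; a minimal sketch, assuming the DoclingDocument helpers export_to_html and export_to_dict provided by recent Docling versions:
html = doc.export_to_html() # HTML serialization (assumed helper)\ndata = doc.export_to_dict() # lossless dictionary representation, e.g. for JSON serialization (assumed helper)\n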
"},{"location":"usage/#cli","title":"CLI","text":"You can additionally use Docling directly from your terminal, for instance:
docling https://arxiv.org/pdf/2206.01062\n The CLI provides various options, such as \ud83e\udd5aGraniteDocling (incl. MLX acceleration) & other VLMs:
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062\n For all available options, run docling --help or check the CLI reference.
Check out the Usage subpages (navigation menu on the left) as well as our featured examples for additional usage workflows, including conversion customization, RAG, framework integrations, chunking, serialization, enrichments, and much more!
"},{"location":"usage/advanced_options/","title":"Advanced options","text":""},{"location":"usage/advanced_options/#model-prefetching-and-offline-usage","title":"Model prefetching and offline usage","text":"By default, models are downloaded automatically upon first usage. If you would prefer to explicitly prefetch them for offline use (e.g. in air-gapped environments) you can do that as follows:
Step 1: Prefetch the models
Use the docling-tools models download utility:
$ docling-tools models download\nDownloading layout model...\nDownloading tableformer model...\nDownloading picture classifier model...\nDownloading code formula model...\nDownloading easyocr models...\nModels downloaded into $HOME/.cache/docling/models.\n Alternatively, models can be programmatically downloaded using docling.utils.model_downloader.download_models().
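For instance, a minimal sketch of the programmatic route (assuming default arguments, which store the models in the Docling cache directory shown above):
from docling.utils.model_downloader import download_models\n\n# Prefetch the default Docling models for offline use\ndownload_models()\n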
You can also use the download-hf-repo command to download arbitrary models from Hugging Face by specifying the repo ID:
$ docling-tools models download-hf-repo ds4sd/SmolDocling-256M-preview\nDownloading ds4sd/SmolDocling-256M-preview model from HuggingFace...\n Step 2: Use the prefetched models
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\nartifacts_path = \"/local/path/to/models\"\n\npipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n Or using the CLI:
docling --artifacts-path=\"/local/path/to/models\" FILE\n Or using the DOCLING_ARTIFACTS_PATH environment variable:
export DOCLING_ARTIFACTS_PATH=\"/local/path/to/models\"\npython my_docling_script.py\n"},{"location":"usage/advanced_options/#using-remote-services","title":"Using remote services","text":"The main purpose of Docling is to run local models that do not share any user data with remote services. However, there are valid use cases for processing parts of the pipeline using remote services, for example invoking OCR engines from cloud vendors or using hosted LLMs.
In Docling we decided to allow such models, but we require the user to explicitly opt in to communicating with external services.
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions(enable_remote_services=True)\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n When the value enable_remote_services=True is not set, the system will raise an exception OperationNotAllowed().
Note: This option is only related to the system sending user data to remote services. Control of pulling data (e.g. model weights) follows the logic described in Model prefetching and offline usage.
"},{"location":"usage/advanced_options/#list-of-remote-model-services","title":"List of remote model services","text":"The options in this list require the explicit enable_remote_services=True when processing the documents.
PictureDescriptionApiOptions: Using vision models via API calls. The example file custom_convert.py contains multiple ways one can adjust the conversion pipeline and features.
"},{"location":"usage/advanced_options/#control-pdf-table-extraction-options","title":"Control PDF table extraction options","text":"You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself. This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions(do_table_structure=True)\npipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between TableFormerMode.FAST (faster but less accurate) and TableFormerMode.ACCURATE (default) to receive better quality with difficult table structures.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode\n\npipeline_options = PdfPipelineOptions(do_table_structure=True)\npipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n"},{"location":"usage/advanced_options/#impose-limits-on-the-document-size","title":"Impose limits on the document size","text":"You can limit the file size and the number of pages that are allowed to be processed per document:
from pathlib import Path\nfrom docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\"\nconverter = DocumentConverter()\nresult = converter.convert(source, max_num_pages=100, max_file_size=20971520)\n"},{"location":"usage/advanced_options/#convert-from-binary-pdf-streams","title":"Convert from binary PDF streams","text":"You can convert PDFs from a binary stream instead of from the filesystem as follows:
from io import BytesIO\nfrom docling.datamodel.base_models import DocumentStream\nfrom docling.document_converter import DocumentConverter\n\nbuf = BytesIO(your_binary_stream)\nsource = DocumentStream(name=\"my_doc.pdf\", stream=buf)\nconverter = DocumentConverter()\nresult = converter.convert(source)\n"},{"location":"usage/advanced_options/#limit-resource-usage","title":"Limit resource usage","text":"You can limit the CPU threads used by Docling by setting the environment variable OMP_NUM_THREADS accordingly. The default setting uses 4 CPU threads.
Docling allows you to enrich the conversion pipeline with additional steps that process specific document components, e.g. code blocks, pictures, etc. The extra steps usually require extra model executions, which may increase the processing time considerably. For this reason most enrichment models are disabled by default.
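For instance, a minimal sketch (assuming the standard PDF pipeline) that switches on several enrichments at once; the individual options are described in the table and sections below:
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n# Enable several enrichment models in a single pipeline configuration\npipeline_options = PdfPipelineOptions()\npipeline_options.do_code_enrichment = True\npipeline_options.do_formula_enrichment = True\npipeline_options.do_picture_classification = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n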
The following table provides an overview of the default enrichment models available in Docling.
Feature Parameter Processed item Description Code understanding do_code_enrichment CodeItem See docs below. Formula understanding do_formula_enrichment TextItem with label FORMULA See docs below. Picture classification do_picture_classification PictureItem See docs below. Picture description do_picture_description PictureItem See docs below."},{"location":"usage/enrichments/#enrichments-details","title":"Enrichments details","text":""},{"location":"usage/enrichments/#code-understanding","title":"Code understanding","text":"The code understanding step enables advanced parsing for code blocks found in the document. This enrichment model also sets the code_language property of the CodeItem.
Model specs: see the CodeFormula model card.
Example command line:
docling --enrich-code FILE\n Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_code_enrichment = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n"},{"location":"usage/enrichments/#formula-understanding","title":"Formula understanding","text":"The formula understanding step analyzes the equations in documents and extracts their LaTeX representation. The HTML export functions in the DoclingDocument leverage the formulas and visualize the result using MathML HTML syntax.
Model specs: see the CodeFormula model card.
Example command line:
docling --enrich-formula FILE\n Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_formula_enrichment = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n"},{"location":"usage/enrichments/#picture-classification","title":"Picture classification","text":"The picture classification step classifies the PictureItem elements in the document with the DocumentFigureClassifier model. This model is specialized to understand the classes of pictures found in documents, e.g. different chart types, flow diagrams, logos, signatures, etc.
Model specs: see the DocumentFigureClassifier model card.
Example command line:
docling --enrich-picture-classes FILE\n Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.generate_picture_images = True\npipeline_options.images_scale = 2\npipeline_options.do_picture_classification = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n"},{"location":"usage/enrichments/#picture-description","title":"Picture description","text":"The picture description step allows you to annotate a picture with a vision model. This is also known as a \"captioning\" task. The Docling pipeline can load and run models completely locally, as well as connect to remote APIs that support the chat template. Below are a few examples of how to use some common vision models and remote services.
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n"},{"location":"usage/enrichments/#granite-vision-model","title":"Granite Vision model","text":"Model specs: see the ibm-granite/granite-vision-3.1-2b-preview model card.
Usage in Docling:
from docling.datamodel.pipeline_options import granite_picture_description\n\npipeline_options.picture_description_options = granite_picture_description\n"},{"location":"usage/enrichments/#smolvlm-model","title":"SmolVLM model","text":"Model specs: see the HuggingFaceTB/SmolVLM-256M-Instruct model card.
Usage in Docling:
from docling.datamodel.pipeline_options import smolvlm_picture_description\n\npipeline_options.picture_description_options = smolvlm_picture_description\n"},{"location":"usage/enrichments/#other-vision-models","title":"Other vision models","text":"The option class PictureDescriptionVlmOptions allows you to use any other model from the Hugging Face Hub.
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM\n prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n)\n"},{"location":"usage/enrichments/#remote-vision-model","title":"Remote vision model","text":"The option class PictureDescriptionApiOptions allows you to use models hosted on remote platforms, e.g. on local endpoints served by vLLM, Ollama and others, or cloud providers like IBM watsonx.ai, etc.
Note: in most cases this option will send your data to the remote service provider.
Usage in Docling:
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions\n\n# Enable connections to remote services\npipeline_options.enable_remote_services=True # <-- this is required!\n\n# Example using a model running locally, e.g. via VLLM\n# $ vllm serve MODEL_NAME\npipeline_options.picture_description_options = PictureDescriptionApiOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=\"MODEL NAME\",\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n timeout=90,\n)\n End-to-end code snippets for cloud providers are available in the examples section:
Besides the implementations of the models listed above, the Docling documentation also has a few examples dedicated to implementing enrichment models.
This guide describes how to maximize GPU performance for Docling pipelines. It covers device selection and pipeline differences, and provides example snippets for configuring batch size and concurrency in the VLM pipeline on both Linux and Windows.
Note
Improvement and optimization strategies for maximizing GPU performance are an active topic. Check these guidelines regularly for updates.
"},{"location":"usage/gpu/#standard-pipeline","title":"Standard Pipeline","text":"Enable GPU acceleration by configuring the accelerator device and concurrency options using Docling's API:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\n\n# Configure accelerator options for GPU\naccelerator_options = AcceleratorOptions(\n device=AcceleratorDevice.CUDA, # or AcceleratorDevice.AUTO\n)\n Batch size and concurrency for document processing are controlled for each stage of the pipeline as:
from docling.datamodel.pipeline_options import (\n ThreadedPdfPipelineOptions,\n)\n\npipeline_options = ThreadedPdfPipelineOptions(\n ocr_batch_size=64, # default 4\n layout_batch_size=64, # default 4\n table_batch_size=4, # currently not using GPU batching\n)\n Setting a higher page_batch_size will run the Docling models (in particular the layout detection stage) in GPU batch inference mode.
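For instance, a minimal sketch of raising the page batch size through the global settings object (the same object used in the VLM section below; 32 is an illustrative value):
from docling.datamodel.settings import settings\n\n# Feed more pages per batch to the GPU-backed stages (default is 4)\nsettings.perf.page_batch_size = 32\n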
For a complete example see gpu_standard_pipeline.py.
"},{"location":"usage/gpu/#ocr-engines","title":"OCR engines","text":"The current Docling OCR engines rely on third-party libraries, hence GPU support depends on the availability in the respective engines.
The only setup currently known to work is RapidOCR with the torch backend, which can be enabled via:
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.ocr_options = RapidOcrOptions(\n backend=\"torch\",\n)\n More details are available in the GitHub discussion #2451.
"},{"location":"usage/gpu/#vlm-pipeline","title":"VLM Pipeline","text":"For best GPU utilization, use a local inference server. Docling supports inference servers which exposes the OpenAI-compatible chat completion endpoints. For example:
http://localhost:8000/v1/chat/completions (available only on Linux) http://localhost:1234/v1/chat/completions (available both on Linux and Windows) http://localhost:11434/v1/chat/completions (available both on Linux and Windows) Here is an example of how to start the vLLM inference server with optimal parameters for Granite Docling.
vllm serve ibm-granite/granite-docling-258M \\\n --host 127.0.0.1 --port 8000 \\\n --max-num-seqs 512 \\\n --max-num-batched-tokens 8192 \\\n --enable-chunked-prefill \\\n --gpu-memory-utilization 0.9\n"},{"location":"usage/gpu/#configure-docling","title":"Configure Docling","text":"Configure the VLM pipeline using Docling's VLM options:
from docling.datamodel.pipeline_options import VlmPipelineOptions\n\nvlm_options = VlmPipelineOptions(\n enable_remote_services=True,\n vlm_options={\n \"url\": \"http://localhost:8000/v1/chat/completions\", # or any other compatible endpoint\n \"params\": {\n \"model\": \"ibm-granite/granite-docling-258M\",\n \"max_tokens\": 4096,\n },\n \"concurrency\": 64, # default is 1\n \"prompt\": \"Convert this page to docling.\",\n \"timeout\": 90,\n }\n)\n Additionally to the concurrency, we also have to set the page_batch_size Docling parameter. Make sure to set settings.perf.page_batch_size >= vlm_options.concurrency.
from docling.datamodel.settings import settings\n\nsettings.perf.page_batch_size = 64 # default is 4\n"},{"location":"usage/gpu/#complete-example_1","title":"Complete example","text":"For a complete example see gpu_vlm_pipeline.py.
"},{"location":"usage/gpu/#available-models","title":"Available models","text":"Both LM Studio and Ollama rely on llama.cpp as runtime engine. For using this engine, models have to be converted to the gguf format.
Here is a list of known models that are available in GGUF format and how to use them.
TBA.
"},{"location":"usage/gpu/#performance-results","title":"Performance results","text":""},{"location":"usage/gpu/#test-data","title":"Test data","text":"PDF doc ViDoRe V3 HR Num docs 1 14 Num pages 192 1110 Num tables 95 258 Format type PDF Parquet of images"},{"location":"usage/gpu/#test-infrastructure","title":"Test infrastructure","text":"g6e.2xlarge RTX 5090 RTX 5070 Description AWS instanceg6e.2xlarge Linux bare metal machine Windows 11 bare metal machine CPU 8 vCPUs, AMD EPYC 7R13 16 vCPU, AMD Ryzen 7 9800 16 vCPU, AMD Ryzen 7 9800 RAM 64GB 128GB 64GB GPU NVIDIA L40S 48GB NVIDIA GeForce RTX 5090 NVIDIA GeForce RTX 5070 CUDA Version 13.0, driver 580.95.05 13.0, driver 580.105.08 13.0, driver 581.57"},{"location":"usage/gpu/#results","title":"Results","text":"Pipelineg6e.2xlargeRTX 5090RTX 5070 PDF docViDoRe V3 HRPDF docViDoRe V3 HRPDF docViDoRe V3 HR Standard - Inline (no OCR)3.1 pages/second-7.9 pages/second[cpu-only]* 1.5 pages/second-4.2 pages/second[cpu-only]* 1.2 pages/second- VLM - Inference server (GraniteDocling)2.4 pages/second-3.8 pages/second3.6-4.5 pages/second-- * cpu-only timing computed with 16 pytorch threads.
"},{"location":"usage/jobkit/","title":"Jobkit","text":"Docling's document conversion can be executed as distributed jobs using Docling Jobkit.
This library provides:
You can run Jobkit locally via the CLI:
uv run docling-jobkit-local [configuration-file-path]\n The configuration file defines:
Example configuration file:
options: # Example Docling's conversion options\n do_ocr: false \nsources: # Source location (here Google Drive)\n - kind: google_drive\n path_id: 1X6B3j7GWlHfIPSF9VUkasN-z49yo1sGFA9xv55L2hSE\n token_path: \"./dev/google_drive/google_drive_token.json\" \n credentials_path: \"./dev/google_drive/google_drive_credentials.json\" \ntarget: # Target location (here S3)\n kind: s3\n endpoint: localhost:9000\n verify_ssl: false\n bucket: docling-target\n access_key: minioadmin\n secret_key: minioadmin\n"},{"location":"usage/jobkit/#connectors","title":"Connectors","text":"Connectors are used to import documents for processing with Docling and to export results after conversion.
The currently supported connectors are:
To use Google Drive as a source or target, you need to enable the API and set up credentials.
Step 1: Enable the Google Drive API.
Step 2: Create OAuth credentials.
google_drive_credentials.json. Step 3: Add test users.
Step 4: Edit the configuration file.
credentials_path with your path to google_drive_credentials.json. path_id with your source or target location. It can be obtained from the URL as follows: https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5 > folder id is 1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5. https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit > document id is 1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw. Step 5: Authenticate via CLI.
The token is saved to token_path and reused for subsequent runs. New AI trends focus on Agentic AI, an artificial intelligence system that can accomplish a specific goal with limited supervision. Agents can act autonomously to understand, plan, and execute a specific task.
To address the integration problem, the Model Context Protocol (MCP) has emerged as a popular standard for connecting AI applications to external tools.
"},{"location":"usage/mcp/#docling-mcp","title":"Docling MCP","text":"Docling supports the development of AI agents by providing an MCP Server. It allows you to experiment with document processing in different MCP Clients. Adding Docling MCP in your favorite client is usually as simple as adding the following entry in the configuration file:
{\n \"mcpServers\": {\n \"docling\": {\n \"command\": \"uvx\",\n \"args\": [\n \"--from=docling-mcp\",\n \"docling-mcp-server\"\n ]\n }\n }\n}\n When using Claude on your desktop, just edit the config file claude_desktop_config.json with the snippet above or the example provided here.
In LM Studio, edit the mcp.json file with the appropriate section or simply click on the button below for a direct install.
Docling MCP also provides tools specific to certain applications and frameworks. See the Docling MCP Server repository for more details. You will find examples of building agents powered by Docling capabilities and leveraging frameworks like LlamaIndex, Llama Stack, Pydantic AI, or smolagents.
"},{"location":"usage/supported_formats/","title":"Supported formats","text":"Docling can parse various documents formats into a unified representation (Docling Document), which it can export to different formats too \u2014 check out Architecture for more details.
Below you can find a listing of all supported input and output formats.
"},{"location":"usage/supported_formats/#supported-input-formats","title":"Supported input formats","text":"Format Description PDF DOCX, XLSX, PPTX Default formats in MS Office 2007+, based on Office Open XML Markdown AsciiDoc Human-readable, plain-text markup language for structured technical content HTML, XHTML CSV PNG, JPEG, TIFF, BMP, WEBP Image formats WebVTT Web Video Text Tracks format for displaying timed textSchema-specific support:
Format Description USPTO XML XML format followed by USPTO patents JATS XML XML format followed by JATS articles Docling JSON JSON-serialized Docling Document"},{"location":"usage/supported_formats/#supported-output-formats","title":"Supported output formats","text":"Format Description HTML Both image embedding and referencing are supported Markdown JSON Lossless serialization of Docling Document Text Plain text, i.e. without Markdown markers Doctags Markup format for efficiently representing the full content and layout characteristics of a document"},{"location":"usage/vision_models/","title":"Vision models","text":"The VlmPipeline in Docling allows you to convert documents end-to-end using a vision-language model.
Docling supports vision-language models which output:
For running Docling using local models with the VlmPipeline:
docling --pipeline vlm FILE\n See also the example minimal_vlm_pipeline.py.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n ),\n }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n"},{"location":"usage/vision_models/#available-local-models","title":"Available local models","text":"By default, the vision-language models run locally. Docling allows you to choose between the Hugging Face Transformers framework and MLX (for Apple devices with MPS acceleration).
The following table reports the models currently available out-of-the-box.
Model instance Model Framework Device Num pages Inference time (sec) vlm_model_specs.GRANITEDOCLING_TRANSFORMERS ibm-granite/granite-docling-258M Transformers/AutoModelForVision2Seq MPS 1 - vlm_model_specs.GRANITEDOCLING_MLX ibm-granite/granite-docling-258M-mlx-bf16 MLX MPS 1 - vlm_model_specs.SMOLDOCLING_TRANSFORMERS ds4sd/SmolDocling-256M-preview Transformers/AutoModelForVision2Seq MPS 1 102.212 vlm_model_specs.SMOLDOCLING_MLX ds4sd/SmolDocling-256M-preview-mlx-bf16 MLX MPS 1 6.15453 vlm_model_specs.QWEN25_VL_3B_MLX mlx-community/Qwen2.5-VL-3B-Instruct-bf16 MLX MPS 1 23.4951 vlm_model_specs.PIXTRAL_12B_MLX mlx-community/pixtral-12b-bf16 MLX MPS 1 308.856 vlm_model_specs.GEMMA3_12B_MLX mlx-community/gemma-3-12b-it-bf16 MLX MPS 1 378.486 vlm_model_specs.GRANITE_VISION_TRANSFORMERS ibm-granite/granite-vision-3.2-2b Transformers/AutoModelForVision2Seq MPS 1 104.75 vlm_model_specs.PHI4_TRANSFORMERS microsoft/Phi-4-multimodal-instruct Transformers/AutoModelForCausalLM CPU 1 1175.67 vlm_model_specs.PIXTRAL_12B_TRANSFORMERS mistral-community/pixtral-12b Transformers/AutoModelForVision2Seq CPU 1 1828.21 Inference time is computed on a MacBook M3 Max using the example page tests/data/pdf/2305.03393v1-pg9.pdf. The comparison is done with the example compare_vlm_models.py.
To choose the model, the code snippet above can be extended as follows:
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel import vlm_model_specs\n\npipeline_options = VlmPipelineOptions(\n vlm_options=vlm_model_specs.SMOLDOCLING_MLX, # <-- change the model here\n)\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n"},{"location":"usage/vision_models/#other-models","title":"Other models","text":"Other models can be configured by directly providing the Hugging Face repo_id, the prompt and a few more options.
For example:
from docling.datamodel.accelerator_options import AcceleratorDevice\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\nfrom docling.datamodel.pipeline_options_vlm_model import InlineVlmOptions, InferenceFramework, ResponseFormat, TransformersModelType\n\npipeline_options = VlmPipelineOptions(\n vlm_options=InlineVlmOptions(\n repo_id=\"ibm-granite/granite-vision-3.2-2b\",\n prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,\n supported_devices=[\n AcceleratorDevice.CPU,\n AcceleratorDevice.CUDA,\n AcceleratorDevice.MPS,\n ],\n scale=2.0,\n temperature=0.0,\n )\n)\n"},{"location":"usage/vision_models/#remote-models","title":"Remote models","text":"In addition to local models, the VlmPipeline allows offloading the inference to a remote service hosting the models. Many remote inference services can be used; the key requirement is that they offer an OpenAI-compatible API. This includes vLLM, Ollama, etc.
More details on how to connect to remote inference services can be found in the following examples: