{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Documentation","text":"
Docling simplifies document processing, parsing diverse formats \u2014 including advanced PDF understanding \u2014 and providing seamless integrations with the gen AI ecosystem.
"},{"location":"#getting-started","title":"Getting started","text":"\ud83d\udc23 Ready to kick off your Docling journey? Let's dive right into it!
\u2b07\ufe0f InstallationQuickly install Docling in your environment \u25b6\ufe0f QuickstartGet a jumpstart on basic Docling usage \ud83e\udde9 ConceptsLearn Docling fundamentals and get a glimpse under the hood \ud83e\uddd1\ud83c\udffd\u200d\ud83c\udf73 ExamplesTry out recipes for various use cases, including conversion, RAG, and more \ud83e\udd16 IntegrationsCheck out integrations with popular AI tools and frameworks \ud83d\udcd6 ReferenceSee more API details"},{"location":"#features","title":"Features","text":"\ud83d\ude80 The journey has just begun! Join us and become a part of the growing Docling community.
Do you want to leverage the power of AI and get live support on Docling? Try out the Chat with Dosu functionalities provided by our friends at Dosu.
"},{"location":"#lf-ai-data","title":"LF AI & Data","text":"Docling is hosted as a project in the LF AI & Data Foundation.
"},{"location":"#ibm-open-source-ai","title":"IBM \u2764\ufe0f Open Source AI","text":"The project was started by the AI for knowledge team at IBM Research Zurich.
"},{"location":"v2/","title":"V2","text":""},{"location":"v2/#whats-new","title":"What's new","text":"Docling v2 introduces several new features:
We updated the command line syntax of Docling v2 to support many formats. Examples are shown below.
# Convert a single file to Markdown (default)\ndocling myfile.pdf\n\n# Convert a single file to Markdown and JSON, without OCR\ndocling myfile.pdf --to json --to md --no-ocr\n\n# Convert PDF files in input directory to Markdown (default)\ndocling ./input/dir --from pdf\n\n# Convert PDF and Word files in input directory to Markdown and JSON\ndocling ./input/dir --from pdf --from docx --to md --to json --output ./scratch\n\n# Convert all supported files in input directory to Markdown, but abort on first error\ndocling ./input/dir --output ./scratch --abort-on-error\n Notable changes from Docling v1:
--from and --to arguments, to define input and output formats respectively. --abort-on-error will abort any batch conversion as soon as an error is encountered. The --backend option for PDFs was removed.DocumentConverter","text":"To accommodate many input formats, we changed the way you need to set up your DocumentConverter object. You can now define a list of allowed formats on the DocumentConverter initialization, and specify custom options per-format if desired. By default, all supported formats are allowed. If you don't provide format_options, defaults will be used for all allowed_formats.
Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend. They are provided as format-specific types, such as PdfFormatOption or WordFormatOption, as seen below.
from docling.document_converter import DocumentConverter\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import (\n DocumentConverter,\n PdfFormatOption,\n WordFormatOption,\n)\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\n\n## Default initialization still works as before:\n# doc_converter = DocumentConverter()\n\n\n# previous `PipelineOptions` is now `PdfPipelineOptions`\npipeline_options = PdfPipelineOptions()\npipeline_options.do_ocr = False\npipeline_options.do_table_structure = True\n#...\n\n## Custom options are now defined per format.\ndoc_converter = (\n DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[\n InputFormat.PDF,\n InputFormat.IMAGE,\n InputFormat.DOCX,\n InputFormat.HTML,\n InputFormat.PPTX,\n ], # whitelist formats, non-matching files are ignored.\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options, # pipeline options go here.\n backend=PyPdfiumDocumentBackend # optional: pick an alternative backend\n ),\n InputFormat.DOCX: WordFormatOption(\n pipeline_cls=SimplePipeline # default for office formats and HTML\n ),\n },\n )\n)\n Note: If you work only with defaults, all remains the same as in Docling v1.
More options are shown in the following examples:
We have simplified the way you can feed input to the DocumentConverter and renamed the conversion methods for better semantics. You can now call the conversion directly with a single file, a list of input files, or DocumentStream objects, without constructing a DocumentConversionInput object first.
DocumentConverter.convert now converts a single file input (previously DocumentConverter.convert_single).DocumentConverter.convert_all now converts many files at once (previously DocumentConverter.convert)....\nfrom docling.datamodel.document import ConversionResult\n## Convert a single file (from URL or local path)\nconv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Convert several files at once:\n\ninput_files = [\n \"tests/data/html/wiki_duck.html\",\n \"tests/data/docx/word_sample.docx\",\n \"tests/data/docx/lorem_ipsum.docx\",\n \"tests/data/pptx/powerpoint_sample.pptx\",\n \"tests/data/2305.03393v1-pg9-img.png\",\n \"tests/data/pdf/2206.01062.pdf\",\n]\n\n# Directly pass list of files or streams to `convert_all`\nconv_results_iter = doc_converter.convert_all(input_files) # previously `convert`\n Through the raises_on_error argument, you can also control if the conversion should raise exceptions when first encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status. By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed). ...\nconv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False) # previously `convert`\n"},{"location":"v2/#access-document-structures","title":"Access document structures","text":"We have simplified how you can access and export the converted document data, too. Our universal document representation is now available in conversion results as a DoclingDocument object. DoclingDocument provides a neat set of APIs to construct, iterate and export content in the document, as shown below.
import pandas as pd\nfrom docling_core.types.doc import TextItem, TableItem\n\nconv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Inspect the converted document:\nconv_result.document.print_element_tree()\n\n## Iterate the elements in reading order, including hierarchy level:\nfor item, level in conv_result.document.iterate_items():\n if isinstance(item, TextItem):\n print(item.text)\n elif isinstance(item, TableItem):\n table_df: pd.DataFrame = item.export_to_dataframe(doc=conv_result.document)\n print(table_df.to_markdown())\n elif ...:\n #...\n Note: While it is deprecated, you can still work with the Docling v1 document representation, it is available as:
conv_result.legacy_document # provides the representation in previous ExportedCCSDocument type\n"},{"location":"v2/#export-into-json-markdown-doctags","title":"Export into JSON, Markdown, Doctags","text":"Note: All render_... methods in ConversionResult have been removed in Docling v2, and are now available on DoclingDocument as:
DoclingDocument.export_to_dict, DoclingDocument.export_to_markdown, DoclingDocument.export_to_document_tokens\nconv_res: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Export to desired format:\nprint(json.dumps(conv_res.document.export_to_dict()))\nprint(conv_res.document.export_to_markdown())\nprint(conv_res.document.export_to_document_tokens())\n Note: While it is deprecated, you can still export the Docling v1 JSON format. This is available through the same methods as on the DoclingDocument type:
## Export legacy document representation to desired format, for v1 compatibility:\nprint(json.dumps(conv_res.legacy_document.export_to_dict()))\nprint(conv_res.legacy_document.export_to_markdown())\nprint(conv_res.legacy_document.export_to_document_tokens())\n"},{"location":"v2/#reload-a-doclingdocument-stored-as-json","title":"Reload a DoclingDocument stored as JSON","text":"You can save and reload a DoclingDocument to disk in JSON format using the following code:
# Save to disk:\ndoc: DoclingDocument = conv_res.document # produced from conversion result...\n\nwith Path(\"./doc.json\").open(\"w\") as fp:\n fp.write(json.dumps(doc.export_to_dict())) # use `export_to_dict` to ensure consistency\n\n# Load from disk:\nwith Path(\"./doc.json\").open(\"r\") as fp:\n doc_dict = json.loads(fp.read())\n doc = DoclingDocument.model_validate(doc_dict) # use standard pydantic API to populate doc\n"},{"location":"v2/#chunking","title":"Chunking","text":"Docling v2 defines new base classes for chunking:
BaseMeta for chunk metadata, BaseChunk containing the chunk text and metadata, and BaseChunker for chunkers, producing chunks out of a DoclingDocument. Additionally, it provides an updated HierarchicalChunker implementation, which leverages the new DoclingDocument and provides a new, richer chunk output format, including:
For an example, check out Chunking usage.
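As a minimal sketch (assuming conv_res is a ConversionResult obtained as in the sections above): from docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker\n\nchunker = HierarchicalChunker() # list items are merged by default\nfor chunk in chunker.chunk(dl_doc=conv_res.document):\n    print(chunk.text) # the chunk text\n    print(chunk.meta) # richer metadata, e.g. enclosing headings and captions\n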
"},{"location":"concepts/","title":"Concepts","text":"In this space, you can peek under the hood and learn some fundamental Docling concepts!
Here are some of our picks to get you started:
\ud83d\udc48 ... and there is much more: explore all the concepts using the navigation menu on the side
Docling architecture outline"},{"location":"concepts/architecture/","title":"Architecture","text":"In a nutshell, Docling's architecture is outlined in the diagram above.
For each document format, the document converter knows which format-specific backend to employ for parsing the document and which pipeline to use for orchestrating the execution, along with any relevant options.
Tip
While the document converter holds a default mapping, this configuration is parametrizable, so e.g. for the PDF format, different backends and different pipeline options can be used \u2014 see Usage.
The conversion result contains the Docling document, Docling's fundamental document representation.
Some typical scenarios for using a Docling document include directly calling its export methods, such as for markdown, dictionary etc., or having it serialized by a serializer or chunked by a chunker.
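As an illustrative sketch of this flow (the input file name is hypothetical; the default pipeline, backend, and options are used): from docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter() # default format-to-pipeline/backend mapping\nresult = converter.convert(\"report.pdf\") # hypothetical input file\ndoc = result.document # the Docling document\nprint(doc.export_to_markdown()) # direct export; the same document can be passed to a serializer or chunker\n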
For more details on Docling's architecture, check out the Docling Technical Report.
Note
The components illustrated with dashed outline indicate base classes that can be subclassed for specialized implementations.
"},{"location":"concepts/chunking/","title":"Chunking","text":""},{"location":"concepts/chunking/#introduction","title":"Introduction","text":"Chunking approaches
Starting from a DoclingDocument, there are in principle two possible chunking approaches:
exporting the DoclingDocument to Markdown (or a similar format) and then performing user-defined chunking as a post-processing step, or using native Docling chunkers operating directly on the DoclingDocument. This page is about the latter, i.e. using native Docling chunkers. For an example of approach (1), check out e.g. this recipe looking at the Markdown export mode.
A chunker is a Docling abstraction that, given a DoclingDocument, returns a stream of chunks, each of which captures some part of the document as a string accompanied by respective metadata.
To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a chunker class hierarchy, providing a base type, BaseChunker, as well as specific subclasses.
Docling integration with gen AI frameworks like LlamaIndex is done using the BaseChunker interface, so users can easily plug in any built-in, self-defined, or third-party BaseChunker implementation.
The BaseChunker base class API defines that any chunker should provide the following:
def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]: Returning the chunks for the provided document. def contextualize(self, chunk: BaseChunk) -> str: Returning the potentially metadata-enriched serialization of the chunk, typically used to feed an embedding model (or generation model). To access HybridChunker
If you are using the docling package, you can import as follows: from docling.chunking import HybridChunker\nIf you are using the docling-core package, you must install the chunking extra if you want to use HuggingFace tokenizers, e.g. pip install 'docling-core[chunking]'\n or the chunking-openai extra if you prefer OpenAI tokenizers (tiktoken), e.g. pip install 'docling-core[chunking-openai]'\n and then you can import as follows: from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\nThe HybridChunker implementation uses a hybrid approach, applying tokenization-aware refinements on top of document-based hierarchical chunking.
More precisely:
merge_peers (by default True)\ud83d\udc49 Usage examples:
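For instance, a minimal sketch (assuming doc is a DoclingDocument obtained from a prior conversion; default tokenizer settings are used): from docling.chunking import HybridChunker\n\nchunker = HybridChunker() # default tokenizer, merge_peers=True\nfor chunk in chunker.chunk(dl_doc=doc):\n    enriched_text = chunker.contextualize(chunk=chunk) # metadata-enriched text, e.g. to feed an embedding model\n    print(enriched_text)\n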
The HierarchicalChunker implementation uses the document structure information from the DoclingDocument to create one chunk for each individual detected document element, by default only merging together list items (this can be disabled via the merge_list_items parameter). It also takes care of attaching all relevant document metadata, including headers and captions.
Confidence grades were introduced in v2.34.0 to help users understand how well a conversion performed and guide decisions about post-processing workflows. They are available in the confidence field of the ConversionResult object returned by the document converter.
Complex layouts, poor scan quality, or challenging formatting can lead to suboptimal document conversion results that may require additional attention or alternative conversion pipelines.
Confidence scores provide a quantitative assessment of document conversion quality. Each confidence report includes a numerical score (0.0 to 1.0) measuring conversion accuracy, and a quality grade (poor, fair, good, excellent) for quick interpretation.
Focus on quality grades!
Users can and should safely focus on the document-level grade fields \u2014 mean_grade and low_grade \u2014 to assess overall conversion quality. Numerical scores are used internally and are for informational purposes only; their computation and weighting may change in the future.
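For example, a sketch of checking the document-level grades after a conversion (field names as described below; the input file name is hypothetical): from docling.document_converter import DocumentConverter\n\nconv_res = DocumentConverter().convert(\"report.pdf\") # hypothetical input file\nconfidence = conv_res.confidence # confidence report of this conversion\nprint(confidence.mean_grade) # overall quality grade\nprint(confidence.low_grade) # grade highlighting the worst-performing areas\nprint(confidence.pages) # per-page confidence reports\n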
Use cases for confidence grades include:
A confidence report contains scores and grades:
POOR, FAIR, GOOD, EXCELLENT. Each confidence report includes four component scores and grades:
layout_score: Overall quality of document element recognition. ocr_score: Quality of OCR-extracted content. parse_score: 10th percentile score of digital text cells (emphasizes problem areas). table_score: Table extraction quality (not yet implemented). Two aggregate grades provide overall document quality assessment:
mean_grade: Average of the four component scores. low_grade: 5th percentile score (highlights worst-performing areas). Confidence grades are calculated at two levels:
at the document level, for the conversion as a whole, and at the page level, with per-page reports (ConfidenceReport) stored in the pages field. With Docling v2, we introduced a unified document representation format called DoclingDocument. It is defined as a pydantic datatype, which can express several features common to documents, such as:
The definition of the Pydantic types is implemented in the module docling_core.types.doc; more details can be found in the source code definitions.
It also brings a set of document construction APIs to build up a DoclingDocument from scratch.
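A minimal sketch of building a document from scratch (assuming the add_heading and add_text construction helpers of the installed docling-core version; the document name is arbitrary): from docling_core.types.doc.document import DoclingDocument\nfrom docling_core.types.doc.labels import DocItemLabel\n\ndoc = DoclingDocument(name=\"scratch_example\") # arbitrary document name\ndoc.add_heading(text=\"Lorem ipsum\") # assumed construction helper\ndoc.add_text(label=DocItemLabel.TEXT, text=\"Dolor sit amet.\") # assumed construction helper\nprint(doc.export_to_markdown())\n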
To illustrate the features of the DoclingDocument format, in the subsections below we consider the DoclingDocument converted from tests/data/word_sample.docx and we present some side-by-side comparisons, where the left side shows snippets from the converted document serialized as YAML and the right one shows the corresponding parts of the original MS Word document.
A DoclingDocument exposes top-level fields for the document content, organized in two categories. The first category is the content items, which are stored in these fields:
texts: All items that have a text representation (paragraph, section heading, equation, ...). Base class is TextItem. tables: All tables, type TableItem. Can carry structure annotations. pictures: All pictures, type PictureItem. Can carry structure annotations. key_value_items: All key-value items. All of the above fields are lists and store items inheriting from the DocItem type. They can express different data structures depending on their type, and reference parents and children through JSON pointers.
The second category is content structure, which is encapsulated in:
body: The root node of a tree structure for the main document body. furniture: The root node of a tree structure for all items that don't belong in the body (headers, footers, ...). groups: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter). All of the above fields store only NodeItem instances, which reference children and parents through JSON pointers.
The reading order of the document is encapsulated through the body tree and the order of children in each item in the tree.
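For instance, a sketch of walking the direct children of the body root (assuming the RefItem.resolve helper of docling-core; doc is a DoclingDocument): # doc: a DoclingDocument from a prior conversion or load\nfor ref in doc.body.children: # children are stored as JSON pointer references, in reading order\n    item = ref.resolve(doc=doc) # follow the pointer, e.g. \"#/texts/1\"\n    print(ref.cref, type(item).__name__)\n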
The example below shows how all items on the first page are nested below the title item (#/texts/1).
The next example shows how all items under the heading \"Let's swim\" (#/texts/5) are nested as children. The children of \"Let's swim\" are both text items and groups, which contain the list elements. The group items are stored in the top-level groups field.
Docling can be extended with third-party plugins, which broaden the choice of options available in several steps of the pipeline.
Plugins are loaded via the pluggy system, which allows third-party developers to register new capabilities using the setuptools entrypoint.
The actual entrypoint definition might vary, depending on the packaging system you are using. Here are a few examples:
pyproject.toml | poetry v1 pyproject.toml | setup.cfg | setup.py [project.entry-points.\"docling\"]\nyour_plugin_name = \"your_package.module\"\n [tool.poetry.plugins.\"docling\"]\nyour_plugin_name = \"your_package.module\"\n [options.entry_points]\ndocling =\n your_plugin_name = your_package.module\n from setuptools import setup\n\nsetup(\n # ...,\n entry_points = {\n 'docling': [\n 'your_plugin_name = your_package.module'\n ]\n }\n)\n your_plugin_name is the name you choose for your plugin. This must be unique across the broader Docling ecosystem. your_package.module is the reference to the module in your package which is responsible for the plugin registration. The OCR factory makes it possible to provide additional OCR engines to Docling users.
The content of your_package.module registers the OCR engines with code similar to:
# Factory registration\ndef ocr_engines():\n return {\n \"ocr_engines\": [\n YourOcrModel,\n ]\n }\n where YourOcrModel must implement the BaseOcrModel and provide an options class derived from OcrOptions.
If you are looking for an example, the default Docling plugins are a good starting point.
"},{"location":"concepts/plugins/#third-party-plugins","title":"Third-party plugins","text":"When the plugin is not provided by the main docling package but by a third-party package this have to be enabled explicitly via the allow_external_plugins option.
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions()\npipeline_options.allow_external_plugins = True # <-- enable external plugins\npipeline_options.ocr_options = YourOptions # <-- your options here\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options\n )\n }\n)\n"},{"location":"concepts/plugins/#using-the-docling-cli","title":"Using the docling CLI","text":"Similarly, when using the docling CLI, users have to enable external plugins before selecting the new one.
# Show the external plugins\ndocling --show-external-plugins\n\n# Run docling with the new plugin\ndocling --allow-external-plugins --ocr-engine=NAME\n"},{"location":"concepts/serialization/","title":"Serialization","text":""},{"location":"concepts/serialization/#introduction","title":"Introduction","text":"A document serializer (AKA simply serializer) is a Docling abstraction that is initialized with a given DoclingDocument and returns a textual representation for that document.
Besides the document serializer, Docling defines similar abstractions for several document subcomponents, for example: text serializer, table serializer, picture serializer, list serializer, inline serializer, and more.
Last but not least, a serializer provider is a wrapper that abstracts the document serialization strategy from the document instance.
"},{"location":"concepts/serialization/#base-classes","title":"Base classes","text":"To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a serialization class hierarchy, providing:
BaseDocSerializer, as well as BaseTextSerializer, BaseTableSerializer etc., and BaseSerializerProvider, and MarkdownDocSerializer. You can review all methods required to define the above base classes here.
From a client perspective, the most relevant is BaseDocSerializer.serialize(), which returns the textual representation,\u00a0as well as relevant metadata on which document components contributed to that serialization.
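For instance, a minimal sketch using the predefined Markdown serializer (doc being a DoclingDocument; see also the export method shorthands below): from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\n\nserializer = MarkdownDocSerializer(doc=doc) # bind the serializer to a document instance\nser_result = serializer.serialize() # returns a SerializationResult\nprint(ser_result.text) # the textual representation\n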
DoclingDocument export methods","text":"Docling provides predefined serializers for Markdown, HTML, and DocTags.
The respective DoclingDocument export methods (e.g. export_to_markdown()) are provided as user shorthands \u2014 internally directly instantiating and delegating to respective serializers.
For an example showcasing how to use serializers, see here.
"},{"location":"examples/","title":"Examples","text":"In this space, you can explore numerous Docling application recipes & end-to-end workflows!
Here are some of our picks to get you started:
\ud83d\udc48 ... and there is much more: explore all the examples using the navigation menu on the side
Visual grounding Picture annotations"},{"location":"examples/advanced_chunking_and_serialization/","title":"Advanced chunking & serialization","text":"In this notebook we show how to customize the serialization strategies that come into play during chunking.
We will work with a document that contains some picture annotations:
In\u00a0[1]: Copied!from docling_core.types.doc.document import DoclingDocument\n\nSOURCE = \"./data/2408.09869v3_enriched.json\"\n\ndoc = DoclingDocument.load_from_json(SOURCE)\nfrom docling_core.types.doc.document import DoclingDocument SOURCE = \"./data/2408.09869v3_enriched.json\" doc = DoclingDocument.load_from_json(SOURCE)
Below we define the chunker (for more details check out Hybrid Chunking):
In\u00a0[2]: Copied!from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\nfrom docling_core.transforms.chunker.tokenizer.base import BaseTokenizer\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom transformers import AutoTokenizer\n\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n\ntokenizer: BaseTokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n)\nchunker = HybridChunker(tokenizer=tokenizer)\nfrom docling_core.transforms.chunker.hybrid_chunker import HybridChunker from docling_core.transforms.chunker.tokenizer.base import BaseTokenizer from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" tokenizer: BaseTokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), ) chunker = HybridChunker(tokenizer=tokenizer) In\u00a0[3]: Copied!
print(f\"{tokenizer.get_max_tokens()=}\")\n print(f\"{tokenizer.get_max_tokens()=}\") tokenizer.get_max_tokens()=512\n
Defining some helper methods:
In\u00a0[4]: Copied!from typing import Iterable, Optional\n\nfrom docling_core.transforms.chunker.base import BaseChunk\nfrom docling_core.transforms.chunker.hierarchical_chunker import DocChunk\nfrom docling_core.types.doc.labels import DocItemLabel\nfrom rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console(\n width=200, # for getting Markdown tables rendered nicely\n)\n\n\ndef find_n_th_chunk_with_label(\n iter: Iterable[BaseChunk], n: int, label: DocItemLabel\n) -> Optional[DocChunk]:\n num_found = -1\n for i, chunk in enumerate(iter):\n doc_chunk = DocChunk.model_validate(chunk)\n for it in doc_chunk.meta.doc_items:\n if it.label == label:\n num_found += 1\n if num_found == n:\n return i, chunk\n return None, None\n\n\ndef print_chunk(chunks, chunk_pos):\n chunk = chunks[chunk_pos]\n ctx_text = chunker.contextualize(chunk=chunk)\n num_tokens = tokenizer.count_tokens(text=ctx_text)\n doc_items_refs = [it.self_ref for it in chunk.meta.doc_items]\n title = f\"{chunk_pos=} {num_tokens=} {doc_items_refs=}\"\n console.print(Panel(ctx_text, title=title))\n from typing import Iterable, Optional from docling_core.transforms.chunker.base import BaseChunk from docling_core.transforms.chunker.hierarchical_chunker import DocChunk from docling_core.types.doc.labels import DocItemLabel from rich.console import Console from rich.panel import Panel console = Console( width=200, # for getting Markdown tables rendered nicely ) def find_n_th_chunk_with_label( iter: Iterable[BaseChunk], n: int, label: DocItemLabel ) -> Optional[DocChunk]: num_found = -1 for i, chunk in enumerate(iter): doc_chunk = DocChunk.model_validate(chunk) for it in doc_chunk.meta.doc_items: if it.label == label: num_found += 1 if num_found == n: return i, chunk return None, None def print_chunk(chunks, chunk_pos): chunk = chunks[chunk_pos] ctx_text = chunker.contextualize(chunk=chunk) num_tokens = tokenizer.count_tokens(text=ctx_text) doc_items_refs = [it.self_ref for it in chunk.meta.doc_items] title = f\"{chunk_pos=} {num_tokens=} {doc_items_refs=}\" console.print(Panel(ctx_text, title=title)) Below we inspect the first chunk containing a table \u2014 using the default serialization strategy:
In\u00a0[5]: Copied!chunker = HybridChunker(tokenizer=tokenizer)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nchunker = HybridChunker(tokenizer=tokenizer) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE) print_chunk( chunks=chunks, chunk_pos=i, )
Token indices sequence length is longer than the specified maximum sequence length for this model (652 > 512). Running this sequence through the model will result in indexing errors\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=13 num_tokens=426 doc_items_refs=['#/texts/72', '#/tables/0'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 4 Performance \u2502\n\u2502 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u2502\n\u2502 (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. \u2502\n\u2502 \u2502\n\u2502 Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max, \u2502\n\u2502 pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 \u2502\n\u2502 cores) Intel(R) Xeon E5-2690, native backend.TTS = 375 s 244 s. (16 cores) Intel(R) Xeon E5-2690, native backend.Pages/s = 0.60 0.92. (16 cores) Intel(R) Xeon E5-2690, native backend.Mem = 6.16 \u2502\n\u2502 GB. (16 cores) Intel(R) Xeon E5-2690, pypdfium backend.TTS = 239 s 143 s. (16 cores) Intel(R) Xeon E5-2690, pypdfium backend.Pages/s = 0.94 1.57. (16 cores) Intel(R) Xeon E5-2690, pypdfium \u2502\n\u2502 backend.Mem = 2.42 GB \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nINFO: As you see above, using the
HybridChunker can sometimes lead to a warning from the transformers library; however, this is a \"false alarm\" \u2014 for details check here. We can configure a different serialization strategy. In the example below, we specify a different table serializer that serializes tables to Markdown instead of the triplet notation used by default:
In\u00a0[6]: Copied!from docling_core.transforms.chunker.hierarchical_chunker import (\n ChunkingDocSerializer,\n ChunkingSerializerProvider,\n)\nfrom docling_core.transforms.serializer.markdown import MarkdownTableSerializer\n\n\nclass MDTableSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n table_serializer=MarkdownTableSerializer(), # configuring a different table serializer\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=MDTableSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nfrom docling_core.transforms.chunker.hierarchical_chunker import ( ChunkingDocSerializer, ChunkingSerializerProvider, ) from docling_core.transforms.serializer.markdown import MarkdownTableSerializer class MDTableSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, table_serializer=MarkdownTableSerializer(), # configuring a different table serializer ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=MDTableSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=13 num_tokens=431 doc_items_refs=['#/texts/72', '#/tables/0'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 4 Performance \u2502\n\u2502 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u2502\n\u2502 (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. \u2502\n\u2502 \u2502\n\u2502 | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | \u2502\n\u2502 |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| \u2502\n\u2502 | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | \u2502\n\u2502 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | \u2502\n\u2502 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Below we inspect the first chunk containing a picture.
Even when using the default strategy, we can modify the relevant parameters, e.g. which placeholder is used for pictures:
In\u00a0[7]: Copied!from docling_core.transforms.serializer.markdown import MarkdownParams\n\n\nclass ImgPlaceholderSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n params=MarkdownParams(\n image_placeholder=\"<!-- image -->\",\n ),\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=ImgPlaceholderSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nfrom docling_core.transforms.serializer.markdown import MarkdownParams class ImgPlaceholderSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, params=MarkdownParams( image_placeholder=\"\", ), ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=ImgPlaceholderSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=0 num_tokens=117 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 <!-- image --> \u2502\n\u2502 Version 1.0 \u2502\n\u2502 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta \u2502\n\u2502 Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar \u2502\n\u2502 AI4K Group, IBM Research R\u00a8 uschlikon, Switzerland \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Below we define and use our custom picture serialization strategy which leverages picture annotations:
In\u00a0[8]: Copied!from typing import Any\n\nfrom docling_core.transforms.serializer.base import (\n BaseDocSerializer,\n SerializationResult,\n)\nfrom docling_core.transforms.serializer.common import create_ser_result\nfrom docling_core.transforms.serializer.markdown import MarkdownPictureSerializer\nfrom docling_core.types.doc.document import (\n PictureClassificationData,\n PictureDescriptionData,\n PictureItem,\n PictureMoleculeData,\n)\nfrom typing_extensions import override\n\n\nclass AnnotationPictureSerializer(MarkdownPictureSerializer):\n @override\n def serialize(\n self,\n *,\n item: PictureItem,\n doc_serializer: BaseDocSerializer,\n doc: DoclingDocument,\n **kwargs: Any,\n ) -> SerializationResult:\n text_parts: list[str] = []\n for annotation in item.annotations:\n if isinstance(annotation, PictureClassificationData):\n predicted_class = (\n annotation.predicted_classes[0].class_name\n if annotation.predicted_classes\n else None\n )\n if predicted_class is not None:\n text_parts.append(f\"Picture type: {predicted_class}\")\n elif isinstance(annotation, PictureMoleculeData):\n text_parts.append(f\"SMILES: {annotation.smi}\")\n elif isinstance(annotation, PictureDescriptionData):\n text_parts.append(f\"Picture description: {annotation.text}\")\n\n text_res = \"\\n\".join(text_parts)\n text_res = doc_serializer.post_process(text=text_res)\n return create_ser_result(text=text_res, span_source=item)\n from typing import Any from docling_core.transforms.serializer.base import ( BaseDocSerializer, SerializationResult, ) from docling_core.transforms.serializer.common import create_ser_result from docling_core.transforms.serializer.markdown import MarkdownPictureSerializer from docling_core.types.doc.document import ( PictureClassificationData, PictureDescriptionData, PictureItem, PictureMoleculeData, ) from typing_extensions import override class AnnotationPictureSerializer(MarkdownPictureSerializer): @override def serialize( self, *, item: PictureItem, doc_serializer: BaseDocSerializer, doc: DoclingDocument, **kwargs: Any, ) -> SerializationResult: text_parts: list[str] = [] for annotation in item.annotations: if isinstance(annotation, PictureClassificationData): predicted_class = ( annotation.predicted_classes[0].class_name if annotation.predicted_classes else None ) if predicted_class is not None: text_parts.append(f\"Picture type: {predicted_class}\") elif isinstance(annotation, PictureMoleculeData): text_parts.append(f\"SMILES: {annotation.smi}\") elif isinstance(annotation, PictureDescriptionData): text_parts.append(f\"Picture description: {annotation.text}\") text_res = \"\\n\".join(text_parts) text_res = doc_serializer.post_process(text=text_res) return create_ser_result(text=text_res, span_source=item) In\u00a0[9]: Copied! 
class ImgAnnotationSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc: DoclingDocument):\n return ChunkingDocSerializer(\n doc=doc,\n picture_serializer=AnnotationPictureSerializer(), # configuring a different picture serializer\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=ImgAnnotationSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\nclass ImgAnnotationSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc: DoclingDocument): return ChunkingDocSerializer( doc=doc, picture_serializer=AnnotationPictureSerializer(), # configuring a different picture serializer ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=ImgAnnotationSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE) print_chunk( chunks=chunks, chunk_pos=i, )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=0 num_tokens=128 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 Picture description: In this image we can see a cartoon image of a duck holding a paper. \u2502\n\u2502 Version 1.0 \u2502\n\u2502 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta \u2502\n\u2502 Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar \u2502\n\u2502 AI4K Group, IBM Research R\u00a8 uschlikon, Switzerland \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/advanced_chunking_and_serialization/#advanced-chunking-serialization","title":"Advanced chunking & serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#overview","title":"Overview\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#setup","title":"Setup\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#table-serialization","title":"Table serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-the-default-strategy","title":"Using the default strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#configuring-a-different-strategy","title":"Configuring a different strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#picture-serialization","title":"Picture serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-the-default-strategy","title":"Using the default strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-a-custom-strategy","title":"Using a custom strategy\u00b6","text":""},{"location":"examples/asr_pipeline_performance_comparison/","title":"Asr pipeline performance comparison","text":"In\u00a0[\u00a0]: Copied!
\"\"\"\nPerformance comparison between CPU and MLX Whisper on Apple Silicon.\n\nThis script compares the performance of:\n1. Native Whisper (forced to CPU)\n2. MLX Whisper (Apple Silicon optimized)\n\nBoth use the same model size for fair comparison.\n\"\"\"\n\"\"\" Performance comparison between CPU and MLX Whisper on Apple Silicon. This script compares the performance of: 1. Native Whisper (forced to CPU) 2. MLX Whisper (Apple Silicon optimized) Both use the same model size for fair comparison. \"\"\" In\u00a0[\u00a0]: Copied!
import argparse\nimport sys\nimport time\nfrom pathlib import Path\nimport argparse import sys import time from pathlib import Path In\u00a0[\u00a0]: Copied!
# Add the repository root to the path so we can import docling\nsys.path.insert(0, str(Path(__file__).parent.parent.parent))\n# Add the repository root to the path so we can import docling sys.path.insert(0, str(Path(__file__).parent.parent.parent)) In\u00a0[\u00a0]: Copied!
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import AsrPipelineOptions\nfrom docling.datamodel.pipeline_options_asr_model import (\n InferenceAsrFramework,\n InlineAsrMlxWhisperOptions,\n InlineAsrNativeWhisperOptions,\n)\nfrom docling.document_converter import AudioFormatOption, DocumentConverter\nfrom docling.pipeline.asr_pipeline import AsrPipeline\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import AsrPipelineOptions from docling.datamodel.pipeline_options_asr_model import ( InferenceAsrFramework, InlineAsrMlxWhisperOptions, InlineAsrNativeWhisperOptions, ) from docling.document_converter import AudioFormatOption, DocumentConverter from docling.pipeline.asr_pipeline import AsrPipeline In\u00a0[\u00a0]: Copied!
def create_cpu_whisper_options(model_size: str = \"turbo\"):\n \"\"\"Create native Whisper options forced to CPU.\"\"\"\n return InlineAsrNativeWhisperOptions(\n repo_id=model_size,\n inference_framework=InferenceAsrFramework.WHISPER,\n verbose=True,\n timestamps=True,\n word_timestamps=True,\n temperature=0.0,\n max_new_tokens=256,\n max_time_chunk=30.0,\n )\ndef create_cpu_whisper_options(model_size: str = \"turbo\"): \"\"\"Create native Whisper options forced to CPU.\"\"\" return InlineAsrNativeWhisperOptions( repo_id=model_size, inference_framework=InferenceAsrFramework.WHISPER, verbose=True, timestamps=True, word_timestamps=True, temperature=0.0, max_new_tokens=256, max_time_chunk=30.0, ) In\u00a0[\u00a0]: Copied!
def create_mlx_whisper_options(model_size: str = \"turbo\"):\n \"\"\"Create MLX Whisper options for Apple Silicon.\"\"\"\n model_map = {\n \"tiny\": \"mlx-community/whisper-tiny-mlx\",\n \"small\": \"mlx-community/whisper-small-mlx\",\n \"base\": \"mlx-community/whisper-base-mlx\",\n \"medium\": \"mlx-community/whisper-medium-mlx-8bit\",\n \"large\": \"mlx-community/whisper-large-mlx-8bit\",\n \"turbo\": \"mlx-community/whisper-turbo\",\n }\n\n return InlineAsrMlxWhisperOptions(\n repo_id=model_map[model_size],\n inference_framework=InferenceAsrFramework.MLX,\n language=\"en\",\n task=\"transcribe\",\n word_timestamps=True,\n no_speech_threshold=0.6,\n logprob_threshold=-1.0,\n compression_ratio_threshold=2.4,\n )\n def create_mlx_whisper_options(model_size: str = \"turbo\"): \"\"\"Create MLX Whisper options for Apple Silicon.\"\"\" model_map = { \"tiny\": \"mlx-community/whisper-tiny-mlx\", \"small\": \"mlx-community/whisper-small-mlx\", \"base\": \"mlx-community/whisper-base-mlx\", \"medium\": \"mlx-community/whisper-medium-mlx-8bit\", \"large\": \"mlx-community/whisper-large-mlx-8bit\", \"turbo\": \"mlx-community/whisper-turbo\", } return InlineAsrMlxWhisperOptions( repo_id=model_map[model_size], inference_framework=InferenceAsrFramework.MLX, language=\"en\", task=\"transcribe\", word_timestamps=True, no_speech_threshold=0.6, logprob_threshold=-1.0, compression_ratio_threshold=2.4, ) In\u00a0[\u00a0]: Copied! def run_transcription_test(\n audio_file: Path, asr_options, device: AcceleratorDevice, test_name: str\n):\n \"\"\"Run a single transcription test and return timing results.\"\"\"\n print(f\"\\n{'=' * 60}\")\n print(f\"Running {test_name}\")\n print(f\"Device: {device}\")\n print(f\"Model: {asr_options.repo_id}\")\n print(f\"Framework: {asr_options.inference_framework}\")\n print(f\"{'=' * 60}\")\n\n # Create pipeline options\n pipeline_options = AsrPipelineOptions(\n accelerator_options=AcceleratorOptions(device=device),\n asr_options=asr_options,\n )\n\n # Create document converter\n converter = DocumentConverter(\n format_options={\n InputFormat.AUDIO: AudioFormatOption(\n pipeline_cls=AsrPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n\n # Run transcription with timing\n start_time = time.time()\n try:\n result = converter.convert(audio_file)\n end_time = time.time()\n\n duration = end_time - start_time\n\n if result.status.value == \"success\":\n # Extract text for verification\n text_content = []\n for item in result.document.texts:\n text_content.append(item.text)\n\n print(f\"\u2705 Success! Duration: {duration:.2f} seconds\")\n print(f\"Transcribed text: {''.join(text_content)[:100]}...\")\n return duration, True\n else:\n print(f\"\u274c Failed! 
Status: {result.status}\")\n return duration, False\n\n except Exception as e:\n end_time = time.time()\n duration = end_time - start_time\n print(f\"\u274c Error: {e}\")\n return duration, False\n def run_transcription_test( audio_file: Path, asr_options, device: AcceleratorDevice, test_name: str ): \"\"\"Run a single transcription test and return timing results.\"\"\" print(f\"\\n{'=' * 60}\") print(f\"Running {test_name}\") print(f\"Device: {device}\") print(f\"Model: {asr_options.repo_id}\") print(f\"Framework: {asr_options.inference_framework}\") print(f\"{'=' * 60}\") # Create pipeline options pipeline_options = AsrPipelineOptions( accelerator_options=AcceleratorOptions(device=device), asr_options=asr_options, ) # Create document converter converter = DocumentConverter( format_options={ InputFormat.AUDIO: AudioFormatOption( pipeline_cls=AsrPipeline, pipeline_options=pipeline_options, ) } ) # Run transcription with timing start_time = time.time() try: result = converter.convert(audio_file) end_time = time.time() duration = end_time - start_time if result.status.value == \"success\": # Extract text for verification text_content = [] for item in result.document.texts: text_content.append(item.text) print(f\"\u2705 Success! Duration: {duration:.2f} seconds\") print(f\"Transcribed text: {''.join(text_content)[:100]}...\") return duration, True else: print(f\"\u274c Failed! Status: {result.status}\") return duration, False except Exception as e: end_time = time.time() duration = end_time - start_time print(f\"\u274c Error: {e}\") return duration, False In\u00a0[\u00a0]: Copied! def parse_args():\n \"\"\"Parse command line arguments.\"\"\"\n parser = argparse.ArgumentParser(\n description=\"Performance comparison between CPU and MLX Whisper on Apple Silicon\",\n formatter_class=argparse.RawDescriptionHelpFormatter,\n epilog=\"\"\"\nExamples:\n\n# Use default test audio file\npython asr_pipeline_performance_comparison.py\n\n# Use your own audio file\npython asr_pipeline_performance_comparison.py --audio /path/to/your/audio.mp3\n\n# Use a different audio file from the tests directory\npython asr_pipeline_performance_comparison.py --audio tests/data/audio/another_sample.wav\n \"\"\",\n )\n\n parser.add_argument(\n \"--audio\",\n type=str,\n help=\"Path to audio file for testing (default: tests/data/audio/sample_10s.mp3)\",\n )\n\n return parser.parse_args()\ndef parse_args(): \"\"\"Parse command line arguments.\"\"\" parser = argparse.ArgumentParser( description=\"Performance comparison between CPU and MLX Whisper on Apple Silicon\", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=\"\"\" Examples: # Use default test audio file python asr_pipeline_performance_comparison.py # Use your own audio file python asr_pipeline_performance_comparison.py --audio /path/to/your/audio.mp3 # Use a different audio file from the tests directory python asr_pipeline_performance_comparison.py --audio tests/data/audio/another_sample.wav \"\"\", ) parser.add_argument( \"--audio\", type=str, help=\"Path to audio file for testing (default: tests/data/audio/sample_10s.mp3)\", ) return parser.parse_args() In\u00a0[\u00a0]: Copied!
def main():\n \"\"\"Run performance comparison between CPU and MLX Whisper.\"\"\"\n args = parse_args()\n\n # Check if we're on Apple Silicon\n try:\n import torch\n\n has_mps = torch.backends.mps.is_built() and torch.backends.mps.is_available()\n except ImportError:\n has_mps = False\n\n try:\n import mlx_whisper\n\n has_mlx_whisper = True\n except ImportError:\n has_mlx_whisper = False\n\n print(\"ASR Pipeline Performance Comparison\")\n print(\"=\" * 50)\n print(f\"Apple Silicon (MPS) available: {has_mps}\")\n print(f\"MLX Whisper available: {has_mlx_whisper}\")\n\n if not has_mps:\n print(\"\u26a0\ufe0f Apple Silicon (MPS) not available - running CPU-only comparison\")\n print(\" For MLX Whisper performance benefits, run on Apple Silicon devices\")\n print(\" MLX Whisper is optimized for Apple Silicon devices.\")\n\n if not has_mlx_whisper:\n print(\"\u26a0\ufe0f MLX Whisper not installed - running CPU-only comparison\")\n print(\" Install with: pip install mlx-whisper\")\n print(\" Or: uv sync --extra asr\")\n print(\" For MLX Whisper performance benefits, install the dependency\")\n\n # Determine audio file path\n if args.audio:\n audio_file = Path(args.audio)\n if not audio_file.is_absolute():\n # If relative path, make it relative to the script's directory\n audio_file = Path(__file__).parent.parent.parent / audio_file\n else:\n # Use default test audio file\n audio_file = (\n Path(__file__).parent.parent.parent\n / \"tests\"\n / \"data\"\n / \"audio\"\n / \"sample_10s.mp3\"\n )\n\n if not audio_file.exists():\n print(f\"\u274c Audio file not found: {audio_file}\")\n print(\" Please check the path and try again.\")\n sys.exit(1)\n\n print(f\"Using test audio: {audio_file}\")\n print(f\"File size: {audio_file.stat().st_size / 1024:.1f} KB\")\n\n # Test different model sizes\n model_sizes = [\"tiny\", \"base\", \"turbo\"]\n results = {}\n\n for model_size in model_sizes:\n print(f\"\\n{'#' * 80}\")\n print(f\"Testing model size: {model_size}\")\n print(f\"{'#' * 80}\")\n\n model_results = {}\n\n # Test 1: Native Whisper (forced to CPU)\n cpu_options = create_cpu_whisper_options(model_size)\n cpu_duration, cpu_success = run_transcription_test(\n audio_file,\n cpu_options,\n AcceleratorDevice.CPU,\n f\"Native Whisper {model_size} (CPU)\",\n )\n model_results[\"cpu\"] = {\"duration\": cpu_duration, \"success\": cpu_success}\n\n # Test 2: MLX Whisper (Apple Silicon optimized) - only if available\n if has_mps and has_mlx_whisper:\n mlx_options = create_mlx_whisper_options(model_size)\n mlx_duration, mlx_success = run_transcription_test(\n audio_file,\n mlx_options,\n AcceleratorDevice.MPS,\n f\"MLX Whisper {model_size} (MPS)\",\n )\n model_results[\"mlx\"] = {\"duration\": mlx_duration, \"success\": mlx_success}\n else:\n print(f\"\\n{'=' * 60}\")\n print(f\"Skipping MLX Whisper {model_size} (MPS) - not available\")\n print(f\"{'=' * 60}\")\n model_results[\"mlx\"] = {\"duration\": 0.0, \"success\": False}\n\n results[model_size] = model_results\n\n # Print summary\n print(f\"\\n{'#' * 80}\")\n print(\"PERFORMANCE COMPARISON SUMMARY\")\n print(f\"{'#' * 80}\")\n print(\n f\"{'Model':<10} {'CPU (sec)':<12} {'MLX (sec)':<12} {'Speedup':<12} {'Status':<10}\"\n )\n print(\"-\" * 80)\n\n for model_size, model_results in results.items():\n cpu_duration = model_results[\"cpu\"][\"duration\"]\n mlx_duration = model_results[\"mlx\"][\"duration\"]\n cpu_success = model_results[\"cpu\"][\"success\"]\n mlx_success = model_results[\"mlx\"][\"success\"]\n\n if cpu_success and mlx_success:\n speedup = 
cpu_duration / mlx_duration\n status = \"\u2705 Both OK\"\n elif cpu_success:\n speedup = float(\"inf\")\n status = \"\u274c MLX Failed\"\n elif mlx_success:\n speedup = 0\n status = \"\u274c CPU Failed\"\n else:\n speedup = 0\n status = \"\u274c Both Failed\"\n\n print(\n f\"{model_size:<10} {cpu_duration:<12.2f} {mlx_duration:<12.2f} {speedup:<12.2f}x {status:<10}\"\n )\n\n # Calculate overall improvement\n successful_tests = [\n (r[\"cpu\"][\"duration\"], r[\"mlx\"][\"duration\"])\n for r in results.values()\n if r[\"cpu\"][\"success\"] and r[\"mlx\"][\"success\"]\n ]\n\n if successful_tests:\n avg_cpu = sum(cpu for cpu, mlx in successful_tests) / len(successful_tests)\n avg_mlx = sum(mlx for cpu, mlx in successful_tests) / len(successful_tests)\n avg_speedup = avg_cpu / avg_mlx\n\n print(\"-\" * 80)\n print(\n f\"{'AVERAGE':<10} {avg_cpu:<12.2f} {avg_mlx:<12.2f} {avg_speedup:<12.2f}x {'Overall':<10}\"\n )\n\n print(f\"\\n\ud83c\udfaf MLX Whisper provides {avg_speedup:.1f}x average speedup over CPU!\")\n else:\n if has_mps and has_mlx_whisper:\n print(\"\\n\u274c No successful comparisons available.\")\n else:\n print(\"\\n\u26a0\ufe0f MLX Whisper not available - only CPU results shown.\")\n print(\n \" Install MLX Whisper and run on Apple Silicon for performance comparison.\"\n )\n def main(): \"\"\"Run performance comparison between CPU and MLX Whisper.\"\"\" args = parse_args() # Check if we're on Apple Silicon try: import torch has_mps = torch.backends.mps.is_built() and torch.backends.mps.is_available() except ImportError: has_mps = False try: import mlx_whisper has_mlx_whisper = True except ImportError: has_mlx_whisper = False print(\"ASR Pipeline Performance Comparison\") print(\"=\" * 50) print(f\"Apple Silicon (MPS) available: {has_mps}\") print(f\"MLX Whisper available: {has_mlx_whisper}\") if not has_mps: print(\"\u26a0\ufe0f Apple Silicon (MPS) not available - running CPU-only comparison\") print(\" For MLX Whisper performance benefits, run on Apple Silicon devices\") print(\" MLX Whisper is optimized for Apple Silicon devices.\") if not has_mlx_whisper: print(\"\u26a0\ufe0f MLX Whisper not installed - running CPU-only comparison\") print(\" Install with: pip install mlx-whisper\") print(\" Or: uv sync --extra asr\") print(\" For MLX Whisper performance benefits, install the dependency\") # Determine audio file path if args.audio: audio_file = Path(args.audio) if not audio_file.is_absolute(): # If relative path, make it relative to the script's directory audio_file = Path(__file__).parent.parent.parent / audio_file else: # Use default test audio file audio_file = ( Path(__file__).parent.parent.parent / \"tests\" / \"data\" / \"audio\" / \"sample_10s.mp3\" ) if not audio_file.exists(): print(f\"\u274c Audio file not found: {audio_file}\") print(\" Please check the path and try again.\") sys.exit(1) print(f\"Using test audio: {audio_file}\") print(f\"File size: {audio_file.stat().st_size / 1024:.1f} KB\") # Test different model sizes model_sizes = [\"tiny\", \"base\", \"turbo\"] results = {} for model_size in model_sizes: print(f\"\\n{'#' * 80}\") print(f\"Testing model size: {model_size}\") print(f\"{'#' * 80}\") model_results = {} # Test 1: Native Whisper (forced to CPU) cpu_options = create_cpu_whisper_options(model_size) cpu_duration, cpu_success = run_transcription_test( audio_file, cpu_options, AcceleratorDevice.CPU, f\"Native Whisper {model_size} (CPU)\", ) model_results[\"cpu\"] = {\"duration\": cpu_duration, \"success\": cpu_success} # Test 2: MLX Whisper (Apple 
Silicon optimized) - only if available if has_mps and has_mlx_whisper: mlx_options = create_mlx_whisper_options(model_size) mlx_duration, mlx_success = run_transcription_test( audio_file, mlx_options, AcceleratorDevice.MPS, f\"MLX Whisper {model_size} (MPS)\", ) model_results[\"mlx\"] = {\"duration\": mlx_duration, \"success\": mlx_success} else: print(f\"\\n{'=' * 60}\") print(f\"Skipping MLX Whisper {model_size} (MPS) - not available\") print(f\"{'=' * 60}\") model_results[\"mlx\"] = {\"duration\": 0.0, \"success\": False} results[model_size] = model_results # Print summary print(f\"\\n{'#' * 80}\") print(\"PERFORMANCE COMPARISON SUMMARY\") print(f\"{'#' * 80}\") print( f\"{'Model':<10} {'CPU (sec)':<12} {'MLX (sec)':<12} {'Speedup':<12} {'Status':<10}\" ) print(\"-\" * 80) for model_size, model_results in results.items(): cpu_duration = model_results[\"cpu\"][\"duration\"] mlx_duration = model_results[\"mlx\"][\"duration\"] cpu_success = model_results[\"cpu\"][\"success\"] mlx_success = model_results[\"mlx\"][\"success\"] if cpu_success and mlx_success: speedup = cpu_duration / mlx_duration status = \"\u2705 Both OK\" elif cpu_success: speedup = float(\"inf\") status = \"\u274c MLX Failed\" elif mlx_success: speedup = 0 status = \"\u274c CPU Failed\" else: speedup = 0 status = \"\u274c Both Failed\" print( f\"{model_size:<10} {cpu_duration:<12.2f} {mlx_duration:<12.2f} {speedup:<12.2f}x {status:<10}\" ) # Calculate overall improvement successful_tests = [ (r[\"cpu\"][\"duration\"], r[\"mlx\"][\"duration\"]) for r in results.values() if r[\"cpu\"][\"success\"] and r[\"mlx\"][\"success\"] ] if successful_tests: avg_cpu = sum(cpu for cpu, mlx in successful_tests) / len(successful_tests) avg_mlx = sum(mlx for cpu, mlx in successful_tests) / len(successful_tests) avg_speedup = avg_cpu / avg_mlx print(\"-\" * 80) print( f\"{'AVERAGE':<10} {avg_cpu:<12.2f} {avg_mlx:<12.2f} {avg_speedup:<12.2f}x {'Overall':<10}\" ) print(f\"\\n\ud83c\udfaf MLX Whisper provides {avg_speedup:.1f}x average speedup over CPU!\") else: if has_mps and has_mlx_whisper: print(\"\\n\u274c No successful comparisons available.\") else: print(\"\\n\u26a0\ufe0f MLX Whisper not available - only CPU results shown.\") print( \" Install MLX Whisper and run on Apple Silicon for performance comparison.\" ) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/backend_csv/","title":"Conversion of CSV files","text":"In\u00a0[59]: Copied!
from pathlib import Path\n\nfrom docling.document_converter import DocumentConverter\n\n# Convert CSV to Docling document\nconverter = DocumentConverter()\nresult = converter.convert(Path(\"../../tests/data/csv/csv-comma.csv\"))\noutput = result.document.export_to_markdown()\nfrom pathlib import Path from docling.document_converter import DocumentConverter # Convert CSV to Docling document converter = DocumentConverter() result = converter.convert(Path(\"../../tests/data/csv/csv-comma.csv\")) output = result.document.export_to_markdown()
This code generates the following output:
Index Customer Id First Name Last Name Company City Country Phone 1 Phone 2 Email Subscription Date Website 1 DD37Cf93aecA6Dc Sheryl Baxter Rasmussen Group East Leonard Chile 229.077.5154 397.884.0519x718 zunigavanessa@smith.info 2020-08-24 http://www.stephenson.com/ 2 1Ef7b82A4CAAD10 Preston Lozano, Dr Vega-Gentry East Jimmychester Djibouti 5153435776 686-620-1820x944 vmata@colon.com 2021-04-23 http://www.hobbs.com/ 3 6F94879bDAfE5a6 Roy Berry Murillo-Perry Isabelborough Antigua and Barbuda +1-539-402-0259 (496)978-3969x58947 beckycarr@hogan.com 2020-03-25 http://www.lawrence.com/ 4 5Cef8BFA16c5e3c Linda Olsen Dominguez, Mcmillan and Donovan Bensonview Dominican Republic 001-808-617-6467x12895 +1-813-324-8756 stanleyblackwell@benson.org 2020-06-02 http://www.good-lyons.com/ 5 053d585Ab6b3159 Joanna Bender Martin, Lang and Andrade West Priscilla Slovakia (Slovak Republic) 001-234-203-0635x76146 001-199-446-3860x3486 colinalvarado@miles.net 2021-04-17 https://goodwin-ingram.com/"},{"location":"examples/backend_csv/#conversion-of-csv-files","title":"Conversion of CSV files\u00b6","text":"This example shows how to convert CSV files to a structured Docling Document.
Supported delimiters include: , ; | [tab]. This is an example of using Docling for converting structured data (XML) into a unified document representation format, DoclingDocument, and leveraging its rich structured content for RAG applications.
Data used in this example consist of patents from the United States Patent and Trademark Office (USPTO) and medical articles from PubMed Central\u00ae (PMC).
In this notebook, we accomplish the following:
For more details on document chunking with Docling, refer to the Chunking documentation. For RAG with Docling and LlamaIndex, also check the example RAG with LlamaIndex.
In\u00a0[1]: Copied!from docling.document_converter import DocumentConverter\n\n# a sample PMC article:\nsource = \"../../tests/data/jats/elife-56337.nxml\"\nconverter = DocumentConverter()\nresult = converter.convert(source)\nprint(result.status)\nfrom docling.document_converter import DocumentConverter # a sample PMC article: source = \"../../tests/data/jats/elife-56337.nxml\" converter = DocumentConverter() result = converter.convert(source) print(result.status)
ConversionStatus.SUCCESS\n
Once the document is converted, it can be exported to any format supported by Docling. For instance, to Markdown (only the first lines are shown here):
In\u00a0[2]: Copied!md_doc = result.document.export_to_markdown()\n\ndelim = \"\\n\"\nprint(delim.join(md_doc.split(delim)[:8]))\nmd_doc = result.document.export_to_markdown() delim = \"\\n\" print(delim.join(md_doc.split(delim)[:8]))
# KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage\n\nGernot Wolf, Alberto de Iaco, Ming-An Sun, Melania Bruno, Matthew Tinkham, Don Hoang, Apratim Mitra, Sherry Ralls, Didier Trono, Todd S Macfarlan\n\nThe Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health, Bethesda, United States; School of Life Sciences, \u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Lausanne, Switzerland\n\n## Abstract\n\n
If the XML file is not supported, a ConversionError will be raised.
from io import BytesIO\n\nfrom docling.datamodel.base_models import DocumentStream\nfrom docling.exceptions import ConversionError\n\nxml_content = (\n b'<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE docling_test SYSTEM '\n b'\"test.dtd\"><docling>Random content</docling>'\n)\nstream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content))\ntry:\n result = converter.convert(stream)\nexcept ConversionError as ce:\n print(ce)\nfrom io import BytesIO from docling.datamodel.base_models import DocumentStream from docling.exceptions import ConversionError xml_content = ( b' Random content' ) stream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content)) try: result = converter.convert(stream) except ConversionError as ce: print(ce)
Input document docling_test.xml does not match any allowed format.\n
File format not allowed: docling_test.xml\n
You can always refer to the Usage documentation page for a list of supported formats.
Requirements can be installed as shown below. The --no-warn-conflicts argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage.
%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv
Note: you may need to restart the kernel to use updated packages.\n
This notebook uses HuggingFace's Inference API. For an increased LLM quota, a token can be provided via the environment variable HF_TOKEN.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
In\u00a0[5]: Copied!import os\nfrom warnings import filterwarnings\n\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\nfilterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\nimport os from warnings import filterwarnings from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")
We can now define the main parameters:
In\u00a0[6]: Copied!from pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\nEMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\"\nEMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)\nTEMP_DIR = Path(mkdtemp())\nMILVUS_URI = str(TEMP_DIR / \"docling.db\")\nGEN_MODEL = HuggingFaceInferenceAPI(\n token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n)\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\nfrom pathlib import Path from tempfile import mkdtemp from llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI EMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\" EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID) TEMP_DIR = Path(mkdtemp()) MILVUS_URI = str(TEMP_DIR / \"docling.db\") GEN_MODEL = HuggingFaceInferenceAPI( token=_get_env_from_colab_or_os(\"HF_TOKEN\"), model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", ) embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\")) # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"
In this notebook we will use XML data from collections supported by Docling:
.tar.gz files. Each file contains the full article data in XML format, among other supplementary files like images or spreadsheets. The raw files will be downloaded from the source and saved in a temporary directory.
In\u00a0[7]: Copied!import tarfile\nfrom io import BytesIO\n\nimport requests\n\n# PMC article PMC11703268\nurl: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\"\n\nprint(f\"Downloading {url}...\")\nbuf = BytesIO(requests.get(url).content)\nprint(\"Extracting and storing the XML file containing the article text...\")\nwith tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file:\n for tarinfo in tar_file:\n if tarinfo.isreg():\n file_path = Path(tarinfo.name)\n if file_path.suffix == \".nxml\":\n with open(TEMP_DIR / file_path.name, \"wb\") as file_obj:\n file_obj.write(tar_file.extractfile(tarinfo).read())\n print(f\"Stored XML file {file_path.name}\")\n import tarfile from io import BytesIO import requests # PMC article PMC11703268 url: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\" print(f\"Downloading {url}...\") buf = BytesIO(requests.get(url).content) print(\"Extracting and storing the XML file containing the article text...\") with tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file: for tarinfo in tar_file: if tarinfo.isreg(): file_path = Path(tarinfo.name) if file_path.suffix == \".nxml\": with open(TEMP_DIR / file_path.name, \"wb\") as file_obj: file_obj.write(tar_file.extractfile(tarinfo).read()) print(f\"Stored XML file {file_path.name}\") Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz...\nExtracting and storing the XML file containing the article text...\nStored XML file nihpp-2024.12.26.630351v1.nxml\nIn\u00a0[8]: Copied!
import zipfile\n\n# Patent grants from December 17-23, 2024\nurl: str = (\n \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\"\n)\nXML_SPLITTER: str = '<?xml version=\"1.0\"'\ndoc_num: int = 0\n\nprint(f\"Downloading {url}...\")\nbuf = BytesIO(requests.get(url).content)\nprint(\"Parsing zip file, splitting into XML sections, and exporting to files...\")\nwith zipfile.ZipFile(buf) as zf:\n res = zf.testzip()\n if res:\n print(\"Error validating zip file\")\n else:\n with zf.open(zf.namelist()[0]) as xf:\n is_patent = False\n patent_buffer = BytesIO()\n for xf_line in xf:\n decoded_line = xf_line.decode(errors=\"ignore\").rstrip()\n xml_index = decoded_line.find(XML_SPLITTER)\n if xml_index != -1:\n if (\n xml_index > 0\n ): # cases like </sequence-cwu><?xml version=\"1.0\"...\n patent_buffer.write(xf_line[:xml_index])\n patent_buffer.write(b\"\\r\\n\")\n xf_line = xf_line[xml_index:]\n if patent_buffer.getbuffer().nbytes > 0 and is_patent:\n doc_num += 1\n patent_id = f\"ipg241217-{doc_num}\"\n with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj:\n file_obj.write(patent_buffer.getbuffer())\n is_patent = False\n patent_buffer = BytesIO()\n elif decoded_line.startswith(\"<!DOCTYPE\"):\n is_patent = True\n patent_buffer.write(xf_line)\n import zipfile # Patent grants from December 17-23, 2024 url: str = ( \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\" ) XML_SPLITTER: str = ' 0 ): # cases like 0 and is_patent: doc_num += 1 patent_id = f\"ipg241217-{doc_num}\" with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj: file_obj.write(patent_buffer.getbuffer()) is_patent = False patent_buffer = BytesIO() elif decoded_line.startswith(\" Downloading https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip...\nParsing zip file, splitting into XML sections, and exporting to files...\nIn\u00a0[9]: Copied!
print(f\"Fetched and exported {doc_num} documents.\")\n print(f\"Fetched and exported {doc_num} documents.\") Fetched and exported 4014 documents.\n
The DoclingDocument format of the converted patents has a rich hierarchical structure, inherited from the original XML document and preserved by the Docling custom backend. In this notebook, we will leverage:
SimpleDirectoryReader pattern to iterate over the exported XML files created in section Fetch the data.DoclingReader and DoclingNodeParser, to ingest the patent chunks into a Milvus vector store.HierarchicalChunker implementation, which applies a document-based hierarchical chunking, to leverage the patent structures like sections and paragraphs within sections.Refer to other possible implementations and usage patterns in the Chunking documentation and the RAG with LlamaIndex notebook.
In\u00a0[13]: Copied!from llama_index.core import SimpleDirectoryReader\nfrom llama_index.readers.docling import DoclingReader\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\ndir_reader = SimpleDirectoryReader(\n input_dir=TEMP_DIR,\n exclude=[\"docling.db\", \"*.nxml\"],\n file_extractor={\".xml\": reader},\n filename_as_id=True,\n num_files_limit=100,\n)\n from llama_index.core import SimpleDirectoryReader from llama_index.readers.docling import DoclingReader reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) dir_reader = SimpleDirectoryReader( input_dir=TEMP_DIR, exclude=[\"docling.db\", \"*.nxml\"], file_extractor={\".xml\": reader}, filename_as_id=True, num_files_limit=100, ) In\u00a0[14]: Copied! from llama_index.node_parser.docling import DoclingNodeParser\n\nnode_parser = DoclingNodeParser()\nfrom llama_index.node_parser.docling import DoclingNodeParser node_parser = DoclingNodeParser() In\u00a0[\u00a0]: Copied!
from llama_index.core import StorageContext, VectorStoreIndex\nfrom llama_index.vector_stores.milvus import MilvusVectorStore\n\nvector_store = MilvusVectorStore(\n uri=MILVUS_URI,\n dim=embed_dim,\n overwrite=True,\n)\n\nindex = VectorStoreIndex.from_documents(\n documents=dir_reader.load_data(show_progress=True),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n show_progress=True,\n)\nfrom llama_index.core import StorageContext, VectorStoreIndex from llama_index.vector_stores.milvus import MilvusVectorStore vector_store = MilvusVectorStore( uri=MILVUS_URI, dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=dir_reader.load_data(show_progress=True), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, show_progress=True, )
Finally, add the PMC article to the vector store directly from the reader.
In\u00a0[14]: Copied!index.from_documents(\n documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nindex.from_documents( documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) Out[14]:
<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x373a7f7d0>
The retriever can be used to identify highly relevant documents:
In\u00a0[15]: Copied!retriever = index.as_retriever(similarity_top_k=3)\nresults = retriever.retrieve(\"What patents are related to fitness devices?\")\n\nfor item in results:\n print(item)\nretriever = index.as_retriever(similarity_top_k=3) results = retriever.retrieve(\"What patents are related to fitness devices?\") for item in results: print(item)
Node ID: 5afd36c0-a739-4a88-a51c-6d0f75358db5\nText: The portable fitness monitoring device 102 may be a device such\nas, for example, a mobile phone, a personal digital assistant, a music\nfile player (e.g. and MP3 player), an intelligent article for wearing\n(e.g. a fitness monitoring garment, wrist band, or watch), a dongle\n(e.g. a small hardware device that protects software) that includes a\nfitn...\nScore: 0.772\n\nNode ID: f294b5fd-9089-43cb-8c4e-d1095a634ff1\nText: US Patent Application US 20120071306 entitled \u201cPortable\nMultipurpose Whole Body Exercise Device\u201d discloses a portable\nmultipurpose whole body exercise device which can be used for general\nfitness, Pilates-type, core strengthening, therapeutic, and\nrehabilitative exercises as well as stretching and physical therapy\nand which includes storable acc...\nScore: 0.749\n\nNode ID: 8251c7ef-1165-42e1-8c91-c99c8a711bf7\nText: Program products, methods, and systems for providing fitness\nmonitoring services of the present invention can include any software\napplication executed by one or more computing devices. A computing\ndevice can be any type of computing device having one or more\nprocessors. For example, a computing device can be a workstation,\nmobile device (e.g., ...\nScore: 0.744\n\n
With the query engine, we can run question answering with the RAG pattern over the set of indexed documents.
First, we can prompt the LLM directly:
In\u00a0[16]: Copied!from llama_index.core.base.llms.types import ChatMessage, MessageRole\nfrom rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console()\nquery = \"Do mosquitoes in high altitude expand viruses over large distances?\"\n\nusr_msg = ChatMessage(role=MessageRole.USER, content=query)\nresponse = GEN_MODEL.chat(messages=[usr_msg])\n\nconsole.print(Panel(query, title=\"Prompt\", border_style=\"bold red\"))\nconsole.print(\n Panel(\n response.message.content.strip(),\n title=\"Generated Content\",\n border_style=\"bold green\",\n )\n)\nfrom llama_index.core.base.llms.types import ChatMessage, MessageRole from rich.console import Console from rich.panel import Panel console = Console() query = \"Do mosquitoes in high altitude expand viruses over large distances?\" usr_msg = ChatMessage(role=MessageRole.USER, content=query) response = GEN_MODEL.chat(messages=[usr_msg]) console.print(Panel(query, title=\"Prompt\", border_style=\"bold red\")) console.print( Panel( response.message.content.strip(), title=\"Generated Content\", border_style=\"bold green\", ) )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Do mosquitoes in high altitude expand viruses over large distances? \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not \u2502\n\u2502 primarily dependent on altitude. Mosquitoes are vectors for various diseases, such as malaria, dengue fever, \u2502\n\u2502 and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, \u2502\n\u2502 and environmental conditions that support their survival and reproduction. \u2502\n\u2502 \u2502\n\u2502 At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder \u2502\n\u2502 temperatures, lower humidity, and stronger winds, which can limit their population size and distribution. \u2502\n\u2502 However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases \u2502\n\u2502 in these areas. \u2502\n\u2502 \u2502\n\u2502 It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is \u2502\n\u2502 not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance \u2502\n\u2502 transmission of viruses is more often associated with human travel and transportation, which can rapidly spread \u2502\n\u2502 infected mosquitoes or humans to new areas, leading to the spread of disease. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
Now, we can compare the response when the model is prompted with the indexed PMC article as supporting context:
In\u00a0[17]: Copied!from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n\nfilters = MetadataFilters(\n filters=[\n ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"),\n ]\n)\n\nquery_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3)\nresult = query_engine.query(query)\n\nconsole.print(\n Panel(\n result.response.strip(),\n title=\"Generated Content with RAG\",\n border_style=\"bold green\",\n )\n)\nfrom llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters filters = MetadataFilters( filters=[ ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"), ] ) query_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3) result = query_engine.query(query) console.print( Panel( result.response.strip(), title=\"Generated Content with RAG\", border_style=\"bold green\", ) )
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content with RAG \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female \u2502\n\u2502 mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with \u2502\n\u2502 arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with \u2502\n\u2502 flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, \u2502\n\u2502 including three arboviruses that affect humans (dengue, West Nile, and M\u2019Poko viruses). The study provides \u2502\n\u2502 compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/backend_xml_rag/#conversion-of-custom-xml","title":"Conversion of custom XML\u00b6","text":""},{"location":"examples/backend_xml_rag/#overview","title":"Overview\u00b6","text":""},{"location":"examples/backend_xml_rag/#simple-conversion","title":"Simple conversion\u00b6","text":"
XML is a file format that defines and stores data in a way that is both human-readable and machine-readable. Because each XML schema defines its own structure, Docling requires custom backend processors to interpret the XML definitions and convert them into DoclingDocument objects.
Some public data collections in XML format are already supported by Docling (USPTO patents and PMC articles). In these cases, the document conversion is straightforward and the same as with any other supported format, such as PDF or HTML. The execution example in Simple Conversion is the recommended usage of Docling for a single file:
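For instance, a minimal sketch for a single USPTO patent file (the file path below is hypothetical; any supported XML collection can be passed the same way):

from docling.document_converter import DocumentConverter

# Hypothetical path to a single USPTO patent grant extracted as XML;
# the converter auto-detects the format and picks the matching backend.
source = "path/to/ipg-patent.xml"

converter = DocumentConverter()
result = converter.convert(source)
print(result.status)
print(result.document.export_to_markdown()[:500])  # preview the first characters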
"},{"location":"examples/backend_xml_rag/#end-to-end-application","title":"End-to-end application\u00b6","text":"This section describes a step-by-step application for processing XML files from supported public collections and use them for question-answering.
"},{"location":"examples/backend_xml_rag/#setup","title":"Setup\u00b6","text":""},{"location":"examples/backend_xml_rag/#fetch-the-data","title":"Fetch the data\u00b6","text":""},{"location":"examples/backend_xml_rag/#pmc-articles","title":"PMC articles\u00b6","text":"The OA file is a manifest file of all the PMC articles, including the URL path to download the source files. In this notebook we will use as example the article Pathogens spread by high-altitude windborne mosquitoes, which is available in the archive file PMC11703268.tar.gz.
"},{"location":"examples/backend_xml_rag/#uspto-patents","title":"USPTO patents\u00b6","text":"Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, split its content in sections, and dumps each section as an XML file. For simplicity, this pipeline is shown here in a sequential manner, but it could be parallelized.
"},{"location":"examples/backend_xml_rag/#parse-chunk-and-index","title":"Parse, chunk, and index\u00b6","text":""},{"location":"examples/backend_xml_rag/#set-the-docling-reader-and-the-directory-reader","title":"Set the Docling reader and the directory reader\u00b6","text":"Note that DoclingReader uses Docling's DocumentConverter by default and therefore it will recognize the format of the XML files and leverage the PatentUsptoDocumentBackend automatically.
For demonstration purposes, we limit the scope of the analysis to the first 100 patents.
"},{"location":"examples/backend_xml_rag/#set-the-node-parser","title":"Set the node parser\u00b6","text":"Note that the HierarchicalChunker is the default chunking implementation of the DoclingNodeParser.
Batch convert multiple PDF files and export results in several formats.
What this example does
scratch/ in multiple formats (JSON, HTML, Markdown, text, doctags, YAML).Prerequisites
docling from your Python environment.Input documents
tests/data/pdf/ in the repo.input_doc_paths below to point to PDFs on your machine.Output formats (controlled by flags)
USE_V2 = True enables the current Docling document exports (recommended).USE_LEGACY = False keeps legacy Deep Search exports disabled. You can set it to True if you need legacy formats for compatibility tests.Notes
pipeline_options.generate_page_images = True to include page images in HTML.import json\nimport logging\nimport time\nfrom collections.abc import Iterable\nfrom pathlib import Path\n\nimport yaml\nfrom docling_core.types.doc import ImageRefMode\n\nfrom docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\n# Export toggles:\n# - USE_V2 controls modern Docling document exports.\n# - USE_LEGACY enables legacy Deep Search exports for comparison or migration.\nUSE_V2 = True\nUSE_LEGACY = False\n\n\ndef export_documents(\n conv_results: Iterable[ConversionResult],\n output_dir: Path,\n):\n output_dir.mkdir(parents=True, exist_ok=True)\n\n success_count = 0\n failure_count = 0\n partial_success_count = 0\n\n for conv_res in conv_results:\n if conv_res.status == ConversionStatus.SUCCESS:\n success_count += 1\n doc_filename = conv_res.input.file.stem\n\n if USE_V2:\n # Recommended modern Docling exports. These helpers mirror the\n # lower-level \"export_to_*\" methods used below, but handle\n # common details like image handling.\n conv_res.document.save_as_json(\n output_dir / f\"{doc_filename}.json\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n conv_res.document.save_as_html(\n output_dir / f\"{doc_filename}.html\",\n image_mode=ImageRefMode.EMBEDDED,\n )\n conv_res.document.save_as_doctags(\n output_dir / f\"{doc_filename}.doctags.txt\"\n )\n conv_res.document.save_as_markdown(\n output_dir / f\"{doc_filename}.md\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n conv_res.document.save_as_markdown(\n output_dir / f\"{doc_filename}.txt\",\n image_mode=ImageRefMode.PLACEHOLDER,\n strict_text=True,\n )\n\n # Export Docling document format to YAML:\n with (output_dir / f\"{doc_filename}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(conv_res.document.export_to_dict()))\n\n # Export Docling document format to doctags:\n with (output_dir / f\"{doc_filename}.doctags.txt\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_doctags())\n\n # Export Docling document format to markdown:\n with (output_dir / f\"{doc_filename}.md\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_markdown())\n\n # Export Docling document format to text:\n with (output_dir / f\"{doc_filename}.txt\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_markdown(strict_text=True))\n\n if USE_LEGACY:\n # Export Deep Search document JSON format:\n with (output_dir / f\"{doc_filename}.legacy.json\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(json.dumps(conv_res.legacy_document.export_to_dict()))\n\n # Export Text format:\n with (output_dir / f\"{doc_filename}.legacy.txt\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(\n conv_res.legacy_document.export_to_markdown(strict_text=True)\n )\n\n # Export Markdown format:\n with (output_dir / f\"{doc_filename}.legacy.md\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(conv_res.legacy_document.export_to_markdown())\n\n # Export Document Tags format:\n with (output_dir / f\"{doc_filename}.legacy.doctags.txt\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(conv_res.legacy_document.export_to_document_tokens())\n\n elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS:\n _log.info(\n 
f\"Document {conv_res.input.file} was partially converted with the following errors:\"\n )\n for item in conv_res.errors:\n _log.info(f\"\\t{item.error_message}\")\n partial_success_count += 1\n else:\n _log.info(f\"Document {conv_res.input.file} failed to convert.\")\n failure_count += 1\n\n _log.info(\n f\"Processed {success_count + partial_success_count + failure_count} docs, \"\n f\"of which {failure_count} failed \"\n f\"and {partial_success_count} were partially converted.\"\n )\n return success_count, partial_success_count, failure_count\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n # Location of sample PDFs used by this example. If your checkout does not\n # include test data, change `data_folder` or point `input_doc_paths` to\n # your own files.\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_paths = [\n data_folder / \"pdf/2206.01062.pdf\",\n data_folder / \"pdf/2203.01017v2.pdf\",\n data_folder / \"pdf/2305.03393v1.pdf\",\n data_folder / \"pdf/redp5110_sampled.pdf\",\n ]\n\n # buf = BytesIO((data_folder / \"pdf/2206.01062.pdf\").open(\"rb\").read())\n # docs = [DocumentStream(name=\"my_doc.pdf\", stream=buf)]\n # input = DocumentConversionInput.from_streams(docs)\n\n # # Turn on inline debug visualizations:\n # settings.debug.visualize_layout = True\n # settings.debug.visualize_ocr = True\n # settings.debug.visualize_tables = True\n # settings.debug.visualize_cells = True\n\n # Configure the PDF pipeline. Enabling page image generation improves HTML\n # previews (embedded images) but adds processing time.\n pipeline_options = PdfPipelineOptions()\n pipeline_options.generate_page_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options, backend=DoclingParseV4DocumentBackend\n )\n }\n )\n\n start_time = time.time()\n\n # Convert all inputs. Set `raises_on_error=False` to keep processing other\n # files even if one fails; errors are summarized after the run.\n conv_results = doc_converter.convert_all(\n input_doc_paths,\n raises_on_error=False, # to let conversion run through all and examine results at the end\n )\n # Write outputs to ./scratch and log a summary.\n _success_count, _partial_success_count, failure_count = export_documents(\n conv_results, output_dir=Path(\"scratch\")\n )\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document conversion complete in {end_time:.2f} seconds.\")\n\n if failure_count > 0:\n raise RuntimeError(\n f\"The example failed converting {failure_count} on {len(input_doc_paths)}.\"\n )\n\n\nif __name__ == \"__main__\":\n main()\n import json import logging import time from collections.abc import Iterable from pathlib import Path import yaml from docling_core.types.doc import ImageRefMode from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) # Export toggles: # - USE_V2 controls modern Docling document exports. # - USE_LEGACY enables legacy Deep Search exports for comparison or migration. 
USE_V2 = True USE_LEGACY = False def export_documents( conv_results: Iterable[ConversionResult], output_dir: Path, ): output_dir.mkdir(parents=True, exist_ok=True) success_count = 0 failure_count = 0 partial_success_count = 0 for conv_res in conv_results: if conv_res.status == ConversionStatus.SUCCESS: success_count += 1 doc_filename = conv_res.input.file.stem if USE_V2: # Recommended modern Docling exports. These helpers mirror the # lower-level \"export_to_*\" methods used below, but handle # common details like image handling. conv_res.document.save_as_json( output_dir / f\"{doc_filename}.json\", image_mode=ImageRefMode.PLACEHOLDER, ) conv_res.document.save_as_html( output_dir / f\"{doc_filename}.html\", image_mode=ImageRefMode.EMBEDDED, ) conv_res.document.save_as_doctags( output_dir / f\"{doc_filename}.doctags.txt\" ) conv_res.document.save_as_markdown( output_dir / f\"{doc_filename}.md\", image_mode=ImageRefMode.PLACEHOLDER, ) conv_res.document.save_as_markdown( output_dir / f\"{doc_filename}.txt\", image_mode=ImageRefMode.PLACEHOLDER, strict_text=True, ) # Export Docling document format to YAML: with (output_dir / f\"{doc_filename}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(conv_res.document.export_to_dict())) # Export Docling document format to doctags: with (output_dir / f\"{doc_filename}.doctags.txt\").open(\"w\") as fp: fp.write(conv_res.document.export_to_doctags()) # Export Docling document format to markdown: with (output_dir / f\"{doc_filename}.md\").open(\"w\") as fp: fp.write(conv_res.document.export_to_markdown()) # Export Docling document format to text: with (output_dir / f\"{doc_filename}.txt\").open(\"w\") as fp: fp.write(conv_res.document.export_to_markdown(strict_text=True)) if USE_LEGACY: # Export Deep Search document JSON format: with (output_dir / f\"{doc_filename}.legacy.json\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(json.dumps(conv_res.legacy_document.export_to_dict())) # Export Text format: with (output_dir / f\"{doc_filename}.legacy.txt\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write( conv_res.legacy_document.export_to_markdown(strict_text=True) ) # Export Markdown format: with (output_dir / f\"{doc_filename}.legacy.md\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(conv_res.legacy_document.export_to_markdown()) # Export Document Tags format: with (output_dir / f\"{doc_filename}.legacy.doctags.txt\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(conv_res.legacy_document.export_to_document_tokens()) elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS: _log.info( f\"Document {conv_res.input.file} was partially converted with the following errors:\" ) for item in conv_res.errors: _log.info(f\"\\t{item.error_message}\") partial_success_count += 1 else: _log.info(f\"Document {conv_res.input.file} failed to convert.\") failure_count += 1 _log.info( f\"Processed {success_count + partial_success_count + failure_count} docs, \" f\"of which {failure_count} failed \" f\"and {partial_success_count} were partially converted.\" ) return success_count, partial_success_count, failure_count def main(): logging.basicConfig(level=logging.INFO) # Location of sample PDFs used by this example. If your checkout does not # include test data, change `data_folder` or point `input_doc_paths` to # your own files. 
data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_paths = [ data_folder / \"pdf/2206.01062.pdf\", data_folder / \"pdf/2203.01017v2.pdf\", data_folder / \"pdf/2305.03393v1.pdf\", data_folder / \"pdf/redp5110_sampled.pdf\", ] # buf = BytesIO((data_folder / \"pdf/2206.01062.pdf\").open(\"rb\").read()) # docs = [DocumentStream(name=\"my_doc.pdf\", stream=buf)] # input = DocumentConversionInput.from_streams(docs) # # Turn on inline debug visualizations: # settings.debug.visualize_layout = True # settings.debug.visualize_ocr = True # settings.debug.visualize_tables = True # settings.debug.visualize_cells = True # Configure the PDF pipeline. Enabling page image generation improves HTML # previews (embedded images) but adds processing time. pipeline_options = PdfPipelineOptions() pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, backend=DoclingParseV4DocumentBackend ) } ) start_time = time.time() # Convert all inputs. Set `raises_on_error=False` to keep processing other # files even if one fails; errors are summarized after the run. conv_results = doc_converter.convert_all( input_doc_paths, raises_on_error=False, # to let conversion run through all and examine results at the end ) # Write outputs to ./scratch and log a summary. _success_count, _partial_success_count, failure_count = export_documents( conv_results, output_dir=Path(\"scratch\") ) end_time = time.time() - start_time _log.info(f\"Document conversion complete in {end_time:.2f} seconds.\") if failure_count > 0: raise RuntimeError( f\"The example failed converting {failure_count} on {len(input_doc_paths)}.\" ) if __name__ == \"__main__\": main()"},{"location":"examples/compare_vlm_models/","title":"VLM comparison","text":"Compare different VLM models by running the VLM pipeline and timing outputs.
What this example does
scratch/.Requirements
tabulate for pretty printing (pip install tabulate).Prerequisites
How to run
python docs/examples/compare_vlm_models.py.scratch/ with filenames including the model and framework.Notes
import json\nimport sys\nimport time\nfrom pathlib import Path\n\nfrom docling_core.types.doc import DocItemLabel, ImageRefMode\nfrom docling_core.types.doc.document import DEFAULT_EXPORT_LABELS\nfrom tabulate import tabulate\n\nfrom docling.datamodel import vlm_model_specs\nfrom docling.datamodel.accelerator_options import AcceleratorDevice\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import (\n InferenceFramework,\n InlineVlmOptions,\n ResponseFormat,\n TransformersModelType,\n TransformersPromptStyle,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\n\ndef convert(sources: list[Path], converter: DocumentConverter):\n # Note: this helper assumes a single-item `sources` list. It returns after\n # processing the first source to keep runtime/output focused.\n model_id = pipeline_options.vlm_options.repo_id.replace(\"/\", \"_\")\n framework = pipeline_options.vlm_options.inference_framework\n for source in sources:\n print(\"================================================\")\n print(\"Processing...\")\n print(f\"Source: {source}\")\n print(\"---\")\n print(f\"Model: {model_id}\")\n print(f\"Framework: {framework}\")\n print(\"================================================\")\n print(\"\")\n\n res = converter.convert(source)\n\n print(\"\")\n\n fname = f\"{res.input.file.stem}-{model_id}-{framework}\"\n\n inference_time = 0.0\n for i, page in enumerate(res.pages):\n inference_time += page.predictions.vlm_response.generation_time\n print(\"\")\n print(\n f\" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:\"\n )\n print(page.predictions.vlm_response.text)\n print(\" ---------- \")\n\n print(\"===== Final output of the converted document =======\")\n\n # Manual export for illustration. Below, `save_as_json()` writes the same\n # JSON again; kept intentionally to show both approaches.\n with (out_path / f\"{fname}.json\").open(\"w\") as fp:\n fp.write(json.dumps(res.document.export_to_dict()))\n\n res.document.save_as_json(\n out_path / f\"{fname}.json\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n print(f\" => produced {out_path / fname}.json\")\n\n res.document.save_as_markdown(\n out_path / f\"{fname}.md\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n print(f\" => produced {out_path / fname}.md\")\n\n res.document.save_as_html(\n out_path / f\"{fname}.html\",\n image_mode=ImageRefMode.EMBEDDED,\n labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE],\n split_page_view=True,\n )\n print(f\" => produced {out_path / fname}.html\")\n\n pg_num = res.document.num_pages()\n print(\"\")\n print(\n f\"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}\"\n )\n print(\"====================================================\")\n\n return [\n source,\n model_id,\n str(framework),\n pg_num,\n inference_time,\n ]\n\n\nif __name__ == \"__main__\":\n sources = [\n \"tests/data/pdf/2305.03393v1-pg9.pdf\",\n ]\n\n out_path = Path(\"scratch\")\n out_path.mkdir(parents=True, exist_ok=True)\n\n ## Definiton of more inline models\n llava_qwen = InlineVlmOptions(\n repo_id=\"llava-hf/llava-interleave-qwen-0.5b-hf\",\n # prompt=\"Read text in the image.\",\n prompt=\"Convert this page to markdown. 
Do not miss any text and only output the bare markdown!\",\n # prompt=\"Parse the reading order of this document.\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,\n supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU],\n scale=2.0,\n temperature=0.0,\n )\n\n # Note that this is not the expected way of using the Dolphin model, but it shows the usage of a raw prompt.\n dolphin_oneshot = InlineVlmOptions(\n repo_id=\"ByteDance/Dolphin\",\n prompt=\"<s>Read text in the image. <Answer/>\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,\n transformers_prompt_style=TransformersPromptStyle.RAW,\n supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU],\n scale=2.0,\n temperature=0.0,\n )\n\n ## Use VlmPipeline\n pipeline_options = VlmPipelineOptions()\n pipeline_options.generate_page_images = True\n\n ## On GPU systems, enable flash_attention_2 with CUDA:\n # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA\n # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True\n\n vlm_models = [\n ## DocTags / SmolDocling models\n vlm_model_specs.SMOLDOCLING_MLX,\n vlm_model_specs.SMOLDOCLING_TRANSFORMERS,\n ## Markdown models (using MLX framework)\n vlm_model_specs.QWEN25_VL_3B_MLX,\n vlm_model_specs.PIXTRAL_12B_MLX,\n vlm_model_specs.GEMMA3_12B_MLX,\n ## Markdown models (using Transformers framework)\n vlm_model_specs.GRANITE_VISION_TRANSFORMERS,\n vlm_model_specs.PHI4_TRANSFORMERS,\n vlm_model_specs.PIXTRAL_12B_TRANSFORMERS,\n ## More inline models\n dolphin_oneshot,\n llava_qwen,\n ]\n\n # Remove MLX models if not on Mac\n if sys.platform != \"darwin\":\n vlm_models = [\n m for m in vlm_models if m.inference_framework != InferenceFramework.MLX\n ]\n\n rows = []\n for vlm_options in vlm_models:\n pipeline_options.vlm_options = vlm_options\n\n ## Set up pipeline for PDF or image inputs\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n InputFormat.IMAGE: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n },\n )\n\n row = convert(sources=sources, converter=converter)\n rows.append(row)\n\n print(\n tabulate(\n rows, headers=[\"source\", \"model_id\", \"framework\", \"num_pages\", \"time\"]\n )\n )\n\n print(\"see if memory gets released ...\")\n time.sleep(10)\n import json import sys import time from pathlib import Path from docling_core.types.doc import DocItemLabel, ImageRefMode from docling_core.types.doc.document import DEFAULT_EXPORT_LABELS from tabulate import tabulate from docling.datamodel import vlm_model_specs from docling.datamodel.accelerator_options import AcceleratorDevice from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ( InferenceFramework, InlineVlmOptions, ResponseFormat, TransformersModelType, TransformersPromptStyle, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline def convert(sources: list[Path], converter: DocumentConverter): # Note: this helper assumes a single-item `sources` list. 
It returns after # processing the first source to keep runtime/output focused. model_id = pipeline_options.vlm_options.repo_id.replace(\"/\", \"_\") framework = pipeline_options.vlm_options.inference_framework for source in sources: print(\"================================================\") print(\"Processing...\") print(f\"Source: {source}\") print(\"---\") print(f\"Model: {model_id}\") print(f\"Framework: {framework}\") print(\"================================================\") print(\"\") res = converter.convert(source) print(\"\") fname = f\"{res.input.file.stem}-{model_id}-{framework}\" inference_time = 0.0 for i, page in enumerate(res.pages): inference_time += page.predictions.vlm_response.generation_time print(\"\") print( f\" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:\" ) print(page.predictions.vlm_response.text) print(\" ---------- \") print(\"===== Final output of the converted document =======\") # Manual export for illustration. Below, `save_as_json()` writes the same # JSON again; kept intentionally to show both approaches. with (out_path / f\"{fname}.json\").open(\"w\") as fp: fp.write(json.dumps(res.document.export_to_dict())) res.document.save_as_json( out_path / f\"{fname}.json\", image_mode=ImageRefMode.PLACEHOLDER, ) print(f\" => produced {out_path / fname}.json\") res.document.save_as_markdown( out_path / f\"{fname}.md\", image_mode=ImageRefMode.PLACEHOLDER, ) print(f\" => produced {out_path / fname}.md\") res.document.save_as_html( out_path / f\"{fname}.html\", image_mode=ImageRefMode.EMBEDDED, labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE], split_page_view=True, ) print(f\" => produced {out_path / fname}.html\") pg_num = res.document.num_pages() print(\"\") print( f\"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}\" ) print(\"====================================================\") return [ source, model_id, str(framework), pg_num, inference_time, ] if __name__ == \"__main__\": sources = [ \"tests/data/pdf/2305.03393v1-pg9.pdf\", ] out_path = Path(\"scratch\") out_path.mkdir(parents=True, exist_ok=True) ## Definiton of more inline models llava_qwen = InlineVlmOptions( repo_id=\"llava-hf/llava-interleave-qwen-0.5b-hf\", # prompt=\"Read text in the image.\", prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\", # prompt=\"Parse the reading order of this document.\", response_format=ResponseFormat.MARKDOWN, inference_framework=InferenceFramework.TRANSFORMERS, transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT, supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU], scale=2.0, temperature=0.0, ) # Note that this is not the expected way of using the Dolphin model, but it shows the usage of a raw prompt. dolphin_oneshot = InlineVlmOptions( repo_id=\"ByteDance/Dolphin\", prompt=\"Read text in the image. 
\", response_format=ResponseFormat.MARKDOWN, inference_framework=InferenceFramework.TRANSFORMERS, transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT, transformers_prompt_style=TransformersPromptStyle.RAW, supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU], scale=2.0, temperature=0.0, ) ## Use VlmPipeline pipeline_options = VlmPipelineOptions() pipeline_options.generate_page_images = True ## On GPU systems, enable flash_attention_2 with CUDA: # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True vlm_models = [ ## DocTags / SmolDocling models vlm_model_specs.SMOLDOCLING_MLX, vlm_model_specs.SMOLDOCLING_TRANSFORMERS, ## Markdown models (using MLX framework) vlm_model_specs.QWEN25_VL_3B_MLX, vlm_model_specs.PIXTRAL_12B_MLX, vlm_model_specs.GEMMA3_12B_MLX, ## Markdown models (using Transformers framework) vlm_model_specs.GRANITE_VISION_TRANSFORMERS, vlm_model_specs.PHI4_TRANSFORMERS, vlm_model_specs.PIXTRAL_12B_TRANSFORMERS, ## More inline models dolphin_oneshot, llava_qwen, ] # Remove MLX models if not on Mac if sys.platform != \"darwin\": vlm_models = [ m for m in vlm_models if m.inference_framework != InferenceFramework.MLX ] rows = [] for vlm_options in vlm_models: pipeline_options.vlm_options = vlm_options ## Set up pipeline for PDF or image inputs converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), InputFormat.IMAGE: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), }, ) row = convert(sources=sources, converter=converter) rows.append(row) print( tabulate( rows, headers=[\"source\", \"model_id\", \"framework\", \"num_pages\", \"time\"] ) ) print(\"see if memory gets released ...\") time.sleep(10)"},{"location":"examples/custom_convert/","title":"Custom conversion","text":"Customize PDF conversion by toggling OCR/backends and pipeline options.
What this example does
scratch/.Prerequisites
docling from your Python environment.How to run
python docs/examples/custom_convert.py.scratch/ next to where you run the script.Choosing a configuration
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackendfrom docling.datamodel.pipeline_options import TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptionsInput document
tests/data/pdf/ in the repo.input_doc_path to a local PDF.Notes
pipeline_options.ocr_options.lang (e.g., [\"en\"], [\"es\"], [\"en\", \"de\"]).AcceleratorOptions to select CPU/GPU or threads.scratch/.import json\nimport logging\nimport time\nfrom pathlib import Path\n\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n ###########################################################################\n\n # The sections below demo combinations of PdfPipelineOptions and backends.\n # Tip: Uncomment exactly one section at a time to compare outputs.\n\n # PyPdfium without EasyOCR\n # --------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = False\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = False\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(\n # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend\n # )\n # }\n # )\n\n # PyPdfium with EasyOCR\n # -----------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(\n # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend\n # )\n # }\n # )\n\n # Docling Parse without EasyOCR\n # -------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = False\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with EasyOCR (default)\n # -------------------------------\n # Enables OCR and table structure with EasyOCR, using automatic device\n # selection via AcceleratorOptions. 
Adjust languages as needed.\n pipeline_options = PdfPipelineOptions()\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n pipeline_options.ocr_options.lang = [\"es\"]\n pipeline_options.accelerator_options = AcceleratorOptions(\n num_threads=4, device=AcceleratorDevice.AUTO\n )\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n # Docling Parse with EasyOCR (CPU only)\n # -------------------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.ocr_options.use_gpu = False # <-- set this.\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with Tesseract\n # ----------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = TesseractOcrOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with Tesseract CLI\n # --------------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = TesseractCliOcrOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with ocrmac (macOS only)\n # --------------------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = OcrMacOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n ###########################################################################\n\n start_time = time.time()\n conv_result = doc_converter.convert(input_doc_path)\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted in {end_time:.2f} seconds.\")\n\n ## Export results\n output_dir = Path(\"scratch\")\n output_dir.mkdir(parents=True, exist_ok=True)\n doc_filename = conv_result.input.file.stem\n\n # Export Docling document JSON format:\n with (output_dir / f\"{doc_filename}.json\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(json.dumps(conv_result.document.export_to_dict()))\n\n # Export Text format (plain text via Markdown export):\n with (output_dir / f\"{doc_filename}.txt\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_markdown(strict_text=True))\n\n # Export Markdown format:\n with (output_dir / f\"{doc_filename}.md\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_markdown())\n\n # Export Document Tags format:\n with (output_dir / f\"{doc_filename}.doctags\").open(\"w\", encoding=\"utf-8\") as fp:\n 
fp.write(conv_result.document.export_to_doctags())\n\n\nif __name__ == \"__main__\":\n main()\n import json import logging import time from pathlib import Path from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" ########################################################################### # The sections below demo combinations of PdfPipelineOptions and backends. # Tip: Uncomment exactly one section at a time to compare outputs. # PyPdfium without EasyOCR # -------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = False # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = False # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption( # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend # ) # } # ) # PyPdfium with EasyOCR # ----------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption( # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend # ) # } # ) # Docling Parse without EasyOCR # ------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = False # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with EasyOCR (default) # ------------------------------- # Enables OCR and table structure with EasyOCR, using automatic device # selection via AcceleratorOptions. Adjust languages as needed. pipeline_options = PdfPipelineOptions() pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True pipeline_options.ocr_options.lang = [\"es\"] pipeline_options.accelerator_options = AcceleratorOptions( num_threads=4, device=AcceleratorDevice.AUTO ) doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) # Docling Parse with EasyOCR (CPU only) # ------------------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.ocr_options.use_gpu = False # <-- set this. 
# pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with Tesseract # ---------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = TesseractOcrOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with Tesseract CLI # -------------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = TesseractCliOcrOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with ocrmac (macOS only) # -------------------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = OcrMacOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) ########################################################################### start_time = time.time() conv_result = doc_converter.convert(input_doc_path) end_time = time.time() - start_time _log.info(f\"Document converted in {end_time:.2f} seconds.\") ## Export results output_dir = Path(\"scratch\") output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_result.input.file.stem # Export Docling document JSON format: with (output_dir / f\"{doc_filename}.json\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(json.dumps(conv_result.document.export_to_dict())) # Export Text format (plain text via Markdown export): with (output_dir / f\"{doc_filename}.txt\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_markdown(strict_text=True)) # Export Markdown format: with (output_dir / f\"{doc_filename}.md\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_markdown()) # Export Document Tags format: with (output_dir / f\"{doc_filename}.doctags\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_doctags()) if __name__ == \"__main__\": main()"},{"location":"examples/demo_layout_vlm/","title":"Demo layout vlm","text":"In\u00a0[\u00a0]: Copied! \"\"\"Demo script for the new ThreadedLayoutVlmPipeline.\n\nThis script demonstrates the usage of the experimental ThreadedLayoutVlmPipeline pipeline\nthat combines layout model preprocessing with VLM processing in a threaded manner.\n\"\"\"\n\"\"\"Demo script for the new ThreadedLayoutVlmPipeline. This script demonstrates the usage of the experimental ThreadedLayoutVlmPipeline pipeline that combines layout model preprocessing with VLM processing in a threaded manner. \"\"\" In\u00a0[\u00a0]: Copied!
import argparse\nimport logging\nimport traceback\nfrom pathlib import Path\nimport argparse import logging import traceback from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.datamodel.vlm_model_specs import GRANITEDOCLING_TRANSFORMERS\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.experimental.datamodel.threaded_layout_vlm_pipeline_options import (\n ThreadedLayoutVlmPipelineOptions,\n)\nfrom docling.experimental.pipeline.threaded_layout_vlm_pipeline import (\n ThreadedLayoutVlmPipeline,\n)\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.datamodel.vlm_model_specs import GRANITEDOCLING_TRANSFORMERS from docling.document_converter import DocumentConverter, PdfFormatOption from docling.experimental.datamodel.threaded_layout_vlm_pipeline_options import ( ThreadedLayoutVlmPipelineOptions, ) from docling.experimental.pipeline.threaded_layout_vlm_pipeline import ( ThreadedLayoutVlmPipeline, ) In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def _parse_args():\n parser = argparse.ArgumentParser(\n description=\"Demo script for the experimental ThreadedLayoutVlmPipeline\"\n )\n parser.add_argument(\n \"--input-file\",\n type=str,\n default=\"tests/data/pdf/code_and_formula.pdf\",\n help=\"Path to a PDF file\",\n )\n parser.add_argument(\n \"--output-dir\",\n type=str,\n default=\"scratch/demo_layout_vlm/\",\n help=\"Output directory for converted files\",\n )\n return parser.parse_args()\ndef _parse_args(): parser = argparse.ArgumentParser( description=\"Demo script for the experimental ThreadedLayoutVlmPipeline\" ) parser.add_argument( \"--input-file\", type=str, default=\"tests/data/pdf/code_and_formula.pdf\", help=\"Path to a PDF file\", ) parser.add_argument( \"--output-dir\", type=str, default=\"scratch/demo_layout_vlm/\", help=\"Output directory for converted files\", ) return parser.parse_args()
Can be used to read multiple PDF files under a folder: from io import BytesIO\n\nfrom docling.datamodel.base_models import DocumentStream\n\n\ndef _get_docs(input_doc_path):\n    \"\"\"Yield DocumentStream objects from a list of input document paths.\"\"\"\n    for path in input_doc_path:\n        buf = BytesIO(path.read_bytes())\n        stream = DocumentStream(name=path.name, stream=buf)\n        yield stream\n
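A possible usage sketch: the streams yielded by _get_docs can be fed to DocumentConverter.convert_all. The input folder path and the converter variable below are placeholders standing in for your own setup, not values taken from this script. from pathlib import Path\n\n# Hypothetical input folder with several PDFs; adjust to your data.\npdf_paths = sorted(Path(\"tests/data/pdf\").glob(\"*.pdf\"))\n\n# `converter` is assumed to be a DocumentConverter configured as elsewhere in this script.\nfor result in converter.convert_all(_get_docs(pdf_paths), raises_on_error=False):\n    print(result.input.file.name, result.status)\n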
In\u00a0[\u00a0]: Copied!def openai_compatible_vlm_options(\n model: str,\n prompt: str,\n format: ResponseFormat,\n hostname_and_port,\n temperature: float = 0.7,\n max_tokens: int = 4096,\n api_key: str = \"\",\n skip_special_tokens=False,\n):\n headers = {}\n if api_key:\n headers[\"Authorization\"] = f\"Bearer {api_key}\"\n\n options = ApiVlmOptions(\n url=f\"http://{hostname_and_port}/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000\n params=dict(\n model=model,\n max_tokens=max_tokens,\n skip_special_tokens=skip_special_tokens, # needed for VLLM\n ),\n headers=headers,\n prompt=prompt,\n timeout=90,\n scale=2.0,\n temperature=temperature,\n response_format=format,\n )\n\n return options\n def openai_compatible_vlm_options( model: str, prompt: str, format: ResponseFormat, hostname_and_port, temperature: float = 0.7, max_tokens: int = 4096, api_key: str = \"\", skip_special_tokens=False, ): headers = {} if api_key: headers[\"Authorization\"] = f\"Bearer {api_key}\" options = ApiVlmOptions( url=f\"http://{hostname_and_port}/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000 params=dict( model=model, max_tokens=max_tokens, skip_special_tokens=skip_special_tokens, # needed for VLLM ), headers=headers, prompt=prompt, timeout=90, scale=2.0, temperature=temperature, response_format=format, ) return options In\u00a0[\u00a0]: Copied! def demo_threaded_layout_vlm_pipeline(\n input_doc_path: Path, out_dir_layout_aware: Path, use_api_vlm: bool\n):\n \"\"\"Demonstrate the threaded layout+VLM pipeline.\"\"\"\n\n vlm_options = GRANITEDOCLING_TRANSFORMERS.model_copy()\n\n if use_api_vlm:\n vlm_options = openai_compatible_vlm_options(\n model=\"granite-docling-258m-mlx\", # For VLLM use \"ibm-granite/granite-docling-258M\"\n hostname_and_port=\"localhost:1234\", # LM studio defaults to port 1234, VLLM to 8000\n prompt=\"Convert this page to docling.\",\n format=ResponseFormat.DOCTAGS,\n api_key=\"\",\n )\n vlm_options.track_input_prompt = True\n\n # Configure pipeline options\n print(\"Configuring pipeline options...\")\n pipeline_options_layout_aware = ThreadedLayoutVlmPipelineOptions(\n # VLM configuration - defaults to GRANITEDOCLING_TRANSFORMERS\n vlm_options=vlm_options,\n # Layout configuration - defaults to DOCLING_LAYOUT_HERON\n # Batch sizes for parallel processing\n layout_batch_size=2,\n vlm_batch_size=1,\n # Queue configuration\n queue_max_size=10,\n # Image processing\n images_scale=vlm_options.scale,\n generate_page_images=True,\n enable_remote_services=use_api_vlm,\n )\n\n # Create converter with the new pipeline\n print(\"Initializing DocumentConverter (this may take a while - loading models)...\")\n doc_converter_layout_enhanced = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ThreadedLayoutVlmPipeline,\n pipeline_options=pipeline_options_layout_aware,\n )\n }\n )\n\n result_layout_aware = doc_converter_layout_enhanced.convert(\n source=input_doc_path, raises_on_error=False\n )\n\n if result_layout_aware.status == ConversionStatus.FAILURE:\n _log.error(f\"Conversion failed: {result_layout_aware.status}\")\n\n doc_filename = result_layout_aware.input.file.stem\n result_layout_aware.document.save_as_json(\n out_dir_layout_aware / f\"{doc_filename}.json\"\n )\n\n result_layout_aware.document.save_as_html(\n out_dir_layout_aware / f\"{doc_filename}.html\", split_page_view=True\n )\n for page in result_layout_aware.pages:\n _log.info(\"Page %s of VLM response:\", page.page_no)\n if 
page.predictions.vlm_response:\n _log.info(page.predictions.vlm_response)\n def demo_threaded_layout_vlm_pipeline( input_doc_path: Path, out_dir_layout_aware: Path, use_api_vlm: bool ): \"\"\"Demonstrate the threaded layout+VLM pipeline.\"\"\" vlm_options = GRANITEDOCLING_TRANSFORMERS.model_copy() if use_api_vlm: vlm_options = openai_compatible_vlm_options( model=\"granite-docling-258m-mlx\", # For VLLM use \"ibm-granite/granite-docling-258M\" hostname_and_port=\"localhost:1234\", # LM studio defaults to port 1234, VLLM to 8000 prompt=\"Convert this page to docling.\", format=ResponseFormat.DOCTAGS, api_key=\"\", ) vlm_options.track_input_prompt = True # Configure pipeline options print(\"Configuring pipeline options...\") pipeline_options_layout_aware = ThreadedLayoutVlmPipelineOptions( # VLM configuration - defaults to GRANITEDOCLING_TRANSFORMERS vlm_options=vlm_options, # Layout configuration - defaults to DOCLING_LAYOUT_HERON # Batch sizes for parallel processing layout_batch_size=2, vlm_batch_size=1, # Queue configuration queue_max_size=10, # Image processing images_scale=vlm_options.scale, generate_page_images=True, enable_remote_services=use_api_vlm, ) # Create converter with the new pipeline print(\"Initializing DocumentConverter (this may take a while - loading models)...\") doc_converter_layout_enhanced = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ThreadedLayoutVlmPipeline, pipeline_options=pipeline_options_layout_aware, ) } ) result_layout_aware = doc_converter_layout_enhanced.convert( source=input_doc_path, raises_on_error=False ) if result_layout_aware.status == ConversionStatus.FAILURE: _log.error(f\"Conversion failed: {result_layout_aware.status}\") doc_filename = result_layout_aware.input.file.stem result_layout_aware.document.save_as_json( out_dir_layout_aware / f\"{doc_filename}.json\" ) result_layout_aware.document.save_as_html( out_dir_layout_aware / f\"{doc_filename}.html\", split_page_view=True ) for page in result_layout_aware.pages: _log.info(\"Page %s of VLM response:\", page.page_no) if page.predictions.vlm_response: _log.info(page.predictions.vlm_response) In\u00a0[\u00a0]: Copied! 
if __name__ == \"__main__\":\n logging.basicConfig(level=logging.INFO)\n try:\n args = _parse_args()\n _log.info(\n f\"Parsed arguments: input={args.input_file}, output={args.output_dir}\"\n )\n\n input_path = Path(args.input_file)\n\n if not input_path.exists():\n raise FileNotFoundError(f\"Input file does not exist: {input_path}\")\n\n if input_path.suffix.lower() != \".pdf\":\n raise ValueError(f\"Input file must be a PDF: {input_path}\")\n\n out_dir_layout_aware = Path(args.output_dir) / \"layout_aware/\"\n out_dir_layout_aware.mkdir(parents=True, exist_ok=True)\n\n use_api_vlm = False # Set to False to use inline VLM model\n\n demo_threaded_layout_vlm_pipeline(input_path, out_dir_layout_aware, use_api_vlm)\n except Exception:\n traceback.print_exc()\n raise\n if __name__ == \"__main__\": logging.basicConfig(level=logging.INFO) try: args = _parse_args() _log.info( f\"Parsed arguments: input={args.input_file}, output={args.output_dir}\" ) input_path = Path(args.input_file) if not input_path.exists(): raise FileNotFoundError(f\"Input file does not exist: {input_path}\") if input_path.suffix.lower() != \".pdf\": raise ValueError(f\"Input file must be a PDF: {input_path}\") out_dir_layout_aware = Path(args.output_dir) / \"layout_aware/\" out_dir_layout_aware.mkdir(parents=True, exist_ok=True) use_api_vlm = False # Set to False to use inline VLM model demo_threaded_layout_vlm_pipeline(input_path, out_dir_layout_aware, use_api_vlm) except Exception: traceback.print_exc() raise"},{"location":"examples/develop_formula_understanding/","title":"Formula enrichment","text":"Developing an enrichment model example (formula understanding: scaffold only).
What this example does
Important
How to run
python docs/examples/develop_formula_understanding.py.Notes
do_formula_understanding=True to enable the example enrichment stage.StandardPdfPipeline and keeps the backend when enrichment is enabled.import logging\nfrom collections.abc import Iterable\nfrom pathlib import Path\n\nfrom docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem\n\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.base_model import BaseItemAndImageEnrichmentModel\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n\n\nclass ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions):\n do_formula_understanding: bool = True\n\n\n# A new enrichment model using both the document element and its image as input\nclass ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel):\n images_scale = 2.6\n\n def __init__(self, enabled: bool):\n self.enabled = enabled\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return (\n self.enabled\n and isinstance(element, TextItem)\n and element.label == DocItemLabel.FORMULA\n )\n\n def __call__(\n self,\n doc: DoclingDocument,\n element_batch: Iterable[ItemAndImageEnrichmentElement],\n ) -> Iterable[NodeItem]:\n if not self.enabled:\n return\n\n for enrich_element in element_batch:\n # Opens a window for each cropped formula image; comment this out when\n # running headless or processing many items to avoid blocking spam.\n enrich_element.image.show()\n\n yield enrich_element.item\n\n\n# How the pipeline can be extended.\nclass ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):\n def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions\n\n self.enrichment_pipe = [\n ExampleFormulaUnderstandingEnrichmentModel(\n enabled=self.pipeline_options.do_formula_understanding\n )\n ]\n\n if self.pipeline_options.do_formula_understanding:\n self.keep_backend = True\n\n @classmethod\n def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions:\n return ExampleFormulaUnderstandingPipelineOptions()\n\n\n# Example main. 
In the final version, we simply have to set do_formula_understanding to true.\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2203.01017v2.pdf\"\n\n pipeline_options = ExampleFormulaUnderstandingPipelineOptions()\n pipeline_options.do_formula_understanding = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ExampleFormulaUnderstandingPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n doc_converter.convert(input_doc_path)\n\n\nif __name__ == \"__main__\":\n main()\n import logging from collections.abc import Iterable from pathlib import Path from docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.base_model import BaseItemAndImageEnrichmentModel from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline class ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions): do_formula_understanding: bool = True # A new enrichment model using both the document element and its image as input class ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel): images_scale = 2.6 def __init__(self, enabled: bool): self.enabled = enabled def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return ( self.enabled and isinstance(element, TextItem) and element.label == DocItemLabel.FORMULA ) def __call__( self, doc: DoclingDocument, element_batch: Iterable[ItemAndImageEnrichmentElement], ) -> Iterable[NodeItem]: if not self.enabled: return for enrich_element in element_batch: # Opens a window for each cropped formula image; comment this out when # running headless or processing many items to avoid blocking spam. enrich_element.image.show() yield enrich_element.item # How the pipeline can be extended. class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline): def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions self.enrichment_pipe = [ ExampleFormulaUnderstandingEnrichmentModel( enabled=self.pipeline_options.do_formula_understanding ) ] if self.pipeline_options.do_formula_understanding: self.keep_backend = True @classmethod def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions: return ExampleFormulaUnderstandingPipelineOptions() # Example main. In the final version, we simply have to set do_formula_understanding to true. def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2203.01017v2.pdf\" pipeline_options = ExampleFormulaUnderstandingPipelineOptions() pipeline_options.do_formula_understanding = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ExampleFormulaUnderstandingPipeline, pipeline_options=pipeline_options, ) } ) doc_converter.convert(input_doc_path) if __name__ == \"__main__\": main()"},{"location":"examples/develop_picture_enrichment/","title":"Figure enrichment","text":"Developing a picture enrichment model (classifier scaffold only).
What this example does
Important
How to run
python docs/examples/develop_picture_enrichment.py.Notes
images_scale to improve crops.StandardPdfPipeline with a custom enrichment stage.import logging\nfrom collections.abc import Iterable\nfrom pathlib import Path\nfrom typing import Any\n\nfrom docling_core.types.doc import (\n DoclingDocument,\n NodeItem,\n PictureClassificationClass,\n PictureClassificationData,\n PictureItem,\n)\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.base_model import BaseEnrichmentModel\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n\n\nclass ExamplePictureClassifierPipelineOptions(PdfPipelineOptions):\n do_picture_classifer: bool = True\n\n\nclass ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel):\n def __init__(self, enabled: bool):\n self.enabled = enabled\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return self.enabled and isinstance(element, PictureItem)\n\n def __call__(\n self, doc: DoclingDocument, element_batch: Iterable[NodeItem]\n ) -> Iterable[Any]:\n if not self.enabled:\n return\n\n for element in element_batch:\n assert isinstance(element, PictureItem)\n\n # uncomment this to interactively visualize the image\n # element.get_image(doc).show() # may block; avoid in headless runs\n\n element.annotations.append(\n PictureClassificationData(\n provenance=\"example_classifier-0.0.1\",\n predicted_classes=[\n PictureClassificationClass(class_name=\"dummy\", confidence=0.42)\n ],\n )\n )\n\n yield element\n\n\nclass ExamplePictureClassifierPipeline(StandardPdfPipeline):\n def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: ExamplePictureClassifierPipeline\n\n self.enrichment_pipe = [\n ExamplePictureClassifierEnrichmentModel(\n enabled=pipeline_options.do_picture_classifer\n )\n ]\n\n @classmethod\n def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions:\n return ExamplePictureClassifierPipelineOptions()\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = ExamplePictureClassifierPipelineOptions()\n pipeline_options.images_scale = 2.0\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ExamplePictureClassifierPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n\n for element, _level in result.document.iterate_items():\n if isinstance(element, PictureItem):\n print(\n f\"The model populated the `data` portion of picture {element.self_ref}:\\n{element.annotations}\"\n )\n\n\nif __name__ == \"__main__\":\n main()\n import logging from collections.abc import Iterable from pathlib import Path from typing import Any from docling_core.types.doc import ( DoclingDocument, NodeItem, PictureClassificationClass, PictureClassificationData, PictureItem, ) from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.base_model import BaseEnrichmentModel from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline class 
ExamplePictureClassifierPipelineOptions(PdfPipelineOptions): do_picture_classifer: bool = True class ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel): def __init__(self, enabled: bool): self.enabled = enabled def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return self.enabled and isinstance(element, PictureItem) def __call__( self, doc: DoclingDocument, element_batch: Iterable[NodeItem] ) -> Iterable[Any]: if not self.enabled: return for element in element_batch: assert isinstance(element, PictureItem) # uncomment this to interactively visualize the image # element.get_image(doc).show() # may block; avoid in headless runs element.annotations.append( PictureClassificationData( provenance=\"example_classifier-0.0.1\", predicted_classes=[ PictureClassificationClass(class_name=\"dummy\", confidence=0.42) ], ) ) yield element class ExamplePictureClassifierPipeline(StandardPdfPipeline): def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: ExamplePictureClassifierPipeline self.enrichment_pipe = [ ExamplePictureClassifierEnrichmentModel( enabled=pipeline_options.do_picture_classifer ) ] @classmethod def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions: return ExamplePictureClassifierPipelineOptions() def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = ExamplePictureClassifierPipelineOptions() pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ExamplePictureClassifierPipeline, pipeline_options=pipeline_options, ) } ) result = doc_converter.convert(input_doc_path) for element, _level in result.document.iterate_items(): if isinstance(element, PictureItem): print( f\"The model populated the `data` portion of picture {element.self_ref}:\\n{element.annotations}\" ) if __name__ == \"__main__\": main()"},{"location":"examples/dpk-ingest-chunk-tokenize/","title":"Chunking & tokenization with Data Prep Kit","text":"In\u00a0[\u00a0]: Copied! %%capture\n%pip install \"data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]\"\n%pip install pandas\n%pip install \"numpy<2.0\"\nfrom dotenv import load_dotenv\n\nload_dotenv(\".env\", override=True)\n%%capture %pip install \"data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]\" %pip install pandas %pip install \"numpy<2.0\" from dotenv import load_dotenv load_dotenv(\".env\", override=True)
We will define and use a utility function for downloading the articles and saving them to the local disk:
load_corpus: Uses an HTTP request with the Wikimedia API token to connect to a Wikimedia endpoint and retrieve the HTML articles that will be used as a seed for our LLM application. The articles are then saved to a local cache folder for further processing.
In\u00a0[\u00a0]: Copied!def load_corpus(articles: list, folder: str) -> int:\n import os\n import re\n\n import requests\n\n headers = {\"Authorization\": f\"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}\"}\n count = 0\n for article in articles:\n try:\n endpoint = f\"https://api.enterprise.wikimedia.com/v2/articles/{article}\"\n response = requests.get(endpoint, headers=headers)\n response.raise_for_status()\n doc = response.json()\n for article in doc:\n filename = re.sub(r\"[^a-zA-Z0-9_]\", \"_\", article[\"name\"])\n with open(f\"{folder}/{filename}.html\", \"w\") as f:\n f.write(article[\"article_body\"][\"html\"])\n count = count + 1\n except Exception as e:\n print(f\"Failed to retrieve content: {e}\")\n return count\n def load_corpus(articles: list, folder: str) -> int: import os import re import requests headers = {\"Authorization\": f\"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}\"} count = 0 for article in articles: try: endpoint = f\"https://api.enterprise.wikimedia.com/v2/articles/{article}\" response = requests.get(endpoint, headers=headers) response.raise_for_status() doc = response.json() for article in doc: filename = re.sub(r\"[^a-zA-Z0-9_]\", \"_\", article[\"name\"]) with open(f\"{folder}/{filename}.html\", \"w\") as f: f.write(article[\"article_body\"][\"html\"]) count = count + 1 except Exception as e: print(f\"Failed to retrieve content: {e}\") return count In\u00a0[\u00a0]: Copied! import os\nimport tempfile\n\ndatafolder = tempfile.mkdtemp(dir=os.getcwd())\narticles = [\"Science,_technology,_engineering,_and_mathematics\"]\nassert load_corpus(articles, datafolder) > 0, \"Faild to download any documents\"\nimport os import tempfile datafolder = tempfile.mkdtemp(dir=os.getcwd()) articles = [\"Science,_technology,_engineering,_and_mathematics\"] assert load_corpus(articles, datafolder) > 0, \"Faild to download any documents\" In\u00a0[\u00a0]: Copied!
%%capture\nfrom dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types\n\nresult = Docling2Parquet(\n input_folder=datafolder,\n output_folder=f\"{datafolder}/docling2parquet\",\n data_files_to_use=[\".html\"],\n docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN, # markdown\n).transform()\n %%capture from dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types result = Docling2Parquet( input_folder=datafolder, output_folder=f\"{datafolder}/docling2parquet\", data_files_to_use=[\".html\"], docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN, # markdown ).transform() In\u00a0[\u00a0]: Copied! %%capture\nfrom dpk_doc_chunk import DocChunk\n\nresult = DocChunk(\n input_folder=f\"{datafolder}/docling2parquet\",\n output_folder=f\"{datafolder}/doc_chunk\",\n doc_chunk_chunking_type=\"li_markdown\",\n doc_chunk_chunk_size_tokens=128, # default 128\n doc_chunk_chunk_overlap_tokens=30, # default 30\n).transform()\n %%capture from dpk_doc_chunk import DocChunk result = DocChunk( input_folder=f\"{datafolder}/docling2parquet\", output_folder=f\"{datafolder}/doc_chunk\", doc_chunk_chunking_type=\"li_markdown\", doc_chunk_chunk_size_tokens=128, # default 128 doc_chunk_chunk_overlap_tokens=30, # default 30 ).transform() In\u00a0[\u00a0]: Copied! %%capture\nfrom dpk_tokenization import Tokenization\n\nTokenization(\n input_folder=f\"{datafolder}/doc_chunk\",\n output_folder=f\"{datafolder}/tkn\",\n tkn_tokenizer=\"hf-internal-testing/llama-tokenizer\",\n tkn_chunk_size=20_000,\n).transform()\n %%capture from dpk_tokenization import Tokenization Tokenization( input_folder=f\"{datafolder}/doc_chunk\", output_folder=f\"{datafolder}/tkn\", tkn_tokenizer=\"hf-internal-testing/llama-tokenizer\", tkn_chunk_size=20_000, ).transform() In\u00a0[\u00a0]: Copied! from pathlib import Path\n\nimport pandas as pd\n\nparquet_files = list(Path(f\"{datafolder}/tkn/\").glob(\"*.parquet\"))\npd.concat(pd.read_parquet(file) for file in parquet_files)\n from pathlib import Path import pandas as pd parquet_files = list(Path(f\"{datafolder}/tkn/\").glob(\"*.parquet\")) pd.concat(pd.read_parquet(file) for file in parquet_files) Out[\u00a0]: tokens document_id document_length token_count 0 [1, 444, 11814, 262, 3002] f1f5b56a78829ab2165b3bbeb94b1167e4c5583c437f1d... 14 5 1 [1, 835, 5298, 13, 13, 797, 278, 4688, 29871, ... 402e82a9e81cc3d2494fac36bebf8bf1a2662800e5a00c... 2100 655 2 [1, 835, 5901, 21833, 13, 13, 29899, 321, 1254... 4fb389d0f0e999c2496f137b4a7c0671e79c09cf9477e9... 2833 968 3 [1, 444, 26304, 4978, 13, 13, 14136, 1967, 666... 3709997548d84224361a6835760b5ae48a1637e78d54a0... 1496 483 4 [1, 444, 2648, 4234] 1e1a58ad5664d963bc207dc791825258c33337c2559f6a... 13 4 5 [1, 835, 8314, 13, 13, 1576, 9870, 315, 1038, ... 83a63864e5ddfdd41ef0f813fb7aa3c95e04c029c32ab3... 1340 442 6 [1, 835, 7400, 13, 13, 6028, 1114, 27871, 2987... 5e29fb4e4cf37ed4c49994620e4a00da9693bc061e82c1... 1800 548 7 [1, 835, 7551, 13, 13, 25411, 3762, 8950, 6020... 3fc34013d93391a7504e84069190479fbc85ba7e7072cb... 1784 511 8 [1, 835, 4092, 13, 13, 13393, 884, 29901, 518,... e8b28e20e3fc3da40b6b368e30f9c953f5218370ec2f7a... 774 229 9 [1, 3191, 18312, 13, 13, 1576, 365, 29965, 152... 94b54fbda274536622f70442b18126f554610e8915b235... 1076 263 10 [1, 3191, 3444, 13, 13, 1576, 1024, 310, 317, ... fef9b66567944df131851834e2fdfb42b5c668e4b08031... 238 60 11 [1, 835, 12798, 12026, 13, 13, 1254, 12665, 97... eeb74ae3490539aa07f25987b6b2666dc907b39147e810... 
366 97 12 [1, 835, 7513, 13, 13, 19302, 284, 2879, 515, ... cc2ccd2e9f4d0a8224716109f7a6e7b30f33ff1f8c7adf... 1395 402 13 [1, 835, 20537, 423, 13, 13, 797, 20537, 423, ... baf13788a018da24d86b630a9032eaeee54913bbbdd0d4... 511 137 14 [1, 835, 21215, 13, 13, 1254, 12665, 17800, 52... a5b3973ab3a98d10f4ae07a004d70c6cdcfacb41fda8d7... 1949 536 15 [1, 835, 26260, 13, 13, 797, 278, 518, 4819, 2... dfa35b16704a4dd549701a7821b6aa856f2dd5e5b69daf... 1042 291 16 [1, 835, 660, 14873, 13, 13, 797, 518, 29984, ... a0809b265e4a011407d38cd06c7b3ce5932683a2f9c6af... 852 282 17 [1, 835, 25960, 13, 13, 1254, 12665, 338, 760,... 85e8f3b2af3268d49e60451d3ac87b3bd281a70cf6c4b7... 1165 285 18 [1, 835, 498, 26517, 13, 13, 797, 29871, 29906... 15c924efdbf0135de91a095237cbe831275bab67ee1371... 1612 397 19 [1, 835, 26459, 13, 13, 29911, 29641, 728, 317... b473b50753dd07f08da05bbf776c57747ab85ba79cb081... 435 145 20 [1, 835, 3303, 3900, 13, 13, 797, 278, 3303, 3... 841cefc910bd5d1920187b23554ee67e0e65563373e6de... 1212 344 21 [1, 3191, 3086, 9327, 10606, 13, 13, 14804, 25... 63924939eab38ad6636495f1c5c13760014efe42b330a6... 1592 416 22 [1, 3191, 1954, 29885, 16783, 8898, 13, 13, 24... 44288e766c343592a44f3da59ad3b57a9f26096ac13412... 1653 465 23 [1, 3191, 13151, 13, 13, 13393, 884, 29901, 51... 40a0f6e213901d92f1a158c3e2a55ad2558eb1deaa973f... 4418 1285 24 [1, 3191, 6981, 1455, 17261, 297, 317, 4330, 2... 5cc92a05d39ee56e9c65cdb00f55bc9dcbe8bc1647a442... 1289 375 25 [1, 3191, 402, 1581, 330, 2547, 297, 317, 4330... 37c88bed7898d9a7406b5b0e4b1ccfaca65a732dff0c03... 821 280 26 [1, 3191, 4124, 2042, 2877, 297, 317, 4330, 29... f144b97af462b2ab8aba5cb6d9cba0cf5f383cc710aba0... 1093 297 27 [1, 3191, 3082, 24620, 277, 20193, 512, 4812, ... 16525e2054a7bb7543308ad4e6642bf60e66dc475a0e0a... 2203 538 28 [1, 3191, 317, 4330, 29924, 13151, 3189, 284, ... ebb319391e1bda81edd5ec214887150044c15cfc04f42f... 514 149 29 [1, 3191, 2522, 449, 292, 13, 13, 797, 29871, ... 882582d1f6202a4e495f67952d3a27929177745b1f575e... 850 261 30 [1, 3191, 10317, 310, 5282, 1947, 11104, 13, 1... 311aa5c91354b6bf575682be701981ccc6569eb35fd726... 1561 416 31 [1, 3191, 24206, 13, 13, 1254, 12665, 23992, 2... abaa73aba997ea267d9b556679c5d680810ee5baa231fa... 384 139 32 [1, 3191, 18991, 362, 13, 13, 1576, 518, 29048... 00f85d6dffd914d89eb44dbb4caa3a1c6b2af47f5c4c96... 878 247 33 [1, 3191, 17163, 29879, 13, 13, 797, 3979, 298... f8d901fca6dcac6c266cf2799da814c5f5b5644c3b9476... 2321 682 34 [1, 3191, 3599, 296, 6728, 13, 13, 7504, 3278,... 8347c4988e3acde4723696fbf63a0f2c13d61e92c8fbac... 2960 841 35 [1, 3191, 28488, 322, 11104, 304, 1371, 2693, ... c3d0c80c861ffcd422f60b78d693bb953b69dfc3c3d55f... 222 81 36 [1, 835, 18444, 13, 13, 797, 18444, 29892, 676... 9c41677100393c4e5e3bc4bc36caee5561cb5c93546aaf... 1143 288 37 [1, 444, 10152, 13, 13, 6330, 7456, 29901, 518... 83f0f668bac5736d5f23f750f86ebbe173c0a56e3c51b8... 2777 833 38 [1, 444, 365, 7210, 29911, 29984, 29974, 13, 1... 24bbfff971979686cd41132b491060bdaaf357bd3bc7cf... 2579 847 39 [1, 444, 15976, 293, 1608, 13, 13, 1576, 8569,... 1b8c147d642e4d53152e1be73223ed58e0788700d82c73... 4700 1299 40 [1, 444, 2823, 884, 13, 13, 29899, 518, 29907,... ac3fb4073323718ea3e32e006ed67c298af9801c4a03dd... 1310 443 41 [1, 444, 28318, 13, 13, 29896, 29889, 518, 298... 2dad03b0e2b81c47012f94be0ab730e9c8341f0311c59e... 59373 26470 42 [1, 444, 8725, 5183, 13, 13, 29899, 4699, 1522... 07dabd1b5cfa6f8c70f97eb33c3a19189a866eae1203c7... 2648 1075 43 [1, 444, 3985, 2988, 13, 13, 29899, 8213, 4475... 
ef8cc66ae18d7238680d07372859c5be061d57b955cf7d... 5025 705 In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/dpk-ingest-chunk-tokenize/#chunking-tokenization-with-data-prep-kit","title":"Chunking & tokenization with Data Prep Kit\u00b6","text":"
This notebook demonstrates how to build a sequence of DPK transforms for ingesting HTML documents using the Docling2Parquet transform and chunking them using the Doc_Chunk transform. Both transforms are based on the Docling library.
In this example, we will use the Wikimedia API to retrieve the HTML articles that will be used as a seed for our LLM application. Once the articles are loaded into a local cache, we will construct and invoke the sequence of transforms to ingest the content and produce the embeddings for the chunked content.
"},{"location":"examples/dpk-ingest-chunk-tokenize/#why-dpk-pipelines","title":"\ud83d\udd0d Why DPK Pipelines\u00b6","text":"DPK transform pipelines are intended to simplify how any number of transforms can be executed in a sequence to ingest, annotate, filter and create embedding used for LLM post-training and RAG applications.
"},{"location":"examples/dpk-ingest-chunk-tokenize/#key-transforms-in-this-recipe","title":"\ud83e\uddf0 Key Transforms in This Recipe\u00b6","text":"We will use the following transforms from DPK:
Docling2Parquet: Ingest one or more HTML documents and turn them into a parquet file.Doc_Chunk: Create chunks from one or more documents.Tokenization: Create embeddings for document chunks.1- This notebook uses the Wikimedia API for retrieving the initial HTML documents and the llama-tokenizer from Hugging Face.
2- In order to use the notebook, users must provide a .env file with a valid access token to be used for accessing the Wikimedia endpoint (instructions can be found here) and a Hugging Face token for loading the model (instructions can be found here). The .env file will look something like this:
WIKI_ACCESS_TOKEN='eyxxx'\nHF_READ_ACCESS_TOKEN='hf_xxx'\n 3- Install the DPK library into your environment (a minimal sanity check for these prerequisites is sketched below).
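Before running the transforms, it can help to verify these prerequisites with a quick check. The sketch below assumes the variable names shown in the .env example above. import os\n\nfrom dotenv import load_dotenv\n\n# Load the .env file described above, overriding any values already set in the environment.\nload_dotenv(\".env\", override=True)\n\n# Fail early if either token is missing, before any transform is invoked.\nfor var in (\"WIKI_ACCESS_TOKEN\", \"HF_READ_ACCESS_TOKEN\"):\n    assert os.getenv(var), f\"Missing {var} in .env\"\n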
"},{"location":"examples/dpk-ingest-chunk-tokenize/#setup-the-experiment","title":"\ud83d\udd17 Setup the experiment\u00b6","text":"DPK requires that we define a source/input folder where the transform sequence will be ingesting the document and a destination/output folder where the embedding will be stored. We will also initialize the list of articles we want to use in our application
"},{"location":"examples/dpk-ingest-chunk-tokenize/#injest","title":"\ud83d\udd17 Injest\u00b6","text":"Invoke Docling2Parquet tansform that will parse the HTML document and create a Markdown
"},{"location":"examples/dpk-ingest-chunk-tokenize/#chunk","title":"\ud83d\udd17 Chunk\u00b6","text":"Invoke DocChunk tansform to break the HTML document into chunks
"},{"location":"examples/dpk-ingest-chunk-tokenize/#tokenization","title":"\ud83d\udd17 Tokenization\u00b6","text":"Invoke Tokenization transform to create embedding of various chunks
"},{"location":"examples/dpk-ingest-chunk-tokenize/#summary","title":"\u2705 Summary\u00b6","text":"This notebook demonstrated how to run a DPK pipeline using IBM's Data Prep Kit and the Docling library. Each transform create one or more parquet files that users can explore to better understand what each stage of the pipeline produces. The see the output of the final stage, we will use Pandas to read the final parquet file and display its content
"},{"location":"examples/enrich_doclingdocument/","title":"Enrich a DoclingDocument","text":"Enrich an existing DoclingDocument JSON with a custom model (post-conversion).
What this example does
Prerequisites
How to run
python docs/examples/enrich_doclingdocument.py.input_doc_path and input_pdf_path if your data is elsewhere.Notes
BATCH_SIZE controls how many elements are passed to the model at once.prepare_element() crops context around elements based on the model's expansion.### Load modules\n\nfrom pathlib import Path\nfrom typing import Iterable, Optional\n\nfrom docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem\nfrom rich.pretty import pprint\n\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\nfrom docling.datamodel.accelerator_options import AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.document import InputDocument\nfrom docling.models.base_model import BaseItemAndImageEnrichmentModel\nfrom docling.models.document_picture_classifier import (\n DocumentPictureClassifier,\n DocumentPictureClassifierOptions,\n)\nfrom docling.utils.utils import chunkify\n\n### Define batch size used for processing\n\nBATCH_SIZE = 4\n# Trade-off: larger batches improve throughput but increase memory usage.\n\n### From DocItem to the model inputs\n# The following function is responsible for taking an item and applying the required pre-processing for the model.\n# In this case we generate a cropped image from the document backend.\n\n\ndef prepare_element(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n element: NodeItem,\n) -> Optional[ItemAndImageEnrichmentElement]:\n if not model.is_processable(doc=doc, element=element):\n return None\n\n assert isinstance(element, DocItem)\n element_prov = element.prov[0]\n\n bbox = element_prov.bbox\n width = bbox.r - bbox.l\n height = bbox.t - bbox.b\n\n expanded_bbox = BoundingBox(\n l=bbox.l - width * model.expansion_factor,\n t=bbox.t + height * model.expansion_factor,\n r=bbox.r + width * model.expansion_factor,\n b=bbox.b - height * model.expansion_factor,\n coord_origin=bbox.coord_origin,\n )\n\n page_ix = element_prov.page_no - 1\n page_backend = backend.load_page(page_no=page_ix)\n cropped_image = page_backend.get_page_image(\n scale=model.images_scale, cropbox=expanded_bbox\n )\n return ItemAndImageEnrichmentElement(item=element, image=cropped_image)\n\n\n### Iterate through the document\n# This block defines the `enrich_document()` which is responsible for iterating through the document\n# and batch the selected document items for running through the model.\n\n\ndef enrich_document(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n) -> DoclingDocument:\n def _prepare_elements(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n ) -> Iterable[NodeItem]:\n for doc_element, _level in doc.iterate_items():\n prepared_element = prepare_element(\n doc=doc, backend=backend, model=model, element=doc_element\n )\n if prepared_element is not None:\n yield prepared_element\n\n for element_batch in chunkify(\n _prepare_elements(doc, backend, model),\n BATCH_SIZE,\n ):\n for element in model(doc=doc, element_batch=element_batch): # Must exhaust!\n pass\n\n return doc\n\n\n### Open and process\n# The `main()` function which initializes the document and model objects for calling `enrich_document()`.\n\n\ndef main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_pdf_path = data_folder / \"pdf/2206.01062.pdf\"\n\n input_doc_path = data_folder / \"groundtruth/docling_v2/2206.01062.json\"\n\n doc = DoclingDocument.load_from_json(input_doc_path)\n\n in_pdf_doc = InputDocument(\n 
input_pdf_path,\n format=InputFormat.PDF,\n backend=PyPdfiumDocumentBackend,\n filename=input_pdf_path.name,\n )\n backend = in_pdf_doc._backend\n\n model = DocumentPictureClassifier(\n enabled=True,\n artifacts_path=None,\n options=DocumentPictureClassifierOptions(),\n accelerator_options=AcceleratorOptions(),\n )\n\n doc = enrich_document(doc=doc, backend=backend, model=model)\n\n for pic in doc.pictures[:5]:\n print(pic.self_ref)\n pprint(pic.annotations)\n\n\nif __name__ == \"__main__\":\n main()\n### Load modules from pathlib import Path from typing import Iterable, Optional from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem from rich.pretty import pprint from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend from docling.datamodel.accelerator_options import AcceleratorOptions from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement from docling.datamodel.document import InputDocument from docling.models.base_model import BaseItemAndImageEnrichmentModel from docling.models.document_picture_classifier import ( DocumentPictureClassifier, DocumentPictureClassifierOptions, ) from docling.utils.utils import chunkify ### Define batch size used for processing BATCH_SIZE = 4 # Trade-off: larger batches improve throughput but increase memory usage. ### From DocItem to the model inputs # The following function is responsible for taking an item and applying the required pre-processing for the model. # In this case we generate a cropped image from the document backend. def prepare_element( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, element: NodeItem, ) -> Optional[ItemAndImageEnrichmentElement]: if not model.is_processable(doc=doc, element=element): return None assert isinstance(element, DocItem) element_prov = element.prov[0] bbox = element_prov.bbox width = bbox.r - bbox.l height = bbox.t - bbox.b expanded_bbox = BoundingBox( l=bbox.l - width * model.expansion_factor, t=bbox.t + height * model.expansion_factor, r=bbox.r + width * model.expansion_factor, b=bbox.b - height * model.expansion_factor, coord_origin=bbox.coord_origin, ) page_ix = element_prov.page_no - 1 page_backend = backend.load_page(page_no=page_ix) cropped_image = page_backend.get_page_image( scale=model.images_scale, cropbox=expanded_bbox ) return ItemAndImageEnrichmentElement(item=element, image=cropped_image) ### Iterate through the document # This block defines the `enrich_document()` which is responsible for iterating through the document # and batch the selected document items for running through the model. def enrich_document( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, ) -> DoclingDocument: def _prepare_elements( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, ) -> Iterable[NodeItem]: for doc_element, _level in doc.iterate_items(): prepared_element = prepare_element( doc=doc, backend=backend, model=model, element=doc_element ) if prepared_element is not None: yield prepared_element for element_batch in chunkify( _prepare_elements(doc, backend, model), BATCH_SIZE, ): for element in model(doc=doc, element_batch=element_batch): # Must exhaust! pass return doc ### Open and process # The `main()` function which initializes the document and model objects for calling `enrich_document()`. 
def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_pdf_path = data_folder / \"pdf/2206.01062.pdf\" input_doc_path = data_folder / \"groundtruth/docling_v2/2206.01062.json\" doc = DoclingDocument.load_from_json(input_doc_path) in_pdf_doc = InputDocument( input_pdf_path, format=InputFormat.PDF, backend=PyPdfiumDocumentBackend, filename=input_pdf_path.name, ) backend = in_pdf_doc._backend model = DocumentPictureClassifier( enabled=True, artifacts_path=None, options=DocumentPictureClassifierOptions(), accelerator_options=AcceleratorOptions(), ) doc = enrich_document(doc=doc, backend=backend, model=model) for pic in doc.pictures[:5]: print(pic.self_ref) pprint(pic.annotations) if __name__ == \"__main__\": main()"},{"location":"examples/enrich_simple_pipeline/","title":"Enrich simple pipeline","text":"In\u00a0[\u00a0]: Copied!
import logging\nfrom pathlib import Path\nimport logging from pathlib import Path In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import ConvertPipelineOptions\nfrom docling.document_converter import (\n DocumentConverter,\n HTMLFormatOption,\n WordFormatOption,\n)\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ConvertPipelineOptions from docling.document_converter import ( DocumentConverter, HTMLFormatOption, WordFormatOption, ) In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def main():\n input_path = Path(\"tests/data/docx/word_sample.docx\")\n\n pipeline_options = ConvertPipelineOptions()\n pipeline_options.do_picture_classification = True\n pipeline_options.do_picture_description = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.DOCX: WordFormatOption(pipeline_options=pipeline_options),\n InputFormat.HTML: HTMLFormatOption(pipeline_options=pipeline_options),\n },\n )\n\n res = doc_converter.convert(input_path)\n\n print(res.document.export_to_markdown())\n def main(): input_path = Path(\"tests/data/docx/word_sample.docx\") pipeline_options = ConvertPipelineOptions() pipeline_options.do_picture_classification = True pipeline_options.do_picture_description = True doc_converter = DocumentConverter( format_options={ InputFormat.DOCX: WordFormatOption(pipeline_options=pipeline_options), InputFormat.HTML: HTMLFormatOption(pipeline_options=pipeline_options), }, ) res = doc_converter.convert(input_path) print(res.document.export_to_markdown()) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/export_figures/","title":"Figure export","text":"
Export page, figure, and table images from a PDF and save rich outputs.
What this example does
scratch/.Prerequisites
pip install pillow) if not already available via Docling's deps.docling from your Python environment.How to run
python docs/examples/export_figures.py.scratch/.Key options
IMAGE_RESOLUTION_SCALE: increase to render higher-resolution images (e.g., 2.0).PdfPipelineOptions.generate_page_images/generate_picture_images: preserve images for export.ImageRefMode: choose EMBEDDED or REFERENCED when saving Markdown/HTML.Input document
tests/data/pdf/2206.01062.pdf. Change input_doc_path as needed.import logging\nimport time\nfrom pathlib import Path\n\nfrom docling_core.types.doc import ImageRefMode, PictureItem, TableItem\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\nIMAGE_RESOLUTION_SCALE = 2.0\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n # Keep page/element images so they can be exported. The `images_scale` controls\n # the rendered image resolution (scale=1 ~ 72 DPI). The `generate_*` toggles\n # decide which elements are enriched with images.\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n doc_filename = conv_res.input.file.stem\n\n # Save page images\n for page_no, page in conv_res.document.pages.items():\n page_no = page.page_no\n page_image_filename = output_dir / f\"{doc_filename}-{page_no}.png\"\n with page_image_filename.open(\"wb\") as fp:\n page.image.pil_image.save(fp, format=\"PNG\")\n\n # Save images of figures and tables\n table_counter = 0\n picture_counter = 0\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TableItem):\n table_counter += 1\n element_image_filename = (\n output_dir / f\"{doc_filename}-table-{table_counter}.png\"\n )\n with element_image_filename.open(\"wb\") as fp:\n element.get_image(conv_res.document).save(fp, \"PNG\")\n\n if isinstance(element, PictureItem):\n picture_counter += 1\n element_image_filename = (\n output_dir / f\"{doc_filename}-picture-{picture_counter}.png\"\n )\n with element_image_filename.open(\"wb\") as fp:\n element.get_image(conv_res.document).save(fp, \"PNG\")\n\n # Save markdown with embedded pictures\n md_filename = output_dir / f\"{doc_filename}-with-images.md\"\n conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n # Save markdown with externally referenced pictures\n md_filename = output_dir / f\"{doc_filename}-with-image-refs.md\"\n conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)\n\n # Save HTML with externally referenced pictures\n html_filename = output_dir / f\"{doc_filename}-with-image-refs.html\"\n conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED)\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted and figures exported in {end_time:.2f} seconds.\")\n\n\nif __name__ == \"__main__\":\n main()\n import logging import time from pathlib import Path from docling_core.types.doc import ImageRefMode, PictureItem, TableItem from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) IMAGE_RESOLUTION_SCALE = 2.0 def main(): logging.basicConfig(level=logging.INFO) 
data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # Keep page/element images so they can be exported. The `images_scale` controls # the rendered image resolution (scale=1 ~ 72 DPI). The `generate_*` toggles # decide which elements are enriched with images. pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_res.input.file.stem # Save page images for page_no, page in conv_res.document.pages.items(): page_no = page.page_no page_image_filename = output_dir / f\"{doc_filename}-{page_no}.png\" with page_image_filename.open(\"wb\") as fp: page.image.pil_image.save(fp, format=\"PNG\") # Save images of figures and tables table_counter = 0 picture_counter = 0 for element, _level in conv_res.document.iterate_items(): if isinstance(element, TableItem): table_counter += 1 element_image_filename = ( output_dir / f\"{doc_filename}-table-{table_counter}.png\" ) with element_image_filename.open(\"wb\") as fp: element.get_image(conv_res.document).save(fp, \"PNG\") if isinstance(element, PictureItem): picture_counter += 1 element_image_filename = ( output_dir / f\"{doc_filename}-picture-{picture_counter}.png\" ) with element_image_filename.open(\"wb\") as fp: element.get_image(conv_res.document).save(fp, \"PNG\") # Save markdown with embedded pictures md_filename = output_dir / f\"{doc_filename}-with-images.md\" conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) # Save markdown with externally referenced pictures md_filename = output_dir / f\"{doc_filename}-with-image-refs.md\" conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED) # Save HTML with externally referenced pictures html_filename = output_dir / f\"{doc_filename}-with-image-refs.html\" conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED) end_time = time.time() - start_time _log.info(f\"Document converted and figures exported in {end_time:.2f} seconds.\") if __name__ == \"__main__\": main()"},{"location":"examples/export_multimodal/","title":"Multimodal export","text":"Export multimodal page data (image bytes, text, segments) to a Parquet file.
What this example does
.parquet in scratch/.Prerequisites
pandas. Optional: datasets and Pillow for the commented demo.How to run
python docs/examples/export_multimodal.py.scratch/.Key options
IMAGE_RESOLUTION_SCALE: page rendering scale (1 ~ 72 DPI).PdfPipelineOptions.generate_page_images: keep page images for export.Requirements
pyarrow or fastparquet (pip install pyarrow is the most common choice).Input document
tests/data/pdf/2206.01062.pdf. Change input_doc_path as needed.Notes
import datetime\nimport logging\nimport time\nfrom pathlib import Path\n\nimport pandas as pd\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.utils.export import generate_multimodal_pages\nfrom docling.utils.utils import create_hash\n\n_log = logging.getLogger(__name__)\n\nIMAGE_RESOLUTION_SCALE = 2.0\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n # Keep page images so they can be exported to the multimodal rows.\n # Use PdfPipelineOptions.images_scale to control the render scale (1 ~ 72 DPI).\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n\n rows = []\n for (\n content_text,\n content_md,\n content_dt,\n page_cells,\n page_segments,\n page,\n ) in generate_multimodal_pages(conv_res):\n dpi = page._default_image_scale * 72\n\n rows.append(\n {\n \"document\": conv_res.input.file.name,\n \"hash\": conv_res.input.document_hash,\n \"page_hash\": create_hash(\n conv_res.input.document_hash + \":\" + str(page.page_no - 1)\n ),\n \"image\": {\n \"width\": page.image.width,\n \"height\": page.image.height,\n \"bytes\": page.image.tobytes(),\n },\n \"cells\": page_cells,\n \"contents\": content_text,\n \"contents_md\": content_md,\n \"contents_dt\": content_dt,\n \"segments\": page_segments,\n \"extra\": {\n \"page_num\": page.page_no + 1,\n \"width_in_points\": page.size.width,\n \"height_in_points\": page.size.height,\n \"dpi\": dpi,\n },\n }\n )\n\n # Generate one parquet from all documents\n df_result = pd.json_normalize(rows)\n now = datetime.datetime.now()\n output_filename = output_dir / f\"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet\"\n df_result.to_parquet(output_filename)\n\n end_time = time.time() - start_time\n\n _log.info(\n f\"Document converted and multimodal pages generated in {end_time:.2f} seconds.\"\n )\n\n # This block demonstrates how the file can be opened with the HF datasets library\n # from datasets import Dataset\n # from PIL import Image\n # multimodal_df = pd.read_parquet(output_filename)\n\n # # Convert pandas DataFrame to Hugging Face Dataset and load bytes into image\n # dataset = Dataset.from_pandas(multimodal_df)\n # def transforms(examples):\n # examples[\"image\"] = Image.frombytes('RGB', (examples[\"image.width\"], examples[\"image.height\"]), examples[\"image.bytes\"], 'raw')\n # return examples\n # dataset = dataset.map(transforms)\n\n\nif __name__ == \"__main__\":\n main()\n import datetime import logging import time from pathlib import Path import pandas as pd from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.utils.export import generate_multimodal_pages from docling.utils.utils import create_hash _log = logging.getLogger(__name__) IMAGE_RESOLUTION_SCALE = 2.0 def main(): logging.basicConfig(level=logging.INFO) 
data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # Keep page images so they can be exported to the multimodal rows. # Use PdfPipelineOptions.images_scale to control the render scale (1 ~ 72 DPI). pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) rows = [] for ( content_text, content_md, content_dt, page_cells, page_segments, page, ) in generate_multimodal_pages(conv_res): dpi = page._default_image_scale * 72 rows.append( { \"document\": conv_res.input.file.name, \"hash\": conv_res.input.document_hash, \"page_hash\": create_hash( conv_res.input.document_hash + \":\" + str(page.page_no - 1) ), \"image\": { \"width\": page.image.width, \"height\": page.image.height, \"bytes\": page.image.tobytes(), }, \"cells\": page_cells, \"contents\": content_text, \"contents_md\": content_md, \"contents_dt\": content_dt, \"segments\": page_segments, \"extra\": { \"page_num\": page.page_no + 1, \"width_in_points\": page.size.width, \"height_in_points\": page.size.height, \"dpi\": dpi, }, } ) # Generate one parquet from all documents df_result = pd.json_normalize(rows) now = datetime.datetime.now() output_filename = output_dir / f\"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet\" df_result.to_parquet(output_filename) end_time = time.time() - start_time _log.info( f\"Document converted and multimodal pages generated in {end_time:.2f} seconds.\" ) # This block demonstrates how the file can be opened with the HF datasets library # from datasets import Dataset # from PIL import Image # multimodal_df = pd.read_parquet(output_filename) # # Convert pandas DataFrame to Hugging Face Dataset and load bytes into image # dataset = Dataset.from_pandas(multimodal_df) # def transforms(examples): # examples[\"image\"] = Image.frombytes('RGB', (examples[\"image.width\"], examples[\"image.height\"]), examples[\"image.bytes\"], 'raw') # return examples # dataset = dataset.map(transforms) if __name__ == \"__main__\": main()"},{"location":"examples/export_tables/","title":"Table export","text":"Extract tables from a PDF and export them as CSV and HTML.
What this example does
scratch/.Prerequisites
pandas.How to run
python docs/examples/export_tables.py.scratch/.Input document
tests/data/pdf/2206.01062.pdf. Change input_doc_path as needed.Notes
table.export_to_dataframe() returns a pandas DataFrame for convenient export/processing.DataFrame.to_markdown() may require the optional tabulate package (pip install tabulate). If unavailable, skip the print or use to_csv().import logging\nimport time\nfrom pathlib import Path\n\nimport pandas as pd\n\nfrom docling.document_converter import DocumentConverter\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n doc_converter = DocumentConverter()\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n\n doc_filename = conv_res.input.file.stem\n\n # Export tables\n for table_ix, table in enumerate(conv_res.document.tables):\n table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res.document)\n print(f\"## Table {table_ix}\")\n print(table_df.to_markdown())\n\n # Save the table as CSV\n element_csv_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.csv\"\n _log.info(f\"Saving CSV table to {element_csv_filename}\")\n table_df.to_csv(element_csv_filename)\n\n # Save the table as HTML\n element_html_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.html\"\n _log.info(f\"Saving HTML table to {element_html_filename}\")\n with element_html_filename.open(\"w\") as fp:\n fp.write(table.export_to_html(doc=conv_res.document))\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted and tables exported in {end_time:.2f} seconds.\")\n\n\nif __name__ == \"__main__\":\n main()\n import logging import time from pathlib import Path import pandas as pd from docling.document_converter import DocumentConverter _log = logging.getLogger(__name__) def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") doc_converter = DocumentConverter() start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_res.input.file.stem # Export tables for table_ix, table in enumerate(conv_res.document.tables): table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res.document) print(f\"## Table {table_ix}\") print(table_df.to_markdown()) # Save the table as CSV element_csv_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.csv\" _log.info(f\"Saving CSV table to {element_csv_filename}\") table_df.to_csv(element_csv_filename) # Save the table as HTML element_html_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.html\" _log.info(f\"Saving HTML table to {element_html_filename}\") with element_html_filename.open(\"w\") as fp: fp.write(table.export_to_html(doc=conv_res.document)) end_time = time.time() - start_time _log.info(f\"Document converted and tables exported in {end_time:.2f} seconds.\") if __name__ == \"__main__\": main()"},{"location":"examples/extraction/","title":"Information extraction","text":"\ud83d\udc49 NOTE: The extraction API is currently in beta and may change without prior notice.
Docling provides the capability of extracting information, i.e. structured data, from unstructured documents.
The user can provide the desired data schema (the template) either as a string, a dictionary, or a Pydantic model, and Docling will return the extracted data in a standardized output, organized by page.
Check out the subsections below for different usage scenarios.
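For orientation, here is a minimal sketch of the flow described above, using the same DocumentExtractor API that the subsections below walk through step by step (the input path is a placeholder):
from docling.datamodel.base_models import InputFormat\nfrom docling.document_extractor import DocumentExtractor\n\n# Minimal sketch: extract structured data using a dict template.\nextractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])\nresult = extractor.extract(\n    source=\"path/to/document.pdf\",  # placeholder: any supported image or PDF\n    template={\"bill_no\": \"string\", \"total\": \"float\"},  # desired schema as a dict\n)\nfor page in result.pages:  # extraction results are organized by page\n    print(page.extracted_data)\n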
In\u00a0[\u00a0]: Copied!%pip install -q docling[vlm] # Install the Docling package with VLM support\n%pip install -q docling[vlm] # Install the Docling package with VLM support In\u00a0[1]: Copied!
from IPython import display\nfrom pydantic import BaseModel, Field\nfrom rich import print\nfrom IPython import display from pydantic import BaseModel, Field from rich import print
In this notebook, we will work with an example input image \u2014 let's quickly inspect it:
In\u00a0[2]: Copied!file_path = (\n \"https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg\"\n)\ndisplay.HTML(f\"<img src='{file_path}' height='1000'>\")\n file_path = ( \"https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg\" ) display.HTML(f\"\") Out[2]: Let's first define our extractor:
In\u00a0[3]: Copied!from docling.datamodel.base_models import InputFormat\nfrom docling.document_extractor import DocumentExtractor\n\nextractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])\nfrom docling.datamodel.base_models import InputFormat from docling.document_extractor import DocumentExtractor extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])
In the following, we look at different ways to define the data template.
In\u00a0[4]: Copied!result = extractor.extract(\n source=file_path,\n template='{\"bill_no\": \"string\", \"total\": \"float\"}',\n)\nprint(result.pages)\n result = extractor.extract( source=file_path, template='{\"bill_no\": \"string\", \"total\": \"float\"}', ) print(result.pages) /Users/pva/work/github.com/DS4SD/docling/docling/document_extractor.py:143: UserWarning: The extract API is currently experimental and may change without prior notice.\nOnly PDF and image formats are supported.\n return next(all_res)\nYou have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.\nThe following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={'bill_no': '3139', 'total': 3949.75},\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75}',\n errors=[]\n )\n]\n In\u00a0[5]: Copied! result = extractor.extract(\n source=file_path,\n template={\n \"bill_no\": \"string\",\n \"total\": \"float\",\n },\n)\nprint(result.pages)\n result = extractor.extract( source=file_path, template={ \"bill_no\": \"string\", \"total\": \"float\", }, ) print(result.pages) The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={'bill_no': '3139', 'total': 3949.75},\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75}',\n errors=[]\n )\n]\n First we define the Pydantic model we want to use
In\u00a0[6]: Copied!from typing import Optional\n\n\nclass Invoice(BaseModel):\n bill_no: str = Field(\n examples=[\"A123\", \"5414\"]\n ) # provide some examples, but no default value\n total: float = Field(\n default=10, examples=[20]\n ) # provide some examples and a default value\n tax_id: Optional[str] = Field(default=None, examples=[\"1234567890\"])\nfrom typing import Optional class Invoice(BaseModel): bill_no: str = Field( examples=[\"A123\", \"5414\"] ) # provide some examples, but no default value total: float = Field( default=10, examples=[20] ) # provide some examples and a default value tax_id: Optional[str] = Field(default=None, examples=[\"1234567890\"])
The class itself can then be used directly as the template:
In\u00a0[7]: Copied!result = extractor.extract(\n source=file_path,\n template=Invoice,\n)\nprint(result.pages)\nresult = extractor.extract( source=file_path, template=Invoice, ) print(result.pages)
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={'bill_no': '3139', 'total': 3949.75, 'tax_id': None},\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": null}',\n errors=[]\n )\n]\n Alternatively, a Pydantic model instance can be passed as a template, which allows the default values to be overridden.
This can be very useful when we already have context that is more relevant than the default values predefined in the model definition.
E.g. in the example below:
bill_no and total are actually set from the values extracted from the data, while no tax_id could be extracted, so the updated default we provided was applied. result = extractor.extract(\n source=file_path,\n template=Invoice(\n bill_no=\"41\",\n total=100,\n tax_id=\"42\",\n ),\n)\nprint(result.pages)\nresult = extractor.extract( source=file_path, template=Invoice( bill_no=\"41\", total=100, tax_id=\"42\", ), ) print(result.pages)
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={'bill_no': '3139', 'total': 3949.75, 'tax_id': '42'},\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75, \"tax_id\": \"42\"}',\n errors=[]\n )\n]\n Besides a flat template, we can in principle use any Pydantic model, which enables reuse and makes it possible to capture hierarchical structure:
In\u00a0[9]: Copied!class Contact(BaseModel):\n name: Optional[str] = Field(default=None, examples=[\"Smith\"])\n address: str = Field(default=\"123 Main St\", examples=[\"456 Elm St\"])\n postal_code: str = Field(default=\"12345\", examples=[\"67890\"])\n city: str = Field(default=\"Anytown\", examples=[\"Othertown\"])\n country: Optional[str] = Field(default=None, examples=[\"Canada\"])\n\n\nclass ExtendedInvoice(BaseModel):\n bill_no: str = Field(\n examples=[\"A123\", \"5414\"]\n ) # provide some examples, but not the actual value of the test sample\n total: float = Field(\n default=10, examples=[20]\n ) # provide a default value and some examples\n garden_work_hours: int = Field(default=1, examples=[2])\n sender: Contact = Field(default=Contact(), examples=[Contact()])\n receiver: Contact = Field(default=Contact(), examples=[Contact()])\nclass Contact(BaseModel): name: Optional[str] = Field(default=None, examples=[\"Smith\"]) address: str = Field(default=\"123 Main St\", examples=[\"456 Elm St\"]) postal_code: str = Field(default=\"12345\", examples=[\"67890\"]) city: str = Field(default=\"Anytown\", examples=[\"Othertown\"]) country: Optional[str] = Field(default=None, examples=[\"Canada\"]) class ExtendedInvoice(BaseModel): bill_no: str = Field( examples=[\"A123\", \"5414\"] ) # provide some examples, but not the actual value of the test sample total: float = Field( default=10, examples=[20] ) # provide a default value and some examples garden_work_hours: int = Field(default=1, examples=[2]) sender: Contact = Field(default=Contact(), examples=[Contact()]) receiver: Contact = Field(default=Contact(), examples=[Contact()]) In\u00a0[10]: Copied!
result = extractor.extract(\n source=file_path,\n template=ExtendedInvoice,\n)\nprint(result.pages)\nresult = extractor.extract( source=file_path, template=ExtendedInvoice, ) print(result.pages)
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n
[\n ExtractedPageData(\n page_no=1,\n extracted_data={\n 'bill_no': '3139',\n 'total': 3949.75,\n 'garden_work_hours': 28,\n 'sender': {\n 'name': 'Robert Schneider',\n 'address': 'Rue du Lac 1268',\n 'postal_code': '2501',\n 'city': 'Biel',\n 'country': 'Switzerland'\n },\n 'receiver': {\n 'name': 'Pia Rutschmann',\n 'address': 'Marktgasse 28',\n 'postal_code': '9400',\n 'city': 'Rorschach',\n 'country': 'Switzerland'\n }\n },\n raw_text='{\"bill_no\": \"3139\", \"total\": 3949.75, \"garden_work_hours\": 28, \"sender\": {\"name\": \"Robert \nSchneider\", \"address\": \"Rue du Lac 1268\", \"postal_code\": \"2501\", \"city\": \"Biel\", \"country\": \"Switzerland\"}, \n\"receiver\": {\"name\": \"Pia Rutschmann\", \"address\": \"Marktgasse 28\", \"postal_code\": \"9400\", \"city\": \"Rorschach\", \n\"country\": \"Switzerland\"}}',\n errors=[]\n )\n]\n The generated response data can be easily validated and loaded via Pydantic:
In\u00a0[11]: Copied!invoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data)\nprint(invoice)\ninvoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data) print(invoice)
ExtendedInvoice(\n bill_no='3139',\n total=3949.75,\n garden_work_hours=28,\n sender=Contact(\n name='Robert Schneider',\n address='Rue du Lac 1268',\n postal_code='2501',\n city='Biel',\n country='Switzerland'\n ),\n receiver=Contact(\n name='Pia Rutschmann',\n address='Marktgasse 28',\n postal_code='9400',\n city='Rorschach',\n country='Switzerland'\n )\n)\n
This way, we can get from completely unstructured data to a very structured and developer-friendly representation:
In\u00a0[12]: Copied!print(\n f\"Invoice #{invoice.bill_no} was sent by {invoice.sender.name} \"\n f\"to {invoice.receiver.name} at {invoice.sender.address}.\"\n)\n print( f\"Invoice #{invoice.bill_no} was sent by {invoice.sender.name} \" f\"to {invoice.receiver.name} at {invoice.sender.address}.\" ) Invoice #3139 was sent by Robert Schneider to Pia Rutschmann at Rue du Lac 1268.\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/extraction/#information-extraction","title":"Information extraction\u00b6","text":""},{"location":"examples/extraction/#defining-the-extractor","title":"Defining the extractor\u00b6","text":""},{"location":"examples/extraction/#using-a-string-template","title":"Using a string template\u00b6","text":""},{"location":"examples/extraction/#using-a-dict-template","title":"Using a dict template\u00b6","text":""},{"location":"examples/extraction/#using-a-pydantic-model-template","title":"Using a Pydantic model template\u00b6","text":""},{"location":"examples/extraction/#advanced-pydantic-model","title":"Advanced Pydantic model\u00b6","text":""},{"location":"examples/extraction/#validating-and-loading-the-extracted-data","title":"Validating and loading the extracted data\u00b6","text":""},{"location":"examples/full_page_ocr/","title":"Force full page OCR","text":"
Force full-page OCR on a PDF using different OCR backends.
What this example does
ocr_options.Prerequisites
How to run
python docs/examples/full_page_ocr.py.Choosing an OCR backend
ocr_options = ... line below. Exactly one should be active.force_full_page_ocr=True processes each page purely via OCR (often slower than hybrid detection). Use when layout extraction is unreliable or the PDF contains scanned pages.EasyOcrOptions, TesseractOcrOptions, OcrMacOptions, RapidOcrOptions.Input document
tests/data/pdf/2206.01062.pdf. Change input_doc_path as needed.from pathlib import Path\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n TesseractCliOcrOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n\ndef main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n\n # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,\n # TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions\n # ocr_options = EasyOcrOptions(force_full_page_ocr=True)\n # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n # ocr_options = OcrMacOptions(force_full_page_ocr=True)\n # ocr_options = RapidOcrOptions(force_full_page_ocr=True)\n ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)\n pipeline_options.ocr_options = ocr_options\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n doc = converter.convert(input_doc_path).document\n md = doc.export_to_markdown()\n print(md)\n\n\nif __name__ == \"__main__\":\n main()\n from pathlib import Path from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, TesseractCliOcrOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = PdfPipelineOptions() pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, # TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions # ocr_options = EasyOcrOptions(force_full_page_ocr=True) # ocr_options = TesseractOcrOptions(force_full_page_ocr=True) # ocr_options = OcrMacOptions(force_full_page_ocr=True) # ocr_options = RapidOcrOptions(force_full_page_ocr=True) ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True) pipeline_options.ocr_options = ocr_options converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(input_doc_path).document md = doc.export_to_markdown() print(md) if __name__ == \"__main__\": main()"},{"location":"examples/gpu_standard_pipeline/","title":"Standard pipeline","text":"What this example does
Requirements
pip install doclingHow to run
python docs/examples/gpu_standard_pipeline.pyThis example is part of a set of GPU optimization strategies. Read more about it in GPU support
In\u00a0[\u00a0]: Copied!import datetime\nimport logging\nimport time\nfrom pathlib import Path\n\nimport numpy as np\nfrom pydantic import TypeAdapter\n\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.pipeline_options import (\n ThreadedPdfPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline\nfrom docling.utils.profiling import ProfilingItem\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n logging.getLogger(\"docling\").setLevel(logging.WARNING)\n _log.setLevel(logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n # input_doc_path = data_folder / \"pdf\" / \"2305.03393v1.pdf\" # 14 pages\n input_doc_path = data_folder / \"pdf\" / \"redp5110_sampled.pdf\" # 18 pages\n\n pipeline_options = ThreadedPdfPipelineOptions(\n accelerator_options=AcceleratorOptions(\n device=AcceleratorDevice.CUDA,\n ),\n ocr_batch_size=4,\n layout_batch_size=64,\n table_batch_size=4,\n )\n pipeline_options.do_ocr = False\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ThreadedStandardPdfPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n\n start_time = time.time()\n doc_converter.initialize_pipeline(InputFormat.PDF)\n init_runtime = time.time() - start_time\n _log.info(f\"Pipeline initialized in {init_runtime:.2f} seconds.\")\n\n start_time = time.time()\n conv_result = doc_converter.convert(input_doc_path)\n pipeline_runtime = time.time() - start_time\n assert conv_result.status == ConversionStatus.SUCCESS\n\n num_pages = len(conv_result.pages)\n _log.info(f\"Document converted in {pipeline_runtime:.2f} seconds.\")\n _log.info(f\" {num_pages / pipeline_runtime:.2f} pages/second.\")\n\n\nif __name__ == \"__main__\":\n main()\n import datetime import logging import time from pathlib import Path import numpy as np from pydantic import TypeAdapter from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.pipeline_options import ( ThreadedPdfPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline from docling.utils.profiling import ProfilingItem _log = logging.getLogger(__name__) def main(): logging.getLogger(\"docling\").setLevel(logging.WARNING) _log.setLevel(logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" # input_doc_path = data_folder / \"pdf\" / \"2305.03393v1.pdf\" # 14 pages input_doc_path = data_folder / \"pdf\" / \"redp5110_sampled.pdf\" # 18 pages pipeline_options = ThreadedPdfPipelineOptions( accelerator_options=AcceleratorOptions( device=AcceleratorDevice.CUDA, ), ocr_batch_size=4, layout_batch_size=64, table_batch_size=4, ) pipeline_options.do_ocr = False doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ThreadedStandardPdfPipeline, pipeline_options=pipeline_options, ) } ) start_time = time.time() doc_converter.initialize_pipeline(InputFormat.PDF) init_runtime = time.time() - start_time _log.info(f\"Pipeline initialized in {init_runtime:.2f} seconds.\") start_time = time.time() conv_result = 
doc_converter.convert(input_doc_path) pipeline_runtime = time.time() - start_time assert conv_result.status == ConversionStatus.SUCCESS num_pages = len(conv_result.pages) _log.info(f\"Document converted in {pipeline_runtime:.2f} seconds.\") _log.info(f\" {num_pages / pipeline_runtime:.2f} pages/second.\") if __name__ == \"__main__\": main()"},{"location":"examples/gpu_standard_pipeline/#example-code","title":"Example code\u00b6","text":""},{"location":"examples/gpu_vlm_pipeline/","title":"VLM pipeline","text":"What this example does
Requirements
pip install doclingpip install vllmHow to run
python docs/examples/gpu_vlm_pipeline.pyThis example is part of a set of GPU optimization strategies. Read more about it in GPU support
In\u00a0[\u00a0]: Copied!import datetime\nimport logging\nimport time\nfrom pathlib import Path\n\nimport numpy as np\nfrom pydantic import TypeAdapter\n\nfrom docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.datamodel.settings import settings\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nfrom docling.utils.profiling import ProfilingItem\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n logging.getLogger(\"docling\").setLevel(logging.WARNING)\n _log.setLevel(logging.INFO)\n\n BATCH_SIZE = 64\n\n settings.perf.page_batch_size = BATCH_SIZE\n settings.debug.profile_pipeline_timings = True\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n # input_doc_path = data_folder / \"pdf\" / \"2305.03393v1.pdf\" # 14 pages\n input_doc_path = data_folder / \"pdf\" / \"redp5110_sampled.pdf\" # 18 pages\n\n vlm_options = ApiVlmOptions(\n url=\"http://localhost:8000/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000\n params=dict(\n model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id,\n max_tokens=4096,\n skip_special_tokens=True,\n ),\n prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt,\n timeout=90,\n scale=2.0,\n temperature=0.0,\n concurrency=BATCH_SIZE,\n stop_strings=[\"</doctag>\", \"<|end_of_text|>\"],\n response_format=ResponseFormat.DOCTAGS,\n )\n\n pipeline_options = VlmPipelineOptions(\n vlm_options=vlm_options,\n enable_remote_services=True, # required when using a remote inference service.\n )\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n )\n\n start_time = time.time()\n doc_converter.initialize_pipeline(InputFormat.PDF)\n end_time = time.time() - start_time\n _log.info(f\"Pipeline initialized in {end_time:.2f} seconds.\")\n\n now = datetime.datetime.now()\n conv_result = doc_converter.convert(input_doc_path)\n assert conv_result.status == ConversionStatus.SUCCESS\n\n num_pages = len(conv_result.pages)\n pipeline_runtime = conv_result.timings[\"pipeline_total\"].times[0]\n _log.info(f\"Document converted in {pipeline_runtime:.2f} seconds.\")\n _log.info(f\" [efficiency]: {num_pages / pipeline_runtime:.2f} pages/second.\")\n for stage in (\"page_init\", \"vlm\"):\n values = np.array(conv_result.timings[stage].times)\n _log.info(\n f\" [{stage}]: {np.min(values):.2f} / {np.median(values):.2f} / {np.max(values):.2f} seconds/page\"\n )\n\n TimingsT = TypeAdapter(dict[str, ProfilingItem])\n timings_file = Path(f\"result-timings-gpu-vlm-{now:%Y-%m-%d_%H-%M-%S}.json\")\n with timings_file.open(\"wb\") as fp:\n r = TimingsT.dump_json(conv_result.timings, indent=2)\n fp.write(r)\n _log.info(f\"Profile details in {timings_file}.\")\n\n\nif __name__ == \"__main__\":\n main()\n import datetime import logging import time from pathlib import Path import numpy as np from pydantic import TypeAdapter from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.datamodel.settings import 
settings from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline from docling.utils.profiling import ProfilingItem _log = logging.getLogger(__name__) def main(): logging.getLogger(\"docling\").setLevel(logging.WARNING) _log.setLevel(logging.INFO) BATCH_SIZE = 64 settings.perf.page_batch_size = BATCH_SIZE settings.debug.profile_pipeline_timings = True data_folder = Path(__file__).parent / \"../../tests/data\" # input_doc_path = data_folder / \"pdf\" / \"2305.03393v1.pdf\" # 14 pages input_doc_path = data_folder / \"pdf\" / \"redp5110_sampled.pdf\" # 18 pages vlm_options = ApiVlmOptions( url=\"http://localhost:8000/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000 params=dict( model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id, max_tokens=4096, skip_special_tokens=True, ), prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt, timeout=90, scale=2.0, temperature=0.0, concurrency=BATCH_SIZE, stop_strings=[\"\", \"<|end_of_text|>\"], response_format=ResponseFormat.DOCTAGS, ) pipeline_options = VlmPipelineOptions( vlm_options=vlm_options, enable_remote_services=True, # required when using a remote inference service. ) doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), } ) start_time = time.time() doc_converter.initialize_pipeline(InputFormat.PDF) end_time = time.time() - start_time _log.info(f\"Pipeline initialized in {end_time:.2f} seconds.\") now = datetime.datetime.now() conv_result = doc_converter.convert(input_doc_path) assert conv_result.status == ConversionStatus.SUCCESS num_pages = len(conv_result.pages) pipeline_runtime = conv_result.timings[\"pipeline_total\"].times[0] _log.info(f\"Document converted in {pipeline_runtime:.2f} seconds.\") _log.info(f\" [efficiency]: {num_pages / pipeline_runtime:.2f} pages/second.\") for stage in (\"page_init\", \"vlm\"): values = np.array(conv_result.timings[stage].times) _log.info( f\" [{stage}]: {np.min(values):.2f} / {np.median(values):.2f} / {np.max(values):.2f} seconds/page\" ) TimingsT = TypeAdapter(dict[str, ProfilingItem]) timings_file = Path(f\"result-timings-gpu-vlm-{now:%Y-%m-%d_%H-%M-%S}.json\") with timings_file.open(\"wb\") as fp: r = TimingsT.dump_json(conv_result.timings, indent=2) fp.write(r) _log.info(f\"Profile details in {timings_file}.\") if __name__ == \"__main__\": main()"},{"location":"examples/gpu_vlm_pipeline/#start-models-with-vllm","title":"Start models with vllm\u00b6","text":"vllm serve ibm-granite/granite-docling-258M \\\n --host 127.0.0.1 --port 8000 \\\n --max-num-seqs 512 \\\n --max-num-batched-tokens 8192 \\\n --enable-chunked-prefill \\\n --gpu-memory-utilization 0.9\n"},{"location":"examples/gpu_vlm_pipeline/#example-code","title":"Example code\u00b6","text":""},{"location":"examples/granitedocling_repetition_stopping/","title":"Granitedocling repetition stopping","text":"
Experimental VLM pipeline with custom repetition stopping criteria.
This script demonstrates the use of custom stopping criteria that detect repetitive location coordinate patterns in generated text and stop generation when such patterns are found.
What this example does
import logging\n\nfrom docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.utils.generation_utils import (\n DocTagsRepetitionStopper,\n)\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\nlogging.basicConfig(level=logging.INFO, format=\"%(levelname)s:%(name)s:%(message)s\")\n\n\n# Set up logging to see when repetition stopping is triggered\nlogging.basicConfig(level=logging.INFO)\n\n# Replace with a local path if preferred.\n# source = \"https://ibm.biz/docling-page-with-table\" # Example that shows no repetitions.\nsource = \"tests/data_scanned/old_newspaper.png\" # Example that creates repetitions.\nprint(f\"Processing document: {source}\")\n\n###### USING GRANITEDOCLING WITH CUSTOM REPETITION STOPPING\n\n## Using standard Huggingface Transformers (most portable, slowest)\ncustom_vlm_options = vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.model_copy()\n\n# Uncomment this to use MLX-accelerated version on Apple Silicon\n# custom_vlm_options = vlm_model_specs.GRANITEDOCLING_MLX.model_copy() # use this for Apple Silicon\n\n\n# Create custom VLM options with repetition stopping criteria\ncustom_vlm_options.custom_stopping_criteria = [\n DocTagsRepetitionStopper(N=32)\n] # check for repetitions for every 32 new tokens decoded.\n\npipeline_options = VlmPipelineOptions(\n vlm_options=custom_vlm_options,\n)\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.IMAGE: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n\ndoc = converter.convert(source=source).document\n\nprint(doc.export_to_markdown())\n\n## Using a remote VLM inference service (for example VLLM) - uncomment to use\n\n# custom_vlm_options = ApiVlmOptions(\n# url=\"http://localhost:8000/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000\n# params=dict(\n# model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id,\n# max_tokens=8192,\n# skip_special_tokens=True, # needed for VLLM\n# ),\n# headers={\n# \"Authorization\": \"Bearer YOUR_API_KEY\",\n# },\n# prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt,\n# timeout=90,\n# scale=2.0,\n# temperature=0.0,\n# response_format=ResponseFormat.DOCTAGS,\n# custom_stopping_criteria=[\n# DocTagsRepetitionStopper(N=1)\n# ], # check for repetitions for every new chunk of the response stream\n# )\n\n\n# pipeline_options = VlmPipelineOptions(\n# vlm_options=custom_vlm_options,\n# enable_remote_services=True, # required when using a remote inference service.\n# )\n\n# converter = DocumentConverter(\n# format_options={\n# InputFormat.IMAGE: PdfFormatOption(\n# pipeline_cls=VlmPipeline,\n# pipeline_options=pipeline_options,\n# ),\n# }\n# )\n\n# doc = converter.convert(source=source).document\n\n# print(doc.export_to_markdown())\n import logging from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import VlmPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.utils.generation_utils import ( DocTagsRepetitionStopper, ) from docling.pipeline.vlm_pipeline import VlmPipeline logging.basicConfig(level=logging.INFO, format=\"%(levelname)s:%(name)s:%(message)s\") # Set up logging to see when repetition stopping is triggered logging.basicConfig(level=logging.INFO) # Replace 
with a local path if preferred. # source = \"https://ibm.biz/docling-page-with-table\" # Example that shows no repetitions. source = \"tests/data_scanned/old_newspaper.png\" # Example that creates repetitions. print(f\"Processing document: {source}\") ###### USING GRANITEDOCLING WITH CUSTOM REPETITION STOPPING ## Using standard Huggingface Transformers (most portable, slowest) custom_vlm_options = vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.model_copy() # Uncomment this to use MLX-accelerated version on Apple Silicon # custom_vlm_options = vlm_model_specs.GRANITEDOCLING_MLX.model_copy() # use this for Apple Silicon # Create custom VLM options with repetition stopping criteria custom_vlm_options.custom_stopping_criteria = [ DocTagsRepetitionStopper(N=32) ] # check for repetitions for every 32 new tokens decoded. pipeline_options = VlmPipelineOptions( vlm_options=custom_vlm_options, ) converter = DocumentConverter( format_options={ InputFormat.IMAGE: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), } ) doc = converter.convert(source=source).document print(doc.export_to_markdown()) ## Using a remote VLM inference service (for example VLLM) - uncomment to use # custom_vlm_options = ApiVlmOptions( # url=\"http://localhost:8000/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000 # params=dict( # model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id, # max_tokens=8192, # skip_special_tokens=True, # needed for VLLM # ), # headers={ # \"Authorization\": \"Bearer YOUR_API_KEY\", # }, # prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt, # timeout=90, # scale=2.0, # temperature=0.0, # response_format=ResponseFormat.DOCTAGS, # custom_stopping_criteria=[ # DocTagsRepetitionStopper(N=1) # ], # check for repetitions for every new chunk of the response stream # ) # pipeline_options = VlmPipelineOptions( # vlm_options=custom_vlm_options, # enable_remote_services=True, # required when using a remote inference service. # ) # converter = DocumentConverter( # format_options={ # InputFormat.IMAGE: PdfFormatOption( # pipeline_cls=VlmPipeline, # pipeline_options=pipeline_options, # ), # } # ) # doc = converter.convert(source=source).document # print(doc.export_to_markdown())"},{"location":"examples/hybrid_chunking/","title":"Hybrid chunking","text":"Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.
For more details, see here.
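As a quick preview, the full flow covered in this notebook boils down to a few lines; here is a minimal sketch (each step is explained in the sections that follow, and the document source is the same sample used below):
from docling.chunking import HybridChunker\nfrom docling.document_converter import DocumentConverter\n\n# Minimal sketch of the hybrid chunking flow: convert, chunk, then contextualize\n# each chunk to obtain the text you would typically embed.\ndoc = DocumentConverter().convert(source=\"../../tests/data/md/wiki.md\").document\nchunker = HybridChunker()  # default tokenizer and max_tokens\nfor chunk in chunker.chunk(dl_doc=doc):\n    print(chunker.contextualize(chunk=chunk)[:100])\n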
In\u00a0[1]: Copied!%pip install -qU pip docling transformers\n%pip install -qU pip docling transformers
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
DOC_SOURCE = \"../../tests/data/md/wiki.md\"\nDOC_SOURCE = \"../../tests/data/md/wiki.md\"
We first convert the document:
In\u00a0[3]: Copied!from docling.document_converter import DocumentConverter\n\ndoc = DocumentConverter().convert(source=DOC_SOURCE).document\nfrom docling.document_converter import DocumentConverter doc = DocumentConverter().convert(source=DOC_SOURCE).document
For a basic chunking scenario, we can just instantiate a HybridChunker, which will use the default parameters.
from docling.chunking import HybridChunker\n\nchunker = HybridChunker()\nchunk_iter = chunker.chunk(dl_doc=doc)\nfrom docling.chunking import HybridChunker chunker = HybridChunker() chunk_iter = chunker.chunk(dl_doc=doc)
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors\n
\ud83d\udc49 NOTE: As you can see above, using the HybridChunker can sometimes trigger a warning from the transformers library; however, this is a \"false alarm\" \u2014 for details, check here.
Note that the text you would typically want to embed is the context-enriched one as returned by the contextualize() method:
for i, chunk in enumerate(chunk_iter):\n print(f\"=== {i} ===\")\n print(f\"chunk.text:\\n{f'{chunk.text[:300]}\u2026'!r}\")\n\n enriched_text = chunker.contextualize(chunk=chunk)\n print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}\u2026'!r}\")\n\n print()\n for i, chunk in enumerate(chunk_iter): print(f\"=== {i} ===\") print(f\"chunk.text:\\n{f'{chunk.text[:300]}\u2026'!r}\") enriched_text = chunker.contextualize(chunk=chunk) print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}\u2026'!r}\") print() === 0 ===\nchunk.text:\n'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver\u2026'\nchunker.contextualize(chunk):\n'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial \u2026'\n\n=== 1 ===\nchunk.text:\n'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1910s\u20131950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889\u2026'\n\n=== 2 ===\nchunk.text:\n'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1910s\u20131950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John \u2026'\n\n=== 3 ===\nchunk.text:\n'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1960s\u20131980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.\u2026'\n\nIn\u00a0[6]: Copied!
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom transformers import AutoTokenizer\n\nfrom docling.chunking import HybridChunker\n\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nMAX_TOKENS = 64 # set to a small number for illustrative purposes\n\ntokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n)\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer from docling.chunking import HybridChunker EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" MAX_TOKENS = 64 # set to a small number for illustrative purposes tokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case )
\ud83d\udc49 Alternatively, OpenAI tokenizers can be used as shown in the example below (uncomment to use \u2014 requires installing docling-core[chunking-openai]):
# import tiktoken\n\n# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer\n\n# tokenizer = OpenAITokenizer(\n# tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"),\n# max_tokens=128 * 1024, # context window length required for OpenAI tokenizers\n# )\n# import tiktoken # from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer # tokenizer = OpenAITokenizer( # tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"), # max_tokens=128 * 1024, # context window length required for OpenAI tokenizers # )
We can now instantiate our chunker:
In\u00a0[8]: Copied!chunker = HybridChunker(\n tokenizer=tokenizer,\n merge_peers=True, # optional, defaults to True\n)\nchunk_iter = chunker.chunk(dl_doc=doc)\nchunks = list(chunk_iter)\nchunker = HybridChunker( tokenizer=tokenizer, merge_peers=True, # optional, defaults to True ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter)
Points to notice when looking at the output chunks below:
for i, chunk in enumerate(chunks):\n print(f\"=== {i} ===\")\n txt_tokens = tokenizer.count_tokens(chunk.text)\n print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\")\n\n ser_txt = chunker.contextualize(chunk=chunk)\n ser_tokens = tokenizer.count_tokens(ser_txt)\n print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\")\n\n print()\n for i, chunk in enumerate(chunks): print(f\"=== {i} ===\") txt_tokens = tokenizer.count_tokens(chunk.text) print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\") ser_txt = chunker.contextualize(chunk=chunk) ser_tokens = tokenizer.count_tokens(ser_txt) print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\") print() === 0 ===\nchunk.text (55 tokens):\n'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\nchunker.contextualize(chunk) (56 tokens):\n'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\n\n=== 1 ===\nchunk.text (45 tokens):\n'IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\nchunker.contextualize(chunk) (46 tokens):\n'IBM\\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n\n=== 2 ===\nchunk.text (63 tokens):\n'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n\n=== 3 ===\nchunk.text (44 tokens):\n\"IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\nchunker.contextualize(chunk) (45 tokens):\n\"IBM\\nIBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n\n=== 4 ===\nchunk.text (63 tokens):\n'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, \u2014 its DOS software provided by Microsoft, \u2014 which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. 
Since the 1990s,'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, \u2014 its DOS software provided by Microsoft, \u2014 which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n\n=== 5 ===\nchunk.text (61 tokens):\n'IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\nchunker.contextualize(chunk) (62 tokens):\n'IBM\\nIBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n\n=== 6 ===\nchunk.text (62 tokens):\n\"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\nchunker.contextualize(chunk) (63 tokens):\n\"IBM\\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n\n=== 7 ===\nchunk.text (63 tokens):\n'language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nlanguage, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n\n=== 8 ===\nchunk.text (5 tokens):\n'Awards.[16]'\nchunker.contextualize(chunk) (6 tokens):\n'IBM\\nAwards.[16]'\n\n=== 9 ===\nchunk.text (56 tokens):\n'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\nchunker.contextualize(chunk) (60 tokens):\n'IBM\\n1910s\u20131950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. 
Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n\n=== 10 ===\nchunk.text (60 tokens):\n\"(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\nchunker.contextualize(chunk) (64 tokens):\n\"IBM\\n1910s\u20131950s\\n(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n\n=== 11 ===\nchunk.text (59 tokens):\n'Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\nchunker.contextualize(chunk) (63 tokens):\n'IBM\\n1910s\u20131950s\\nComputing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n\n=== 12 ===\nchunk.text (13 tokens):\n'D.C.; and Toronto, Canada.[22]'\nchunker.contextualize(chunk) (17 tokens):\n'IBM\\n1910s\u20131950s\\nD.C.; and Toronto, Canada.[22]'\n\n=== 13 ===\nchunk.text (60 tokens):\n'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\n1910s\u20131950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. 
Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n\n=== 14 ===\nchunk.text (59 tokens):\n\"on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\nchunker.contextualize(chunk) (63 tokens):\n\"IBM\\n1910s\u20131950s\\non Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n\n=== 15 ===\nchunk.text (23 tokens):\n\"practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\nchunker.contextualize(chunk) (27 tokens):\n\"IBM\\n1910s\u20131950s\\npractices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n\n=== 16 ===\nchunk.text (59 tokens):\n'He implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\nchunker.contextualize(chunk) (63 tokens):\n'IBM\\n1910s\u20131950s\\nHe implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n\n=== 17 ===\nchunk.text (60 tokens):\n'\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\n1910s\u20131950s\\n\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n\n=== 18 ===\nchunk.text (57 tokens):\n'clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\nchunker.contextualize(chunk) (61 tokens):\n'IBM\\n1910s\u20131950s\\nclumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n\n=== 19 ===\nchunk.text (21 tokens):\n'1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\nchunker.contextualize(chunk) (25 tokens):\n'IBM\\n1910s\u20131950s\\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n\n=== 20 ===\nchunk.text (22 tokens):\n'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\nchunker.contextualize(chunk) (26 tokens):\n'IBM\\n1960s\u20131980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the 
highly successful Selectric typewriter.'\n\n"},{"location":"examples/hybrid_chunking/#hybrid-chunking","title":"Hybrid chunking\u00b6","text":""},{"location":"examples/hybrid_chunking/#overview","title":"Overview\u00b6","text":""},{"location":"examples/hybrid_chunking/#setup","title":"Setup\u00b6","text":""},{"location":"examples/hybrid_chunking/#basic-usage","title":"Basic usage\u00b6","text":""},{"location":"examples/hybrid_chunking/#configuring-tokenization","title":"Configuring tokenization\u00b6","text":"
For more control over the chunking, we can parametrize tokenization as shown below.
In a RAG / retrieval context, it is important to make sure that the chunker and embedding model are using the same tokenizer.
\ud83d\udc49 HuggingFace transformers tokenizers can be used as shown in the following example:
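A minimal sketch, assuming a recent docling-core with the HuggingFaceTokenizer wrapper; the embedding model ID and the 64-token budget are illustrative choices (consistent with the token counts shown above), and `doc` is a previously converted DoclingDocument:
from transformers import AutoTokenizer

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"  # assumption: use your embedding model here
MAX_TOKENS = 64  # assumption: align with the embedding model's effective context budget

# Wrap the HF tokenizer so the chunker counts tokens exactly like the embedding model does.
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,
)

chunker = HybridChunker(tokenizer=tokenizer)
chunks = list(chunker.chunk(dl_doc=doc))  # `doc`: the DoclingDocument converted earlier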
"},{"location":"examples/inspect_picture_content/","title":"Inspect picture content","text":"Inspect the contents associated with each picture in a converted document.
What this example does
How to run
python docs/examples/inspect_picture_content.py.Notes
picture.get_image(doc).show() to visually inspect each picture.source to point to a different PDF if desired.from docling_core.types.doc import TextItem\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n# Change this to a local path if desired\nsource = \"tests/data/pdf/amt_handbook_sample.pdf\"\n\npipeline_options = PdfPipelineOptions()\n# Higher scale yields sharper crops when inspecting picture content.\npipeline_options.images_scale = 2\npipeline_options.generate_page_images = True\n\ndoc_converter = DocumentConverter(\n format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\n\nresult = doc_converter.convert(source)\n\ndoc = result.document\n\nfor picture in doc.pictures:\n # picture.get_image(doc).show() # display the picture\n print(picture.caption_text(doc), \" contains these elements:\")\n\n for item, level in doc.iterate_items(root=picture, traverse_pictures=True):\n if isinstance(item, TextItem):\n print(item.text)\n\n print(\"\\n\")\n from docling_core.types.doc import TextItem from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption # Change this to a local path if desired source = \"tests/data/pdf/amt_handbook_sample.pdf\" pipeline_options = PdfPipelineOptions() # Higher scale yields sharper crops when inspecting picture content. pipeline_options.images_scale = 2 pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)} ) result = doc_converter.convert(source) doc = result.document for picture in doc.pictures: # picture.get_image(doc).show() # display the picture print(picture.caption_text(doc), \" contains these elements:\") for item, level in doc.iterate_items(root=picture, traverse_pictures=True): if isinstance(item, TextItem): print(item.text) print(\"\\n\")"},{"location":"examples/minimal/","title":"Simple conversion","text":"What this example does
Requirements
pip install doclingHow to run
python docs/examples/minimal.pysource variable below.Notes
docs/examples/batch_convert.py.from docling.document_converter import DocumentConverter\n\n# Change this to a local path or another URL if desired.\n# Note: using the default URL requires network access; if offline, provide a\n# local file path (e.g., Path(\"/path/to/file.pdf\")).\nsource = \"https://arxiv.org/pdf/2408.09869\"\n\nconverter = DocumentConverter()\nresult = converter.convert(source)\n\n# Print Markdown to stdout.\nprint(result.document.export_to_markdown())\nfrom docling.document_converter import DocumentConverter # Change this to a local path or another URL if desired. # Note: using the default URL requires network access; if offline, provide a # local file path (e.g., Path(\"/path/to/file.pdf\")). source = \"https://arxiv.org/pdf/2408.09869\" converter = DocumentConverter() result = converter.convert(source) # Print Markdown to stdout. print(result.document.export_to_markdown())"},{"location":"examples/minimal_asr_pipeline/","title":"ASR pipeline with Whisper","text":"
Minimal ASR pipeline example: transcribe an audio file to Markdown text.
What this example does
Prerequisites
How to run
python docs/examples/minimal_asr_pipeline.py.Customizing the model
get_asr_converter() to manually override pipeline_options.asr_options with any model from asr_model_specs.InputFormat.AUDIO and AsrPipeline unchanged for a minimal setup.Input audio
tests/data/audio/sample_10s.mp3. Update audio_path to your own file if needed.from pathlib import Path\n\nfrom docling_core.types.doc import DoclingDocument\n\nfrom docling.datamodel import asr_model_specs\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import AsrPipelineOptions\nfrom docling.document_converter import AudioFormatOption, DocumentConverter\nfrom docling.pipeline.asr_pipeline import AsrPipeline\n\n\ndef get_asr_converter():\n \"\"\"Create a DocumentConverter configured for ASR with automatic model selection.\n\n Uses `asr_model_specs.WHISPER_TURBO` which automatically selects the best\n implementation for your hardware:\n - MLX Whisper Turbo for Apple Silicon (M1/M2/M3) with mlx-whisper installed\n - Native Whisper Turbo as fallback\n\n You can swap in another model spec from `docling.datamodel.asr_model_specs`\n to experiment with different model sizes.\n \"\"\"\n pipeline_options = AsrPipelineOptions()\n pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO\n\n converter = DocumentConverter(\n format_options={\n InputFormat.AUDIO: AudioFormatOption(\n pipeline_cls=AsrPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n return converter\n\n\ndef asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:\n \"\"\"Run the ASR pipeline and return a `DoclingDocument` transcript.\"\"\"\n # Check if the test audio file exists\n assert audio_path.exists(), f\"Test audio file not found: {audio_path}\"\n\n converter = get_asr_converter()\n\n # Convert the audio file\n result: ConversionResult = converter.convert(audio_path)\n\n # Verify conversion was successful\n assert result.status == ConversionStatus.SUCCESS, (\n f\"Conversion failed with status: {result.status}\"\n )\n return result.document\n\n\nif __name__ == \"__main__\":\n audio_path = Path(\"tests/data/audio/sample_10s.mp3\")\n\n doc = asr_pipeline_conversion(audio_path=audio_path)\n print(doc.export_to_markdown())\n\n # Expected output:\n #\n # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde\n #\n # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.\n from pathlib import Path from docling_core.types.doc import DoclingDocument from docling.datamodel import asr_model_specs from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import AsrPipelineOptions from docling.document_converter import AudioFormatOption, DocumentConverter from docling.pipeline.asr_pipeline import AsrPipeline def get_asr_converter(): \"\"\"Create a DocumentConverter configured for ASR with automatic model selection. Uses `asr_model_specs.WHISPER_TURBO` which automatically selects the best implementation for your hardware: - MLX Whisper Turbo for Apple Silicon (M1/M2/M3) with mlx-whisper installed - Native Whisper Turbo as fallback You can swap in another model spec from `docling.datamodel.asr_model_specs` to experiment with different model sizes. 
\"\"\" pipeline_options = AsrPipelineOptions() pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO converter = DocumentConverter( format_options={ InputFormat.AUDIO: AudioFormatOption( pipeline_cls=AsrPipeline, pipeline_options=pipeline_options, ) } ) return converter def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument: \"\"\"Run the ASR pipeline and return a `DoclingDocument` transcript.\"\"\" # Check if the test audio file exists assert audio_path.exists(), f\"Test audio file not found: {audio_path}\" converter = get_asr_converter() # Convert the audio file result: ConversionResult = converter.convert(audio_path) # Verify conversion was successful assert result.status == ConversionStatus.SUCCESS, ( f\"Conversion failed with status: {result.status}\" ) return result.document if __name__ == \"__main__\": audio_path = Path(\"tests/data/audio/sample_10s.mp3\") doc = asr_pipeline_conversion(audio_path=audio_path) print(doc.export_to_markdown()) # Expected output: # # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde # # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain."},{"location":"examples/minimal_vlm_pipeline/","title":"VLM pipeline with GraniteDocling","text":"Minimal VLM pipeline example: convert a PDF using a vision-language model.
What this example does
Prerequisites
How to run
python docs/examples/minimal_vlm_pipeline.py.Notes
source may be a local path or a URL to a PDF.vlm_model_specs.GRANITEDOCLING_MLX).docs/examples/compare_vlm_models.py.from docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\n# Convert a public arXiv PDF; replace with a local path if preferred.\nsource = \"https://arxiv.org/pdf/2501.17887\"\n\n###### USING SIMPLE DEFAULT VALUES\n# - GraniteDocling model\n# - Using the transformers framework\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n ),\n }\n)\n\ndoc = converter.convert(source=source).document\n\nprint(doc.export_to_markdown())\n\n\n###### USING MACOS MPS ACCELERATOR\n# Demonstrates using MLX on macOS with MPS acceleration (macOS only).\n# For more options see the `compare_vlm_models.py` example.\n\npipeline_options = VlmPipelineOptions(\n vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,\n)\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n\ndoc = converter.convert(source=source).document\n\nprint(doc.export_to_markdown())\n from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline # Convert a public arXiv PDF; replace with a local path if preferred. source = \"https://arxiv.org/pdf/2501.17887\" ###### USING SIMPLE DEFAULT VALUES # - GraniteDocling model # - Using the transformers framework converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, ), } ) doc = converter.convert(source=source).document print(doc.export_to_markdown()) ###### USING MACOS MPS ACCELERATOR # Demonstrates using MLX on macOS with MPS acceleration (macOS only). # For more options see the `compare_vlm_models.py` example. pipeline_options = VlmPipelineOptions( vlm_options=vlm_model_specs.GRANITEDOCLING_MLX, ) converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), } ) doc = converter.convert(source=source).document print(doc.export_to_markdown())"},{"location":"examples/mlx_whisper_example/","title":"Mlx whisper example","text":"In\u00a0[\u00a0]: Copied! \"\"\"\nExample script demonstrating MLX Whisper integration for Apple Silicon.\n\nThis script shows how to use the MLX Whisper models for speech recognition\non Apple Silicon devices with optimized performance.\n\"\"\"\n\"\"\" Example script demonstrating MLX Whisper integration for Apple Silicon. This script shows how to use the MLX Whisper models for speech recognition on Apple Silicon devices with optimized performance. \"\"\" In\u00a0[\u00a0]: Copied!
import argparse\nimport sys\nfrom pathlib import Path\nimport argparse import sys from pathlib import Path In\u00a0[\u00a0]: Copied!
# Add the repository root to the path so we can import docling\nsys.path.insert(0, str(Path(__file__).parent.parent.parent))\n# Add the repository root to the path so we can import docling sys.path.insert(0, str(Path(__file__).parent.parent.parent)) In\u00a0[\u00a0]: Copied!
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.asr_model_specs import (\n WHISPER_BASE,\n WHISPER_LARGE,\n WHISPER_MEDIUM,\n WHISPER_SMALL,\n WHISPER_TINY,\n WHISPER_TURBO,\n)\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import AsrPipelineOptions\nfrom docling.document_converter import AudioFormatOption, DocumentConverter\nfrom docling.pipeline.asr_pipeline import AsrPipeline\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.asr_model_specs import ( WHISPER_BASE, WHISPER_LARGE, WHISPER_MEDIUM, WHISPER_SMALL, WHISPER_TINY, WHISPER_TURBO, ) from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import AsrPipelineOptions from docling.document_converter import AudioFormatOption, DocumentConverter from docling.pipeline.asr_pipeline import AsrPipeline In\u00a0[\u00a0]: Copied!
def transcribe_audio_with_mlx_whisper(audio_file_path: str, model_size: str = \"base\"):\n \"\"\"\n Transcribe audio using Whisper models with automatic MLX optimization for Apple Silicon.\n\n Args:\n audio_file_path: Path to the audio file to transcribe\n model_size: Size of the Whisper model to use\n (\"tiny\", \"base\", \"small\", \"medium\", \"large\", \"turbo\")\n Note: MLX optimization is automatically used on Apple Silicon when available\n\n Returns:\n The transcribed text\n \"\"\"\n # Select the appropriate Whisper model (automatically uses MLX on Apple Silicon)\n model_map = {\n \"tiny\": WHISPER_TINY,\n \"base\": WHISPER_BASE,\n \"small\": WHISPER_SMALL,\n \"medium\": WHISPER_MEDIUM,\n \"large\": WHISPER_LARGE,\n \"turbo\": WHISPER_TURBO,\n }\n\n if model_size not in model_map:\n raise ValueError(\n f\"Invalid model size: {model_size}. Choose from: {list(model_map.keys())}\"\n )\n\n asr_options = model_map[model_size]\n\n # Configure accelerator options for Apple Silicon\n accelerator_options = AcceleratorOptions(device=AcceleratorDevice.MPS)\n\n # Create pipeline options\n pipeline_options = AsrPipelineOptions(\n asr_options=asr_options,\n accelerator_options=accelerator_options,\n )\n\n # Create document converter with MLX Whisper configuration\n converter = DocumentConverter(\n format_options={\n InputFormat.AUDIO: AudioFormatOption(\n pipeline_cls=AsrPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n\n # Run transcription\n result = converter.convert(Path(audio_file_path))\n\n if result.status.value == \"success\":\n # Extract text from the document\n text_content = []\n for item in result.document.texts:\n text_content.append(item.text)\n\n return \"\\n\".join(text_content)\n else:\n raise RuntimeError(f\"Transcription failed: {result.status}\")\n def transcribe_audio_with_mlx_whisper(audio_file_path: str, model_size: str = \"base\"): \"\"\" Transcribe audio using Whisper models with automatic MLX optimization for Apple Silicon. Args: audio_file_path: Path to the audio file to transcribe model_size: Size of the Whisper model to use (\"tiny\", \"base\", \"small\", \"medium\", \"large\", \"turbo\") Note: MLX optimization is automatically used on Apple Silicon when available Returns: The transcribed text \"\"\" # Select the appropriate Whisper model (automatically uses MLX on Apple Silicon) model_map = { \"tiny\": WHISPER_TINY, \"base\": WHISPER_BASE, \"small\": WHISPER_SMALL, \"medium\": WHISPER_MEDIUM, \"large\": WHISPER_LARGE, \"turbo\": WHISPER_TURBO, } if model_size not in model_map: raise ValueError( f\"Invalid model size: {model_size}. Choose from: {list(model_map.keys())}\" ) asr_options = model_map[model_size] # Configure accelerator options for Apple Silicon accelerator_options = AcceleratorOptions(device=AcceleratorDevice.MPS) # Create pipeline options pipeline_options = AsrPipelineOptions( asr_options=asr_options, accelerator_options=accelerator_options, ) # Create document converter with MLX Whisper configuration converter = DocumentConverter( format_options={ InputFormat.AUDIO: AudioFormatOption( pipeline_cls=AsrPipeline, pipeline_options=pipeline_options, ) } ) # Run transcription result = converter.convert(Path(audio_file_path)) if result.status.value == \"success\": # Extract text from the document text_content = [] for item in result.document.texts: text_content.append(item.text) return \"\\n\".join(text_content) else: raise RuntimeError(f\"Transcription failed: {result.status}\") In\u00a0[\u00a0]: Copied! 
def parse_args():\n \"\"\"Parse command line arguments.\"\"\"\n parser = argparse.ArgumentParser(\n description=\"MLX Whisper example for Apple Silicon speech recognition\",\n formatter_class=argparse.RawDescriptionHelpFormatter,\n epilog=\"\"\"\nExamples:\n\n# Use default test audio file\npython mlx_whisper_example.py\n\n# Use your own audio file\npython mlx_whisper_example.py --audio /path/to/your/audio.mp3\n\n# Use specific model size\npython mlx_whisper_example.py --audio audio.wav --model tiny\n\n# Use default test file with specific model\npython mlx_whisper_example.py --model turbo\n \"\"\",\n )\n\n parser.add_argument(\n \"--audio\",\n type=str,\n help=\"Path to audio file for transcription (default: tests/data/audio/sample_10s.mp3)\",\n )\n\n parser.add_argument(\n \"--model\",\n type=str,\n choices=[\"tiny\", \"base\", \"small\", \"medium\", \"large\", \"turbo\"],\n default=\"base\",\n help=\"Whisper model size to use (default: base)\",\n )\n\n return parser.parse_args()\ndef parse_args(): \"\"\"Parse command line arguments.\"\"\" parser = argparse.ArgumentParser( description=\"MLX Whisper example for Apple Silicon speech recognition\", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=\"\"\" Examples: # Use default test audio file python mlx_whisper_example.py # Use your own audio file python mlx_whisper_example.py --audio /path/to/your/audio.mp3 # Use specific model size python mlx_whisper_example.py --audio audio.wav --model tiny # Use default test file with specific model python mlx_whisper_example.py --model turbo \"\"\", ) parser.add_argument( \"--audio\", type=str, help=\"Path to audio file for transcription (default: tests/data/audio/sample_10s.mp3)\", ) parser.add_argument( \"--model\", type=str, choices=[\"tiny\", \"base\", \"small\", \"medium\", \"large\", \"turbo\"], default=\"base\", help=\"Whisper model size to use (default: base)\", ) return parser.parse_args() In\u00a0[\u00a0]: Copied!
def main():\n \"\"\"Main function to demonstrate MLX Whisper usage.\"\"\"\n args = parse_args()\n\n # Determine audio file path\n if args.audio:\n audio_file_path = args.audio\n else:\n # Use default test audio file if no audio file specified\n default_audio = (\n Path(__file__).parent.parent.parent\n / \"tests\"\n / \"data\"\n / \"audio\"\n / \"sample_10s.mp3\"\n )\n if default_audio.exists():\n audio_file_path = str(default_audio)\n print(\"No audio file specified, using default test file:\")\n print(f\" Audio file: {audio_file_path}\")\n print(f\" Model size: {args.model}\")\n print()\n else:\n print(\"Error: No audio file specified and default test file not found.\")\n print(\n \"Please specify an audio file with --audio or ensure tests/data/audio/sample_10s.mp3 exists.\"\n )\n sys.exit(1)\n\n if not Path(audio_file_path).exists():\n print(f\"Error: Audio file '{audio_file_path}' not found.\")\n sys.exit(1)\n\n try:\n print(f\"Transcribing '{audio_file_path}' using Whisper {args.model} model...\")\n print(\n \"Note: MLX optimization is automatically used on Apple Silicon when available.\"\n )\n print()\n\n transcribed_text = transcribe_audio_with_mlx_whisper(\n audio_file_path, args.model\n )\n\n print(\"Transcription Result:\")\n print(\"=\" * 50)\n print(transcribed_text)\n print(\"=\" * 50)\n\n except ImportError as e:\n print(f\"Error: {e}\")\n print(\"Please install mlx-whisper: pip install mlx-whisper\")\n print(\"Or install with uv: uv sync --extra asr\")\n sys.exit(1)\n except Exception as e:\n print(f\"Error during transcription: {e}\")\n sys.exit(1)\n def main(): \"\"\"Main function to demonstrate MLX Whisper usage.\"\"\" args = parse_args() # Determine audio file path if args.audio: audio_file_path = args.audio else: # Use default test audio file if no audio file specified default_audio = ( Path(__file__).parent.parent.parent / \"tests\" / \"data\" / \"audio\" / \"sample_10s.mp3\" ) if default_audio.exists(): audio_file_path = str(default_audio) print(\"No audio file specified, using default test file:\") print(f\" Audio file: {audio_file_path}\") print(f\" Model size: {args.model}\") print() else: print(\"Error: No audio file specified and default test file not found.\") print( \"Please specify an audio file with --audio or ensure tests/data/audio/sample_10s.mp3 exists.\" ) sys.exit(1) if not Path(audio_file_path).exists(): print(f\"Error: Audio file '{audio_file_path}' not found.\") sys.exit(1) try: print(f\"Transcribing '{audio_file_path}' using Whisper {args.model} model...\") print( \"Note: MLX optimization is automatically used on Apple Silicon when available.\" ) print() transcribed_text = transcribe_audio_with_mlx_whisper( audio_file_path, args.model ) print(\"Transcription Result:\") print(\"=\" * 50) print(transcribed_text) print(\"=\" * 50) except ImportError as e: print(f\"Error: {e}\") print(\"Please install mlx-whisper: pip install mlx-whisper\") print(\"Or install with uv: uv sync --extra asr\") sys.exit(1) except Exception as e: print(f\"Error during transcription: {e}\") sys.exit(1) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/parquet_images/","title":"Parquet benchmark","text":"
What this example does
Requirements
pip install doclingHow to run
python docs/examples/parquet_images.py FILEThe parquet file should be in a format similar to the ViDoRe V3 dataset. https://huggingface.co/collections/vidore/vidore-benchmark-v3
For example:
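A hypothetical invocation, assuming a local parquet slice at the path the script also uses as its default argument:
python docs/examples/parquet_images.py docs/examples/data/vidore_v3_hr-slice.parquet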
import io\nimport time\nfrom pathlib import Path\nfrom typing import Annotated, Literal\n\nimport pyarrow.parquet as pq\nimport typer\nfrom PIL import Image\n\nfrom docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import ConversionStatus, DocumentStream, InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PipelineOptions,\n RapidOcrOptions,\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.datamodel.settings import settings\nfrom docling.document_converter import DocumentConverter, ImageFormatOption\nfrom docling.pipeline.base_pipeline import ConvertPipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\n\ndef process_document(\n images: list[Image.Image], chunk_idx: int, doc_converter: DocumentConverter\n):\n \"\"\"Builds a tall image and sends it through Docling.\"\"\"\n\n print(f\"\\n--- Processing chunk {chunk_idx} with {len(images)} images ---\")\n\n # Convert images to mode RGB (TIFF pages must match)\n rgb_images = [im.convert(\"RGB\") for im in images]\n\n # First image is the base frame\n first = rgb_images[0]\n rest = rgb_images[1:]\n\n # Create multi-page TIFF using PIL frames\n buf = io.BytesIO()\n first.save(\n buf,\n format=\"TIFF\",\n save_all=True,\n append_images=rest,\n compression=\"tiff_deflate\", # good compression, optional\n )\n buf.seek(0)\n\n # Docling conversion\n doc_stream = DocumentStream(name=f\"doc_{chunk_idx}.tiff\", stream=buf)\n\n start_time = time.time()\n conv_result = doc_converter.convert(doc_stream)\n runtime = time.time() - start_time\n\n assert conv_result.status == ConversionStatus.SUCCESS\n\n pages = len(conv_result.pages)\n print(\n f\"Chunk {chunk_idx} converted in {runtime:.2f} sec ({pages / runtime:.2f} pages/s).\"\n )\n\n\ndef run(\n filename: Annotated[Path, typer.Argument()] = Path(\n \"docs/examples/data/vidore_v3_hr-slice.parquet\"\n ),\n doc_size: int = 192,\n batch_size: int = 64,\n pipeline: Literal[\"standard\", \"vlm\"] = \"standard\",\n):\n if pipeline == \"standard\":\n pipeline_cls: type[ConvertPipeline] = StandardPdfPipeline\n pipeline_options: PipelineOptions = PdfPipelineOptions(\n # ocr_options=RapidOcrOptions(backend=\"openvino\"),\n ocr_batch_size=batch_size,\n layout_batch_size=batch_size,\n table_batch_size=4,\n )\n elif pipeline == \"vlm\":\n settings.perf.page_batch_size = batch_size\n pipeline_cls = VlmPipeline\n vlm_options = ApiVlmOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id,\n max_tokens=4096,\n skip_special_tokens=True,\n ),\n prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt,\n timeout=90,\n scale=1.0,\n temperature=0.0,\n concurrency=batch_size,\n stop_strings=[\"</doctag>\", \"<|end_of_text|>\"],\n response_format=ResponseFormat.DOCTAGS,\n )\n pipeline_options = VlmPipelineOptions(\n vlm_options=vlm_options,\n enable_remote_services=True, # required when using a remote inference service.\n )\n else:\n raise RuntimeError(f\"Pipeline {pipeline} not available.\")\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.IMAGE: ImageFormatOption(\n pipeline_cls=pipeline_cls,\n pipeline_options=pipeline_options,\n )\n }\n )\n\n start_time = time.time()\n doc_converter.initialize_pipeline(InputFormat.IMAGE)\n init_runtime = time.time() - start_time\n print(f\"Pipeline initialized in 
{init_runtime:.2f} seconds.\")\n\n # ------------------------------------------------------------\n # Open parquet file in streaming mode\n # ------------------------------------------------------------\n pf = pq.ParquetFile(filename)\n\n image_buffer = [] # holds up to doc_size images\n chunk_idx = 0\n\n # ------------------------------------------------------------\n # Stream batches from parquet\n # ------------------------------------------------------------\n for batch in pf.iter_batches(batch_size=batch_size, columns=[\"image\"]):\n col = batch.column(\"image\")\n\n # Extract Python objects (PIL images)\n # Arrow stores them as Python objects inside an ObjectArray\n for i in range(len(col)):\n img_dict = col[i].as_py() # {\"bytes\": ..., \"path\": ...}\n pil_image = Image.open(io.BytesIO(img_dict[\"bytes\"]))\n image_buffer.append(pil_image)\n\n # If enough images gathered \u2192 process one doc\n if len(image_buffer) == doc_size:\n process_document(image_buffer, chunk_idx, doc_converter)\n image_buffer.clear()\n chunk_idx += 1\n\n # ------------------------------------------------------------\n # Process trailing images (last partial chunk)\n # ------------------------------------------------------------\n if image_buffer:\n process_document(image_buffer, chunk_idx, doc_converter)\n\n\nif __name__ == \"__main__\":\n typer.run(run)\n import io import time from pathlib import Path from typing import Annotated, Literal import pyarrow.parquet as pq import typer from PIL import Image from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import ConversionStatus, DocumentStream, InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PipelineOptions, RapidOcrOptions, VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.datamodel.settings import settings from docling.document_converter import DocumentConverter, ImageFormatOption from docling.pipeline.base_pipeline import ConvertPipeline from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline from docling.pipeline.vlm_pipeline import VlmPipeline def process_document( images: list[Image.Image], chunk_idx: int, doc_converter: DocumentConverter ): \"\"\"Builds a tall image and sends it through Docling.\"\"\" print(f\"\\n--- Processing chunk {chunk_idx} with {len(images)} images ---\") # Convert images to mode RGB (TIFF pages must match) rgb_images = [im.convert(\"RGB\") for im in images] # First image is the base frame first = rgb_images[0] rest = rgb_images[1:] # Create multi-page TIFF using PIL frames buf = io.BytesIO() first.save( buf, format=\"TIFF\", save_all=True, append_images=rest, compression=\"tiff_deflate\", # good compression, optional ) buf.seek(0) # Docling conversion doc_stream = DocumentStream(name=f\"doc_{chunk_idx}.tiff\", stream=buf) start_time = time.time() conv_result = doc_converter.convert(doc_stream) runtime = time.time() - start_time assert conv_result.status == ConversionStatus.SUCCESS pages = len(conv_result.pages) print( f\"Chunk {chunk_idx} converted in {runtime:.2f} sec ({pages / runtime:.2f} pages/s).\" ) def run( filename: Annotated[Path, typer.Argument()] = Path( \"docs/examples/data/vidore_v3_hr-slice.parquet\" ), doc_size: int = 192, batch_size: int = 64, pipeline: Literal[\"standard\", \"vlm\"] = \"standard\", ): if pipeline == \"standard\": pipeline_cls: type[ConvertPipeline] = StandardPdfPipeline pipeline_options: PipelineOptions = PdfPipelineOptions( # 
ocr_options=RapidOcrOptions(backend=\"openvino\"), ocr_batch_size=batch_size, layout_batch_size=batch_size, table_batch_size=4, ) elif pipeline == \"vlm\": settings.perf.page_batch_size = batch_size pipeline_cls = VlmPipeline vlm_options = ApiVlmOptions( url=\"http://localhost:8000/v1/chat/completions\", params=dict( model=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.repo_id, max_tokens=4096, skip_special_tokens=True, ), prompt=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS.prompt, timeout=90, scale=1.0, temperature=0.0, concurrency=batch_size, stop_strings=[\"\", \"<|end_of_text|>\"], response_format=ResponseFormat.DOCTAGS, ) pipeline_options = VlmPipelineOptions( vlm_options=vlm_options, enable_remote_services=True, # required when using a remote inference service. ) else: raise RuntimeError(f\"Pipeline {pipeline} not available.\") doc_converter = DocumentConverter( format_options={ InputFormat.IMAGE: ImageFormatOption( pipeline_cls=pipeline_cls, pipeline_options=pipeline_options, ) } ) start_time = time.time() doc_converter.initialize_pipeline(InputFormat.IMAGE) init_runtime = time.time() - start_time print(f\"Pipeline initialized in {init_runtime:.2f} seconds.\") # ------------------------------------------------------------ # Open parquet file in streaming mode # ------------------------------------------------------------ pf = pq.ParquetFile(filename) image_buffer = [] # holds up to doc_size images chunk_idx = 0 # ------------------------------------------------------------ # Stream batches from parquet # ------------------------------------------------------------ for batch in pf.iter_batches(batch_size=batch_size, columns=[\"image\"]): col = batch.column(\"image\") # Extract Python objects (PIL images) # Arrow stores them as Python objects inside an ObjectArray for i in range(len(col)): img_dict = col[i].as_py() # {\"bytes\": ..., \"path\": ...} pil_image = Image.open(io.BytesIO(img_dict[\"bytes\"])) image_buffer.append(pil_image) # If enough images gathered \u2192 process one doc if len(image_buffer) == doc_size: process_document(image_buffer, chunk_idx, doc_converter) image_buffer.clear() chunk_idx += 1 # ------------------------------------------------------------ # Process trailing images (last partial chunk) # ------------------------------------------------------------ if image_buffer: process_document(image_buffer, chunk_idx, doc_converter) if __name__ == \"__main__\": typer.run(run)"},{"location":"examples/parquet_images/#start-models-with-vllm","title":"Start models with vllm\u00b6","text":"vllm serve ibm-granite/granite-docling-258M \\\n --host 127.0.0.1 --port 8000 \\\n --max-num-seqs 512 \\\n --max-num-batched-tokens 8192 \\\n --enable-chunked-prefill \\\n --gpu-memory-utilization 0.9\n"},{"location":"examples/pictures_description/","title":"Annotate picture with local VLM","text":"In\u00a0[\u00a0]: Copied!
%pip install -q docling[vlm] ipython\n%pip install -q docling[vlm] ipython
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[1]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[2]: Copied!
# The source document\nDOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\"\n# The source document DOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\" In\u00a0[3]: Copied!
from docling.datamodel.pipeline_options import granite_picture_description\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = (\n granite_picture_description # <-- the model choice\n)\npipeline_options.picture_description_options.prompt = (\n \"Describe the image in three sentences. Be concise and accurate.\"\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\ndoc = converter.convert(DOC_SOURCE).document\n from docling.datamodel.pipeline_options import granite_picture_description pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = ( granite_picture_description # <-- the model choice ) pipeline_options.picture_description_options.prompt = ( \"Describe the image in three sentences. Be concise and accurate.\" ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(DOC_SOURCE).document Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.\n
In\u00a0[4]: Copied!
from docling_core.types.doc.document import PictureDescriptionData\nfrom IPython import display\n\nhtml_buffer = []\n# display the first 5 pictures and their captions and annotations:\nfor pic in doc.pictures[:5]:\n html_item = (\n f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n f'<img src=\"{pic.image.uri!s}\" /><br />'\n f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n )\n for annotation in pic.annotations:\n if not isinstance(annotation, PictureDescriptionData):\n continue\n html_item += (\n f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n )\n html_buffer.append(html_item)\ndisplay.HTML(\"<hr />\".join(html_buffer))\n from docling_core.types.doc.document import PictureDescriptionData from IPython import display html_buffer = [] # display the first 5 pictures and their captions and annotations: for pic in doc.pictures[:5]: html_item = ( f\"Picture {pic.self_ref}\" f'' f\"Caption{pic.caption_text(doc=doc)}\" ) for annotation in pic.annotations: if not isinstance(annotation, PictureDescriptionData): continue html_item += ( f\"Annotations ({annotation.provenance}){annotation.text}\\n\" ) html_buffer.append(html_item) display.HTML(\"\".join(html_buffer)) Out[4]: Picture #/pictures/0CaptionFigure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a poster with some text and images. Picture #/pictures/1CaptionFigure 2: Dataset categories and sample counts for documents and pages.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a pie chart. In the pie chart we can see the categories and the number of documents in each category. Picture #/pictures/2CaptionFigure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a graph. On the x-axis we can see the number of pages. On the y-axis we can see the seconds. Picture #/pictures/3CaptionFigure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a bar chart and a line chart. In the bar chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. In the line chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. Picture #/pictures/4CaptionFigure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. 
The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a bar chart. In the chart we can see the CPU, Max, GPU, and sec/page. In\u00a0[7]: Copied! from docling.datamodel.pipeline_options import smolvlm_picture_description\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = (\n smolvlm_picture_description # <-- the model choice\n)\npipeline_options.picture_description_options.prompt = (\n \"Describe the image in three sentences. Be consise and accurate.\"\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\ndoc = converter.convert(DOC_SOURCE).document\n from docling.datamodel.pipeline_options import smolvlm_picture_description pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = ( smolvlm_picture_description # <-- the model choice ) pipeline_options.picture_description_options.prompt = ( \"Describe the image in three sentences. Be consise and accurate.\" ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(DOC_SOURCE).document In\u00a0[6]: Copied! from docling_core.types.doc.document import PictureDescriptionData\nfrom IPython import display\n\nhtml_buffer = []\n# display the first 5 pictures and their captions and annotations:\nfor pic in doc.pictures[:5]:\n html_item = (\n f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n f'<img src=\"{pic.image.uri!s}\" /><br />'\n f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n )\n for annotation in pic.annotations:\n if not isinstance(annotation, PictureDescriptionData):\n continue\n html_item += (\n f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n )\n html_buffer.append(html_item)\ndisplay.HTML(\"<hr />\".join(html_buffer))\n from docling_core.types.doc.document import PictureDescriptionData from IPython import display html_buffer = [] # display the first 5 pictures and their captions and annotations: for pic in doc.pictures[:5]: html_item = ( f\"Picture {pic.self_ref}\" f'' f\"Caption{pic.caption_text(doc=doc)}\" ) for annotation in pic.annotations: if not isinstance(annotation, PictureDescriptionData): continue html_item += ( f\"Annotations ({annotation.provenance}){annotation.text}\\n\" ) html_buffer.append(html_item) display.HTML(\"\".join(html_buffer)) Out[6]: Picture #/pictures/0CaptionFigure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)This is a page that has different types of documents on it. 
Picture #/pictures/1CaptionFigure 2: Dataset categories and sample counts for documents and pages.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)Here is a page-by-page list of documents per category: - Science - Articles - Law and Regulations - Articles - Misc. Picture #/pictures/2CaptionFigure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)The image is a bar chart that shows the number of pages of a website as a function of the number of pages of the website. The x-axis represents the number of pages, ranging from 100 to 10,000. The y-axis represents the number of pages, ranging from 100 to 10,000. The chart is labeled \"Number of pages\" and has a legend at the top of the chart that indicates the number of pages. The chart shows a clear trend: as the number of pages increases, the number of pages decreases. This is evident from the following points: - The number of pages increases from 100 to 1000. - The number of pages decreases from 1000 to 10,000. - The number of pages increases from 10,000 to 10,000. Picture #/pictures/3CaptionFigure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)bar chart with different colored bars representing different data points. Picture #/pictures/4CaptionFigure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)A bar chart with the following information: - The x-axis represents the number of pages, ranging from 0 to 14. - The y-axis represents the page count, ranging from 0 to 14. - The chart has three categories: Marker, Unstructured, and Detailed. - The x-axis is labeled \"see/page.\" - The y-axis is labeled \"Page Count.\" - The chart shows that the Marker category has the highest number of pages, followed by the Unstructured category, and then the Detailed category. In\u00a0[8]: Copied! from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM\n prompt=\"Describe the image in three sentences. 
Be concise and accurate.\",\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\n\n# Uncomment to run:\n# doc = converter.convert(DOC_SOURCE).document\n from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = PictureDescriptionVlmOptions( repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM prompt=\"Describe the image in three sentences. Be concise and accurate.\", ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) # Uncomment to run: # doc = converter.convert(DOC_SOURCE).document In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/pictures_description/#describe-pictures-with-granite-vision","title":"Describe pictures with Granite Vision\u00b6","text":"
This section runs the ibm-granite/granite-vision-3.1-2b-preview model locally to describe the pictures in the document.
"},{"location":"examples/pictures_description/#describe-pictures-with-smolvlm","title":"Describe pictures with SmolVLM\u00b6","text":"This section will run locally the HuggingFaceTB/SmolVLM-256M-Instruct model to describe the pictures of the document.
"},{"location":"examples/pictures_description/#use-other-vision-models","title":"Use other vision models\u00b6","text":"The examples above can also be reproduced using other vision model. The Docling options PictureDescriptionVlmOptions allows to specify your favorite vision model from the Hugging Face Hub.
Describe pictures using a remote VLM API (vLLM, LM Studio, or watsonx.ai).
What this example does
PictureDescriptionApiOptions for local or cloud providers.Prerequisites
python-dotenv if loading env vars from a .env file.WX_API_KEY and WX_PROJECT_ID in the environment.How to run
python docs/examples/pictures_description_api.py.enable_remote_services=True (already set).Notes
http://localhost:8000/v1/chat/completions.http://localhost:1234/v1/chat/completions.import logging\nimport os\nfrom pathlib import Path\n\nimport requests\nfrom docling_core.types.doc import PictureItem\nfrom dotenv import load_dotenv\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionApiOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n### Example of PictureDescriptionApiOptions definitions\n\n#### Using vLLM\n# Models can be launched via:\n# $ vllm serve MODEL_NAME\n\n\ndef vllm_local_options(model: str):\n options = PictureDescriptionApiOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=model,\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=90,\n )\n return options\n\n\n#### Using LM Studio\n\n\ndef lms_local_options(model: str):\n options = PictureDescriptionApiOptions(\n url=\"http://localhost:1234/v1/chat/completions\",\n params=dict(\n model=model,\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=90,\n )\n return options\n\n\n#### Using a cloud service like IBM watsonx.ai\n\n\ndef watsonx_vlm_options():\n load_dotenv()\n api_key = os.environ.get(\"WX_API_KEY\")\n project_id = os.environ.get(\"WX_PROJECT_ID\")\n\n def _get_iam_access_token(api_key: str) -> str:\n res = requests.post(\n url=\"https://iam.cloud.ibm.com/identity/token\",\n headers={\n \"Content-Type\": \"application/x-www-form-urlencoded\",\n },\n data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\",\n )\n res.raise_for_status()\n api_out = res.json()\n print(f\"{api_out=}\")\n return api_out[\"access_token\"]\n\n # Background information in case the model_id is updated:\n # [1] Official list of models: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx\n # [2] Info on granite vision 3.3: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-ibm.html?context=wx#granite-vision-3-3-2b\n\n options = PictureDescriptionApiOptions(\n url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\",\n params=dict(\n model_id=\"ibm/granite-vision-3-3-2b\",\n project_id=project_id,\n parameters=dict(\n max_new_tokens=400,\n ),\n ),\n headers={\n \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key),\n },\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=60,\n )\n return options\n\n\n### Usage and conversion\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = PdfPipelineOptions(\n enable_remote_services=True # <-- this is required!\n )\n pipeline_options.do_picture_description = True\n\n # The PictureDescriptionApiOptions() allows to interface with APIs supporting\n # the multi-modal chat interface. Here follow a few example on how to configure those.\n #\n # One possibility is self-hosting model, e.g. 
via VLLM.\n # $ vllm serve MODEL_NAME\n # Then PictureDescriptionApiOptions can point to the localhost endpoint.\n\n # Example for the Granite Vision model:\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = vllm_local_options(\n # model=\"ibm-granite/granite-vision-3.3-2b\"\n # )\n\n # Example for the SmolVLM model:\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = vllm_local_options(\n # model=\"HuggingFaceTB/SmolVLM-256M-Instruct\"\n # )\n\n # For using models on LM Studio using the built-in GGUF or MLX runtimes, e.g. the SmolVLM model:\n # (uncomment the following lines)\n pipeline_options.picture_description_options = lms_local_options(\n model=\"smolvlm-256m-instruct\"\n )\n\n # Another possibility is using online services, e.g. watsonx.ai.\n # Using requires setting the env variables WX_API_KEY and WX_PROJECT_ID.\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = watsonx_vlm_options()\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n\n for element, _level in result.document.iterate_items():\n if isinstance(element, PictureItem):\n print(\n f\"Picture {element.self_ref}\\n\"\n f\"Caption: {element.caption_text(doc=result.document)}\\n\"\n f\"Annotations: {element.annotations}\"\n )\n\n\nif __name__ == \"__main__\":\n main()\n import logging import os from pathlib import Path import requests from docling_core.types.doc import PictureItem from dotenv import load_dotenv from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PictureDescriptionApiOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption ### Example of PictureDescriptionApiOptions definitions #### Using vLLM # Models can be launched via: # $ vllm serve MODEL_NAME def vllm_local_options(model: str): options = PictureDescriptionApiOptions( url=\"http://localhost:8000/v1/chat/completions\", params=dict( model=model, seed=42, max_completion_tokens=200, ), prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=90, ) return options #### Using LM Studio def lms_local_options(model: str): options = PictureDescriptionApiOptions( url=\"http://localhost:1234/v1/chat/completions\", params=dict( model=model, seed=42, max_completion_tokens=200, ), prompt=\"Describe the image in three sentences. 
Be consise and accurate.\", timeout=90, ) return options #### Using a cloud service like IBM watsonx.ai def watsonx_vlm_options(): load_dotenv() api_key = os.environ.get(\"WX_API_KEY\") project_id = os.environ.get(\"WX_PROJECT_ID\") def _get_iam_access_token(api_key: str) -> str: res = requests.post( url=\"https://iam.cloud.ibm.com/identity/token\", headers={ \"Content-Type\": \"application/x-www-form-urlencoded\", }, data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\", ) res.raise_for_status() api_out = res.json() print(f\"{api_out=}\") return api_out[\"access_token\"] # Background information in case the model_id is updated: # [1] Official list of models: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx # [2] Info on granite vision 3.3: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-ibm.html?context=wx#granite-vision-3-3-2b options = PictureDescriptionApiOptions( url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\", params=dict( model_id=\"ibm/granite-vision-3-3-2b\", project_id=project_id, parameters=dict( max_new_tokens=400, ), ), headers={ \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key), }, prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=60, ) return options ### Usage and conversion def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = PdfPipelineOptions( enable_remote_services=True # <-- this is required! ) pipeline_options.do_picture_description = True # The PictureDescriptionApiOptions() allows to interface with APIs supporting # the multi-modal chat interface. Here follow a few example on how to configure those. # # One possibility is self-hosting model, e.g. via VLLM. # $ vllm serve MODEL_NAME # Then PictureDescriptionApiOptions can point to the localhost endpoint. # Example for the Granite Vision model: # (uncomment the following lines) # pipeline_options.picture_description_options = vllm_local_options( # model=\"ibm-granite/granite-vision-3.3-2b\" # ) # Example for the SmolVLM model: # (uncomment the following lines) # pipeline_options.picture_description_options = vllm_local_options( # model=\"HuggingFaceTB/SmolVLM-256M-Instruct\" # ) # For using models on LM Studio using the built-in GGUF or MLX runtimes, e.g. the SmolVLM model: # (uncomment the following lines) pipeline_options.picture_description_options = lms_local_options( model=\"smolvlm-256m-instruct\" ) # Another possibility is using online services, e.g. watsonx.ai. # Using requires setting the env variables WX_API_KEY and WX_PROJECT_ID. # (uncomment the following lines) # pipeline_options.picture_description_options = watsonx_vlm_options() doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) result = doc_converter.convert(input_doc_path) for element, _level in result.document.iterate_items(): if isinstance(element, PictureItem): print( f\"Picture {element.self_ref}\\n\" f\"Caption: {element.caption_text(doc=result.document)}\\n\" f\"Annotations: {element.annotations}\" ) if __name__ == \"__main__\": main()"},{"location":"examples/pii_obfuscate/","title":"Detect and obfuscate PII","text":"Detect and obfuscate PII using a Hugging Face NER model.
What this example does
- Converts a test PDF, runs an NER model over the extracted text, and replaces detected person, organization, and location mentions with stable placeholder IDs (e.g. person-1) before exporting Markdown.

Prerequisites
- pip install transformers
- For the GLiNER engine: pip install gliner (if needed for CPU-only envs: pip install torch --extra-index-url https://download.pytorch.org/whl/cpu)
- Optional: point HF_MODEL to a different NER/PII model.

How to run
- python docs/examples/pii_obfuscate.py
- To use the GLiNER engine, pass --engine gliner or set PII_ENGINE=gliner.
- Outputs are written to scratch/.

Notes
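The obfuscation logic below keys off the dictionaries returned by the NER engine. As a quick orientation, here is a minimal sketch, not part of the example script, of what the Hugging Face token-classification pipeline yields with aggregation_strategy="simple" (the sample sentence is made up):

from transformers import pipeline

# Hypothetical smoke test: inspect the fields the obfuscator relies on.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)

for ent in ner("Alice Smith works for Acme Corp in Zurich."):
    # Each entry is a dict along the lines of:
    # {"entity_group": "PER", "word": "Alice Smith", "start": 0, "end": 11, "score": ...}
    print(ent["entity_group"], repr(ent["word"]), ent["start"], ent["end"])

SimplePiiObfuscator maps these entity_group labels (PER, ORG, LOC, ...) to coarse types and swaps each surface string for a stable ID such as person-1, so repeated mentions of the same entity receive the same placeholder.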
import argparse\nimport logging\nimport os\nimport re\nfrom pathlib import Path\nfrom typing import Dict, List, Tuple\n\nfrom docling_core.types.doc import ImageRefMode, TableItem, TextItem\nfrom tabulate import tabulate\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\nIMAGE_RESOLUTION_SCALE = 2.0\nHF_MODEL = \"dslim/bert-base-NER\" # Swap with another HF NER/PII model if desired, eg https://huggingface.co/urchade/gliner_multi_pii-v1 looks very promising too!\nGLINER_MODEL = \"urchade/gliner_multi_pii-v1\"\n\n\ndef _build_simple_ner_pipeline():\n \"\"\"Create a Hugging Face token-classification pipeline for NER.\n\n Returns a callable like: ner(text) -> List[dict]\n \"\"\"\n try:\n from transformers import (\n AutoModelForTokenClassification,\n AutoTokenizer,\n pipeline,\n )\n except Exception:\n _log.error(\"Transformers not installed. Please run: pip install transformers\")\n raise\n\n tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)\n model = AutoModelForTokenClassification.from_pretrained(HF_MODEL)\n ner = pipeline(\n \"token-classification\",\n model=model,\n tokenizer=tokenizer,\n aggregation_strategy=\"simple\", # groups subwords into complete entities\n # Note: modern Transformers returns `start`/`end` when possible with aggregation\n )\n return ner\n\n\nclass SimplePiiObfuscator:\n \"\"\"Tracks PII strings and replaces them with stable IDs per entity type.\"\"\"\n\n def __init__(self, ner_callable):\n self.ner = ner_callable\n self.entity_map: Dict[str, str] = {}\n self.counters: Dict[str, int] = {\n \"person\": 0,\n \"org\": 0,\n \"location\": 0,\n \"misc\": 0,\n }\n # Map model labels to our coarse types\n self.label_map = {\n \"PER\": \"person\",\n \"PERSON\": \"person\",\n \"ORG\": \"org\",\n \"ORGANIZATION\": \"org\",\n \"LOC\": \"location\",\n \"LOCATION\": \"location\",\n \"GPE\": \"location\",\n # Fallbacks\n \"MISC\": \"misc\",\n \"O\": \"misc\",\n }\n # Only obfuscate these by default. Adjust as needed.\n self.allowed_types = {\"person\", \"org\", \"location\"}\n\n def _next_id(self, typ: str) -> str:\n self.counters[typ] += 1\n return f\"{typ}-{self.counters[typ]}\"\n\n def _normalize(self, s: str) -> str:\n return re.sub(r\"\\s+\", \" \", s).strip()\n\n def _extract_entities(self, text: str) -> List[Tuple[str, str]]:\n \"\"\"Run NER and return a list of (surface_text, type) to obfuscate.\"\"\"\n if not text:\n return []\n results = self.ner(text)\n # Collect normalized items with optional span info\n items = []\n for r in results:\n raw_label = r.get(\"entity_group\") or r.get(\"entity\") or \"MISC\"\n label = self.label_map.get(raw_label, \"misc\")\n if label not in self.allowed_types:\n continue\n start = r.get(\"start\")\n end = r.get(\"end\")\n word = self._normalize(r.get(\"word\") or r.get(\"text\") or \"\")\n items.append({\"label\": label, \"start\": start, \"end\": end, \"word\": word})\n\n found: List[Tuple[str, str]] = []\n # If the pipeline provides character spans, merge consecutive/overlapping\n # entities of the same type into a single span, then take the substring\n # from the original text. 
This handles cases like subword tokenization\n # where multiple adjacent pieces belong to the same named entity.\n have_spans = any(i[\"start\"] is not None and i[\"end\"] is not None for i in items)\n if have_spans:\n spans = [\n i for i in items if i[\"start\"] is not None and i[\"end\"] is not None\n ]\n # Ensure processing order by start (then end)\n spans.sort(key=lambda x: (x[\"start\"], x[\"end\"]))\n\n merged = []\n for s in spans:\n if not merged:\n merged.append(dict(s))\n continue\n last = merged[-1]\n if s[\"label\"] == last[\"label\"] and s[\"start\"] <= last[\"end\"]:\n # Merge identical, overlapping, or touching spans of same type\n last[\"start\"] = min(last[\"start\"], s[\"start\"])\n last[\"end\"] = max(last[\"end\"], s[\"end\"])\n else:\n merged.append(dict(s))\n\n for m in merged:\n surface = self._normalize(text[m[\"start\"] : m[\"end\"]])\n if surface:\n found.append((surface, m[\"label\"]))\n\n # Include any items lacking spans as-is (fallback)\n for i in items:\n if i[\"start\"] is None or i[\"end\"] is None:\n if i[\"word\"]:\n found.append((i[\"word\"], i[\"label\"]))\n else:\n # Fallback when spans aren't provided: return normalized words\n for i in items:\n if i[\"word\"]:\n found.append((i[\"word\"], i[\"label\"]))\n return found\n\n def obfuscate_text(self, text: str) -> str:\n if not text:\n return text\n\n entities = self._extract_entities(text)\n if not entities:\n return text\n\n # Deduplicate per text, keep stable global mapping\n unique_words: Dict[str, str] = {}\n for word, label in entities:\n if word not in self.entity_map:\n replacement = self._next_id(label)\n self.entity_map[word] = replacement\n unique_words[word] = self.entity_map[word]\n\n # Replace longer matches first to avoid partial overlaps\n sorted_pairs = sorted(\n unique_words.items(), key=lambda x: len(x[0]), reverse=True\n )\n\n def replace_once(s: str, old: str, new: str) -> str:\n # Use simple substring replacement; for stricter matching, use word boundaries\n # when appropriate (e.g., names). This is a demo, keep it simple.\n pattern = re.escape(old)\n return re.sub(pattern, new, s)\n\n obfuscated = text\n for old, new in sorted_pairs:\n obfuscated = replace_once(obfuscated, old, new)\n return obfuscated\n\n\ndef _build_gliner_model():\n \"\"\"Create a GLiNER model for PII-like entity extraction.\n\n Returns a tuple (model, labels) where model.predict_entities(text, labels)\n yields entities with \"text\" and \"label\" fields.\n \"\"\"\n try:\n from gliner import GLiNER # type: ignore\n except Exception:\n _log.error(\n \"GLiNER not installed. Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu\"\n )\n raise\n\n model = GLiNER.from_pretrained(GLINER_MODEL)\n # Curated set of labels for PII detection. Adjust as needed.\n labels = [\n # \"work\",\n \"booking number\",\n \"personally identifiable information\",\n \"driver licence\",\n \"person\",\n \"full address\",\n \"company\",\n # \"actor\",\n # \"character\",\n \"email\",\n \"passport number\",\n \"Social Security Number\",\n \"phone number\",\n ]\n return model, labels\n\n\nclass AdvancedPIIObfuscator:\n \"\"\"PII obfuscator powered by GLiNER with fine-grained labels.\n\n - Uses GLiNER's `predict_entities(text, labels)` to detect entities.\n - Obfuscates with stable IDs per fine-grained label, e.g. 
`email-1`.\n \"\"\"\n\n def __init__(self, gliner_model, labels: List[str]):\n self.model = gliner_model\n self.labels = labels\n self.entity_map: Dict[str, str] = {}\n self.counters: Dict[str, int] = {}\n\n def _normalize(self, s: str) -> str:\n return re.sub(r\"\\s+\", \" \", s).strip()\n\n def _norm_label(self, label: str) -> str:\n return (\n re.sub(\n r\"[^a-z0-9_]+\", \"_\", label.lower().replace(\" \", \"_\").replace(\"-\", \"_\")\n ).strip(\"_\")\n or \"pii\"\n )\n\n def _next_id(self, typ: str) -> str:\n self.cc(typ)\n self.counters[typ] += 1\n return f\"{typ}-{self.counters[typ]}\"\n\n def cc(self, typ: str) -> None:\n if typ not in self.counters:\n self.counters[typ] = 0\n\n def _extract_entities(self, text: str) -> List[Tuple[str, str]]:\n if not text:\n return []\n results = self.model.predict_entities(\n text, self.labels\n ) # expects dicts with text/label\n found: List[Tuple[str, str]] = []\n for r in results:\n label = self._norm_label(str(r.get(\"label\", \"pii\")))\n surface = self._normalize(str(r.get(\"text\", \"\")))\n if surface:\n found.append((surface, label))\n return found\n\n def obfuscate_text(self, text: str) -> str:\n if not text:\n return text\n entities = self._extract_entities(text)\n if not entities:\n return text\n\n unique_words: Dict[str, str] = {}\n for word, label in entities:\n if word not in self.entity_map:\n replacement = self._next_id(label)\n self.entity_map[word] = replacement\n unique_words[word] = self.entity_map[word]\n\n sorted_pairs = sorted(\n unique_words.items(), key=lambda x: len(x[0]), reverse=True\n )\n\n def replace_once(s: str, old: str, new: str) -> str:\n pattern = re.escape(old)\n return re.sub(pattern, new, s)\n\n obfuscated = text\n for old, new in sorted_pairs:\n obfuscated = replace_once(obfuscated, old, new)\n return obfuscated\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\") # ensure this directory exists before saving\n\n # Choose engine via CLI flag or env var (default: hf)\n parser = argparse.ArgumentParser(description=\"PII obfuscation example\")\n parser.add_argument(\n \"--engine\",\n choices=[\"hf\", \"gliner\"],\n default=os.getenv(\"PII_ENGINE\", \"hf\"),\n help=\"NER engine: 'hf' (Transformers) or 'gliner' (GLiNER)\",\n )\n args = parser.parse_args()\n\n # Ensure output dir exists\n output_dir.mkdir(parents=True, exist_ok=True)\n\n # Keep and generate images so Markdown can embed them\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n conv_res = doc_converter.convert(input_doc_path)\n conv_doc = conv_res.document\n doc_filename = conv_res.input.file.name\n\n # Save markdown with embedded pictures in original text\n md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n # Build NER pipeline and obfuscator\n if args.engine == \"gliner\":\n _log.info(\"Using GLiNER-based AdvancedPIIObfuscator\")\n gliner_model, gliner_labels = _build_gliner_model()\n obfuscator = AdvancedPIIObfuscator(gliner_model, gliner_labels)\n else:\n _log.info(\"Using HF Transformers-based SimplePiiObfuscator\")\n 
ner = _build_simple_ner_pipeline()\n obfuscator = SimplePiiObfuscator(ner)\n\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TextItem):\n element.orig = element.text\n element.text = obfuscator.obfuscate_text(element.text)\n # print(element.orig, \" => \", element.text)\n\n elif isinstance(element, TableItem):\n for cell in element.data.table_cells:\n cell.text = obfuscator.obfuscate_text(cell.text)\n\n # Save markdown with embedded pictures and obfuscated text\n md_filename = output_dir / f\"{doc_filename}-with-images-pii-obfuscated.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n # Optional: log mapping summary\n if obfuscator.entity_map:\n data = []\n for key, val in obfuscator.entity_map.items():\n data.append([key, val])\n\n _log.info(\n f\"Obfuscated entities:\\n\\n{tabulate(data)}\",\n )\n\n\nif __name__ == \"__main__\":\n main()\n import argparse import logging import os import re from pathlib import Path from typing import Dict, List, Tuple from docling_core.types.doc import ImageRefMode, TableItem, TextItem from tabulate import tabulate from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) IMAGE_RESOLUTION_SCALE = 2.0 HF_MODEL = \"dslim/bert-base-NER\" # Swap with another HF NER/PII model if desired, eg https://huggingface.co/urchade/gliner_multi_pii-v1 looks very promising too! GLINER_MODEL = \"urchade/gliner_multi_pii-v1\" def _build_simple_ner_pipeline(): \"\"\"Create a Hugging Face token-classification pipeline for NER. Returns a callable like: ner(text) -> List[dict] \"\"\" try: from transformers import ( AutoModelForTokenClassification, AutoTokenizer, pipeline, ) except Exception: _log.error(\"Transformers not installed. Please run: pip install transformers\") raise tokenizer = AutoTokenizer.from_pretrained(HF_MODEL) model = AutoModelForTokenClassification.from_pretrained(HF_MODEL) ner = pipeline( \"token-classification\", model=model, tokenizer=tokenizer, aggregation_strategy=\"simple\", # groups subwords into complete entities # Note: modern Transformers returns `start`/`end` when possible with aggregation ) return ner class SimplePiiObfuscator: \"\"\"Tracks PII strings and replaces them with stable IDs per entity type.\"\"\" def __init__(self, ner_callable): self.ner = ner_callable self.entity_map: Dict[str, str] = {} self.counters: Dict[str, int] = { \"person\": 0, \"org\": 0, \"location\": 0, \"misc\": 0, } # Map model labels to our coarse types self.label_map = { \"PER\": \"person\", \"PERSON\": \"person\", \"ORG\": \"org\", \"ORGANIZATION\": \"org\", \"LOC\": \"location\", \"LOCATION\": \"location\", \"GPE\": \"location\", # Fallbacks \"MISC\": \"misc\", \"O\": \"misc\", } # Only obfuscate these by default. Adjust as needed. 
self.allowed_types = {\"person\", \"org\", \"location\"} def _next_id(self, typ: str) -> str: self.counters[typ] += 1 return f\"{typ}-{self.counters[typ]}\" def _normalize(self, s: str) -> str: return re.sub(r\"\\s+\", \" \", s).strip() def _extract_entities(self, text: str) -> List[Tuple[str, str]]: \"\"\"Run NER and return a list of (surface_text, type) to obfuscate.\"\"\" if not text: return [] results = self.ner(text) # Collect normalized items with optional span info items = [] for r in results: raw_label = r.get(\"entity_group\") or r.get(\"entity\") or \"MISC\" label = self.label_map.get(raw_label, \"misc\") if label not in self.allowed_types: continue start = r.get(\"start\") end = r.get(\"end\") word = self._normalize(r.get(\"word\") or r.get(\"text\") or \"\") items.append({\"label\": label, \"start\": start, \"end\": end, \"word\": word}) found: List[Tuple[str, str]] = [] # If the pipeline provides character spans, merge consecutive/overlapping # entities of the same type into a single span, then take the substring # from the original text. This handles cases like subword tokenization # where multiple adjacent pieces belong to the same named entity. have_spans = any(i[\"start\"] is not None and i[\"end\"] is not None for i in items) if have_spans: spans = [ i for i in items if i[\"start\"] is not None and i[\"end\"] is not None ] # Ensure processing order by start (then end) spans.sort(key=lambda x: (x[\"start\"], x[\"end\"])) merged = [] for s in spans: if not merged: merged.append(dict(s)) continue last = merged[-1] if s[\"label\"] == last[\"label\"] and s[\"start\"] <= last[\"end\"]: # Merge identical, overlapping, or touching spans of same type last[\"start\"] = min(last[\"start\"], s[\"start\"]) last[\"end\"] = max(last[\"end\"], s[\"end\"]) else: merged.append(dict(s)) for m in merged: surface = self._normalize(text[m[\"start\"] : m[\"end\"]]) if surface: found.append((surface, m[\"label\"])) # Include any items lacking spans as-is (fallback) for i in items: if i[\"start\"] is None or i[\"end\"] is None: if i[\"word\"]: found.append((i[\"word\"], i[\"label\"])) else: # Fallback when spans aren't provided: return normalized words for i in items: if i[\"word\"]: found.append((i[\"word\"], i[\"label\"])) return found def obfuscate_text(self, text: str) -> str: if not text: return text entities = self._extract_entities(text) if not entities: return text # Deduplicate per text, keep stable global mapping unique_words: Dict[str, str] = {} for word, label in entities: if word not in self.entity_map: replacement = self._next_id(label) self.entity_map[word] = replacement unique_words[word] = self.entity_map[word] # Replace longer matches first to avoid partial overlaps sorted_pairs = sorted( unique_words.items(), key=lambda x: len(x[0]), reverse=True ) def replace_once(s: str, old: str, new: str) -> str: # Use simple substring replacement; for stricter matching, use word boundaries # when appropriate (e.g., names). This is a demo, keep it simple. pattern = re.escape(old) return re.sub(pattern, new, s) obfuscated = text for old, new in sorted_pairs: obfuscated = replace_once(obfuscated, old, new) return obfuscated def _build_gliner_model(): \"\"\"Create a GLiNER model for PII-like entity extraction. Returns a tuple (model, labels) where model.predict_entities(text, labels) yields entities with \"text\" and \"label\" fields. \"\"\" try: from gliner import GLiNER # type: ignore except Exception: _log.error( \"GLiNER not installed. 
Please run: pip install gliner torch --extra-index-url https://download.pytorch.org/whl/cpu\" ) raise model = GLiNER.from_pretrained(GLINER_MODEL) # Curated set of labels for PII detection. Adjust as needed. labels = [ # \"work\", \"booking number\", \"personally identifiable information\", \"driver licence\", \"person\", \"full address\", \"company\", # \"actor\", # \"character\", \"email\", \"passport number\", \"Social Security Number\", \"phone number\", ] return model, labels class AdvancedPIIObfuscator: \"\"\"PII obfuscator powered by GLiNER with fine-grained labels. - Uses GLiNER's `predict_entities(text, labels)` to detect entities. - Obfuscates with stable IDs per fine-grained label, e.g. `email-1`. \"\"\" def __init__(self, gliner_model, labels: List[str]): self.model = gliner_model self.labels = labels self.entity_map: Dict[str, str] = {} self.counters: Dict[str, int] = {} def _normalize(self, s: str) -> str: return re.sub(r\"\\s+\", \" \", s).strip() def _norm_label(self, label: str) -> str: return ( re.sub( r\"[^a-z0-9_]+\", \"_\", label.lower().replace(\" \", \"_\").replace(\"-\", \"_\") ).strip(\"_\") or \"pii\" ) def _next_id(self, typ: str) -> str: self.cc(typ) self.counters[typ] += 1 return f\"{typ}-{self.counters[typ]}\" def cc(self, typ: str) -> None: if typ not in self.counters: self.counters[typ] = 0 def _extract_entities(self, text: str) -> List[Tuple[str, str]]: if not text: return [] results = self.model.predict_entities( text, self.labels ) # expects dicts with text/label found: List[Tuple[str, str]] = [] for r in results: label = self._norm_label(str(r.get(\"label\", \"pii\"))) surface = self._normalize(str(r.get(\"text\", \"\"))) if surface: found.append((surface, label)) return found def obfuscate_text(self, text: str) -> str: if not text: return text entities = self._extract_entities(text) if not entities: return text unique_words: Dict[str, str] = {} for word, label in entities: if word not in self.entity_map: replacement = self._next_id(label) self.entity_map[word] = replacement unique_words[word] = self.entity_map[word] sorted_pairs = sorted( unique_words.items(), key=lambda x: len(x[0]), reverse=True ) def replace_once(s: str, old: str, new: str) -> str: pattern = re.escape(old) return re.sub(pattern, new, s) obfuscated = text for old, new in sorted_pairs: obfuscated = replace_once(obfuscated, old, new) return obfuscated def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # ensure this directory exists before saving # Choose engine via CLI flag or env var (default: hf) parser = argparse.ArgumentParser(description=\"PII obfuscation example\") parser.add_argument( \"--engine\", choices=[\"hf\", \"gliner\"], default=os.getenv(\"PII_ENGINE\", \"hf\"), help=\"NER engine: 'hf' (Transformers) or 'gliner' (GLiNER)\", ) args = parser.parse_args() # Ensure output dir exists output_dir.mkdir(parents=True, exist_ok=True) # Keep and generate images so Markdown can embed them pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) conv_res = doc_converter.convert(input_doc_path) conv_doc = conv_res.document doc_filename = conv_res.input.file.name # Save markdown with embedded 
pictures in original text md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) # Build NER pipeline and obfuscator if args.engine == \"gliner\": _log.info(\"Using GLiNER-based AdvancedPIIObfuscator\") gliner_model, gliner_labels = _build_gliner_model() obfuscator = AdvancedPIIObfuscator(gliner_model, gliner_labels) else: _log.info(\"Using HF Transformers-based SimplePiiObfuscator\") ner = _build_simple_ner_pipeline() obfuscator = SimplePiiObfuscator(ner) for element, _level in conv_res.document.iterate_items(): if isinstance(element, TextItem): element.orig = element.text element.text = obfuscator.obfuscate_text(element.text) # print(element.orig, \" => \", element.text) elif isinstance(element, TableItem): for cell in element.data.table_cells: cell.text = obfuscator.obfuscate_text(cell.text) # Save markdown with embedded pictures and obfuscated text md_filename = output_dir / f\"{doc_filename}-with-images-pii-obfuscated.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) # Optional: log mapping summary if obfuscator.entity_map: data = [] for key, val in obfuscator.entity_map.items(): data.append([key, val]) _log.info( f\"Obfuscated entities:\\n\\n{tabulate(data)}\", ) if __name__ == \"__main__\": main()"},{"location":"examples/post_process_ocr_with_vlm/","title":"Post process ocr with vlm","text":"In\u00a0[\u00a0]: Copied! import argparse\nimport logging\nimport os\nimport re\nfrom collections.abc import Iterable\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\nfrom typing import Any, Optional, Union\nimport argparse import logging import os import re from collections.abc import Iterable from concurrent.futures import ThreadPoolExecutor from pathlib import Path from typing import Any, Optional, Union In\u00a0[\u00a0]: Copied!
import numpy as np\nfrom docling_core.types.doc import (\n DoclingDocument,\n ImageRefMode,\n NodeItem,\n TextItem,\n)\nfrom docling_core.types.doc.document import (\n ContentLayer,\n DocItem,\n FormItem,\n GraphCell,\n KeyValueItem,\n PictureItem,\n RichTableCell,\n TableCell,\n TableItem,\n)\nfrom PIL import Image, ImageFilter\nfrom PIL.ImageOps import crop\nfrom pydantic import BaseModel, ConfigDict\nfrom tqdm import tqdm
from docling.backend.json.docling_json_backend import DoclingJSONBackend\nfrom docling.datamodel.accelerator_options import AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import (\n ConvertPipelineOptions,\n PdfPipelineOptions,\n PictureDescriptionApiOptions,\n)\nfrom docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption\nfrom docling.exceptions import OperationNotAllowed\nfrom docling.models.base_model import BaseModelWithOptions, GenericEnrichmentModel\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.utils.api_image_request import api_image_request\nfrom docling.utils.profiling import ProfilingScope, TimeRecorder\nfrom docling.utils.utils import chunkify
Example of applying OCR to a Docling document as a post-processing step with "nanonets-ocr2-3b" via LM Studio. Requires a running LM Studio inference server with the "nanonets-ocr2-3b" model pre-loaded. To run: uv run python docs/examples/post_process_ocr_with_vlm.py
LM_STUDIO_URL = \"http://localhost:1234/v1/chat/completions\"\nLM_STUDIO_MODEL = \"nanonets-ocr2-3b\"
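Before launching the script, it can help to confirm that the LM Studio server is reachable and the model is loaded. A minimal sketch, assuming LM Studio's OpenAI-compatible /v1/models listing is available at the URL above:

import requests

# Hypothetical connectivity check against the local LM Studio server.
resp = requests.get("http://localhost:1234/v1/models", timeout=5)
resp.raise_for_status()
loaded = [m["id"] for m in resp.json().get("data", [])]
print("Models available:", loaded)
if LM_STUDIO_MODEL not in loaded:
    print(f"Load '{LM_STUDIO_MODEL}' in LM Studio before running the pipeline.")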
DEFAULT_PROMPT = \"Extract the text from the above document as if you were reading it naturally. Output pure text, no html and no markdown. Pay attention to line breaks and don't miss text after a line break. Put all text in one line.\"\nVERBOSE = True\nSHOW_IMAGE = False\nSHOW_EMPTY_CROPS = False\nSHOW_NONEMPTY_CROPS = False\nPRINT_RESULT_MARKDOWN = False
def is_empty_fast_with_lines_pil(\n pil_img: Image.Image,\n downscale_max_side: int = 48, # 64\n grad_threshold: float = 15.0, # how strong a gradient must be to count as edge\n min_line_coverage: float = 0.6, # line must cover 60% of height/width\n max_allowed_lines: int = 10, # allow up to this many strong lines (default 4)\n edge_fraction_threshold: float = 0.0035,\n):\n \"\"\"\n Fast 'empty' detector using only PIL + NumPy.\n\n Treats an image as empty if:\n - It has very few edges overall, OR\n - Edges can be explained by at most `max_allowed_lines` long vertical/horizontal lines.\n\n Returns:\n (is_empty: bool, remaining_edge_fraction: float, debug: dict)\n \"\"\"\n\n # 1) Convert to grayscale\n gray = pil_img.convert(\"L\")\n\n # 2) Aggressive downscale, keeping aspect ratio\n w0, h0 = gray.size\n max_side = max(w0, h0)\n if max_side > downscale_max_side:\n # scale = downscale_max_side / max_side\n # new_w = max(1, int(w0 * scale))\n # new_h = max(1, int(h0 * scale))\n\n new_w = downscale_max_side\n new_h = downscale_max_side\n\n gray = gray.resize((new_w, new_h), resample=Image.BILINEAR)\n\n w, h = gray.size\n if w == 0 or h == 0:\n return True, 0.0, {\"reason\": \"zero_size\"}\n\n # 3) Small blur to reduce noise\n gray = gray.filter(ImageFilter.BoxBlur(1))\n\n # 4) Convert to NumPy\n arr = np.asarray(\n gray, dtype=np.float32\n ) # shape (h, w) in PIL, but note: PIL size is (w, h)\n H, W = arr.shape\n\n # 5) Compute simple gradients (forward differences)\n gx = np.zeros_like(arr)\n gy = np.zeros_like(arr)\n\n gx[:, :-1] = arr[:, 1:] - arr[:, :-1] # horizontal differences\n gy[:-1, :] = arr[1:, :] - arr[:-1, :] # vertical differences\n\n mag = np.hypot(gx, gy) # gradient magnitude\n\n # 6) Threshold gradients to get edges (boolean mask)\n edges = mag > grad_threshold\n edge_fraction = edges.mean()\n\n # Quick early-exit: almost no edges => empty\n if edge_fraction < edge_fraction_threshold:\n return True, float(edge_fraction), {\"reason\": \"few_edges\"}\n\n # 7) Detect strong vertical & horizontal lines via edge sums\n col_sum = edges.sum(axis=0) # per column\n row_sum = edges.sum(axis=1) # per row\n\n # Line must have edge pixels in at least `min_line_coverage` of the dimension\n vert_line_cols = np.where(col_sum >= min_line_coverage * H)[0]\n horiz_line_rows = np.where(row_sum >= min_line_coverage * W)[0]\n\n num_lines = len(vert_line_cols) + len(horiz_line_rows)\n\n # If we have more long lines than allowed => non-empty\n if num_lines > max_allowed_lines:\n return (\n False,\n float(edge_fraction),\n {\n \"reason\": \"too_many_lines\",\n \"num_lines\": int(num_lines),\n \"edge_fraction\": float(edge_fraction),\n },\n )\n\n # 8) Mask out those lines and recompute remaining edges\n line_mask = np.zeros_like(edges, dtype=bool)\n if len(vert_line_cols) > 0:\n line_mask[:, vert_line_cols] = True\n if len(horiz_line_rows) > 0:\n line_mask[horiz_line_rows, :] = True\n\n remaining_edges = edges & ~line_mask\n remaining_edge_fraction = remaining_edges.mean()\n\n is_empty = remaining_edge_fraction < edge_fraction_threshold\n\n debug = {\n \"original_edge_fraction\": float(edge_fraction),\n \"remaining_edge_fraction\": float(remaining_edge_fraction),\n \"num_vert_lines\": len(vert_line_cols),\n \"num_horiz_lines\": len(horiz_line_rows),\n }\n return is_empty, float(remaining_edge_fraction), debug\n def is_empty_fast_with_lines_pil( pil_img: Image.Image, downscale_max_side: int = 48, # 64 grad_threshold: float = 15.0, # how strong a gradient must be to count as edge min_line_coverage: 
float = 0.6, # line must cover 60% of height/width max_allowed_lines: int = 10, # allow up to this many strong lines (default 4) edge_fraction_threshold: float = 0.0035, ): \"\"\" Fast 'empty' detector using only PIL + NumPy. Treats an image as empty if: - It has very few edges overall, OR - Edges can be explained by at most `max_allowed_lines` long vertical/horizontal lines. Returns: (is_empty: bool, remaining_edge_fraction: float, debug: dict) \"\"\" # 1) Convert to grayscale gray = pil_img.convert(\"L\") # 2) Aggressive downscale, keeping aspect ratio w0, h0 = gray.size max_side = max(w0, h0) if max_side > downscale_max_side: # scale = downscale_max_side / max_side # new_w = max(1, int(w0 * scale)) # new_h = max(1, int(h0 * scale)) new_w = downscale_max_side new_h = downscale_max_side gray = gray.resize((new_w, new_h), resample=Image.BILINEAR) w, h = gray.size if w == 0 or h == 0: return True, 0.0, {\"reason\": \"zero_size\"} # 3) Small blur to reduce noise gray = gray.filter(ImageFilter.BoxBlur(1)) # 4) Convert to NumPy arr = np.asarray( gray, dtype=np.float32 ) # shape (h, w) in PIL, but note: PIL size is (w, h) H, W = arr.shape # 5) Compute simple gradients (forward differences) gx = np.zeros_like(arr) gy = np.zeros_like(arr) gx[:, :-1] = arr[:, 1:] - arr[:, :-1] # horizontal differences gy[:-1, :] = arr[1:, :] - arr[:-1, :] # vertical differences mag = np.hypot(gx, gy) # gradient magnitude # 6) Threshold gradients to get edges (boolean mask) edges = mag > grad_threshold edge_fraction = edges.mean() # Quick early-exit: almost no edges => empty if edge_fraction < edge_fraction_threshold: return True, float(edge_fraction), {\"reason\": \"few_edges\"} # 7) Detect strong vertical & horizontal lines via edge sums col_sum = edges.sum(axis=0) # per column row_sum = edges.sum(axis=1) # per row # Line must have edge pixels in at least `min_line_coverage` of the dimension vert_line_cols = np.where(col_sum >= min_line_coverage * H)[0] horiz_line_rows = np.where(row_sum >= min_line_coverage * W)[0] num_lines = len(vert_line_cols) + len(horiz_line_rows) # If we have more long lines than allowed => non-empty if num_lines > max_allowed_lines: return ( False, float(edge_fraction), { \"reason\": \"too_many_lines\", \"num_lines\": int(num_lines), \"edge_fraction\": float(edge_fraction), }, ) # 8) Mask out those lines and recompute remaining edges line_mask = np.zeros_like(edges, dtype=bool) if len(vert_line_cols) > 0: line_mask[:, vert_line_cols] = True if len(horiz_line_rows) > 0: line_mask[horiz_line_rows, :] = True remaining_edges = edges & ~line_mask remaining_edge_fraction = remaining_edges.mean() is_empty = remaining_edge_fraction < edge_fraction_threshold debug = { \"original_edge_fraction\": float(edge_fraction), \"remaining_edge_fraction\": float(remaining_edge_fraction), \"num_vert_lines\": len(vert_line_cols), \"num_horiz_lines\": len(horiz_line_rows), } return is_empty, float(remaining_edge_fraction), debug In\u00a0[\u00a0]: Copied! def remove_break_lines(text: str) -> str:\n # Replace any newline types with a single space\n cleaned = re.sub(r\"[\\r\\n]+\", \" \", text)\n # Collapse multiple spaces into one\n cleaned = re.sub(r\"\\s+\", \" \", cleaned)\n return cleaned.strip()\ndef remove_break_lines(text: str) -> str: # Replace any newline types with a single space cleaned = re.sub(r\"[\\r\\n]+\", \" \", text) # Collapse multiple spaces into one cleaned = re.sub(r\"\\s+\", \" \", cleaned) return cleaned.strip() In\u00a0[\u00a0]: Copied!
def safe_crop(img: Image.Image, bbox):\n left, top, right, bottom = bbox\n # Clamp to image boundaries\n left = max(0, min(left, img.width))\n top = max(0, min(top, img.height))\n right = max(0, min(right, img.width))\n bottom = max(0, min(bottom, img.height))\n return img.crop((left, top, right, bottom))
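A quick illustration of the clamping behaviour, with made-up dimensions: a box that spills over the right and bottom edges is reduced to the valid region instead of raising an error.

from PIL import Image

canvas = Image.new("RGB", (100, 80))
# The requested box (50, 40, 150, 120) exceeds the 100x80 canvas...
crop = safe_crop(canvas, (50, 40, 150, 120))
# ...so the effective box becomes (50, 40, 100, 80).
print(crop.size)  # (50, 40)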
def no_long_repeats(s: str, threshold: int) -> bool:\n \"\"\"\n Returns False if the string `s` contains more than `threshold`\n identical characters in a row, otherwise True.\n \"\"\"\n pattern = r\"(.)\\1{\" + str(threshold) + \",}\"\n return re.search(pattern, s) is None
class PostOcrEnrichmentElement(BaseModel):\n model_config = ConfigDict(arbitrary_types_allowed=True)\n\n item: Union[DocItem, TableCell, RichTableCell, GraphCell]\n image: list[\n Image.Image\n ] # Needs to be a list of images for multi-provenance elements
class PostOcrEnrichmentPipelineOptions(ConvertPipelineOptions):\n api_options: PictureDescriptionApiOptions
class PostOcrEnrichmentPipeline(SimplePipeline):\n def __init__(self, pipeline_options: PostOcrEnrichmentPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: PostOcrEnrichmentPipelineOptions\n\n self.enrichment_pipe = [\n PostOcrApiEnrichmentModel(\n enabled=True,\n enable_remote_services=True,\n artifacts_path=None,\n options=self.pipeline_options.api_options,\n accelerator_options=AcceleratorOptions(),\n )\n ]\n\n @classmethod\n def get_default_options(cls) -> PostOcrEnrichmentPipelineOptions:\n return PostOcrEnrichmentPipelineOptions()\n\n def _enrich_document(self, conv_res: ConversionResult) -> ConversionResult:\n def _prepare_elements(\n conv_res: ConversionResult, model: GenericEnrichmentModel[Any]\n ) -> Iterable[NodeItem]:\n for doc_element, _level in conv_res.document.iterate_items(\n traverse_pictures=True,\n included_content_layers={\n ContentLayer.BODY,\n ContentLayer.FURNITURE,\n },\n ): # With all content layers, with traverse_pictures=True\n prepared_elements = (\n model.prepare_element( # make this one yield multiple items.\n conv_res=conv_res, element=doc_element\n )\n )\n if prepared_elements is not None:\n yield prepared_elements\n\n with TimeRecorder(conv_res, \"doc_enrich\", scope=ProfilingScope.DOCUMENT):\n for model in self.enrichment_pipe:\n for element_batch in chunkify(\n _prepare_elements(conv_res, model),\n model.elements_batch_size,\n ):\n for element in model(\n doc=conv_res.document, element_batch=element_batch\n ): # Must exhaust!\n pass\n return conv_res\n class PostOcrEnrichmentPipeline(SimplePipeline): def __init__(self, pipeline_options: PostOcrEnrichmentPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: PostOcrEnrichmentPipelineOptions self.enrichment_pipe = [ PostOcrApiEnrichmentModel( enabled=True, enable_remote_services=True, artifacts_path=None, options=self.pipeline_options.api_options, accelerator_options=AcceleratorOptions(), ) ] @classmethod def get_default_options(cls) -> PostOcrEnrichmentPipelineOptions: return PostOcrEnrichmentPipelineOptions() def _enrich_document(self, conv_res: ConversionResult) -> ConversionResult: def _prepare_elements( conv_res: ConversionResult, model: GenericEnrichmentModel[Any] ) -> Iterable[NodeItem]: for doc_element, _level in conv_res.document.iterate_items( traverse_pictures=True, included_content_layers={ ContentLayer.BODY, ContentLayer.FURNITURE, }, ): # With all content layers, with traverse_pictures=True prepared_elements = ( model.prepare_element( # make this one yield multiple items. conv_res=conv_res, element=doc_element ) ) if prepared_elements is not None: yield prepared_elements with TimeRecorder(conv_res, \"doc_enrich\", scope=ProfilingScope.DOCUMENT): for model in self.enrichment_pipe: for element_batch in chunkify( _prepare_elements(conv_res, model), model.elements_batch_size, ): for element in model( doc=conv_res.document, element_batch=element_batch ): # Must exhaust! pass return conv_res In\u00a0[\u00a0]: Copied! 
class PostOcrApiEnrichmentModel(\n GenericEnrichmentModel[PostOcrEnrichmentElement], BaseModelWithOptions\n):\n expansion_factor: float = 0.001\n\n def prepare_element(\n self, conv_res: ConversionResult, element: NodeItem\n ) -> Optional[list[PostOcrEnrichmentElement]]:\n if not self.is_processable(doc=conv_res.document, element=element):\n return None\n\n allowed = (DocItem, TableItem, GraphCell)\n assert isinstance(element, allowed)\n\n if isinstance(element, (KeyValueItem, FormItem)):\n # Yield from the graphCells inside here.\n result = []\n for c in element.graph.cells:\n element_prov = c.prov # Key / Value have only one provenance!\n bbox = element_prov.bbox\n page_ix = element_prov.page_no\n bbox = bbox.scale_to_size(\n old_size=conv_res.document.pages[page_ix].size,\n new_size=conv_res.document.pages[page_ix].image.size,\n )\n expanded_bbox = bbox.expand_by_scale(\n x_scale=self.expansion_factor, y_scale=self.expansion_factor\n ).to_top_left_origin(\n page_height=conv_res.document.pages[page_ix].image.size.height\n )\n\n good_bbox = True\n if (\n expanded_bbox.l > expanded_bbox.r\n or expanded_bbox.t > expanded_bbox.b\n ):\n good_bbox = False\n\n if good_bbox:\n cropped_image = conv_res.document.pages[\n page_ix\n ].image.pil_image.crop(expanded_bbox.as_tuple())\n\n is_empty, rem_frac, debug = is_empty_fast_with_lines_pil(\n cropped_image\n )\n if is_empty:\n if SHOW_EMPTY_CROPS:\n try:\n cropped_image.show()\n except Exception as e:\n print(f\"Error with image: {e}\")\n print(\n f\"Detected empty form item image crop: {rem_frac} - {debug}\"\n )\n else:\n result.append(\n PostOcrEnrichmentElement(item=c, image=[cropped_image])\n )\n return result\n elif isinstance(element, TableItem):\n element_prov = element.prov[0]\n page_ix = element_prov.page_no\n result = []\n for i, row in enumerate(element.data.grid):\n for j, cell in enumerate(row):\n if hasattr(cell, \"bbox\"):\n if cell.bbox:\n bbox = cell.bbox\n bbox = bbox.scale_to_size(\n old_size=conv_res.document.pages[page_ix].size,\n new_size=conv_res.document.pages[page_ix].image.size,\n )\n\n expanded_bbox = bbox.expand_by_scale(\n x_scale=self.table_cell_expansion_factor,\n y_scale=self.table_cell_expansion_factor,\n ).to_top_left_origin(\n page_height=conv_res.document.pages[\n page_ix\n ].image.size.height\n )\n\n good_bbox = True\n if (\n expanded_bbox.l > expanded_bbox.r\n or expanded_bbox.t > expanded_bbox.b\n ):\n good_bbox = False\n\n if good_bbox:\n cropped_image = conv_res.document.pages[\n page_ix\n ].image.pil_image.crop(expanded_bbox.as_tuple())\n\n is_empty, rem_frac, debug = (\n is_empty_fast_with_lines_pil(cropped_image)\n )\n if is_empty:\n if SHOW_EMPTY_CROPS:\n try:\n cropped_image.show()\n except Exception as e:\n print(f\"Error with image: {e}\")\n print(\n f\"Detected empty table cell image crop: {rem_frac} - {debug}\"\n )\n else:\n if SHOW_NONEMPTY_CROPS:\n cropped_image.show()\n result.append(\n PostOcrEnrichmentElement(\n item=cell, image=[cropped_image]\n )\n )\n return result\n else:\n multiple_crops = []\n # Crop the image form the page\n for element_prov in element.prov:\n # Iterate over provenances\n bbox = element_prov.bbox\n\n page_ix = element_prov.page_no\n bbox = bbox.scale_to_size(\n old_size=conv_res.document.pages[page_ix].size,\n new_size=conv_res.document.pages[page_ix].image.size,\n )\n expanded_bbox = bbox.expand_by_scale(\n x_scale=self.expansion_factor, y_scale=self.expansion_factor\n ).to_top_left_origin(\n page_height=conv_res.document.pages[page_ix].image.size.height\n )\n\n 
good_bbox = True\n if (\n expanded_bbox.l > expanded_bbox.r\n or expanded_bbox.t > expanded_bbox.b\n ):\n good_bbox = False\n\n if hasattr(element, \"text\"):\n if good_bbox:\n cropped_image = conv_res.document.pages[\n page_ix\n ].image.pil_image.crop(expanded_bbox.as_tuple())\n\n is_empty, rem_frac, debug = is_empty_fast_with_lines_pil(\n cropped_image\n )\n if is_empty:\n if SHOW_EMPTY_CROPS:\n try:\n cropped_image.show()\n except Exception as e:\n print(f\"Error with image: {e}\")\n print(f\"Detected empty text crop: {rem_frac} - {debug}\")\n else:\n multiple_crops.append(cropped_image)\n if hasattr(element, \"text\"):\n print(f\"\\nOLD TEXT: {element.text}\")\n else:\n print(\"Not a text element\")\n if len(multiple_crops) > 0:\n # good crops\n return [PostOcrEnrichmentElement(item=element, image=multiple_crops)]\n else:\n # nothing\n return []\n\n @classmethod\n def get_options_type(cls) -> type[PictureDescriptionApiOptions]:\n return PictureDescriptionApiOptions\n\n def __init__(\n self,\n *,\n enabled: bool,\n enable_remote_services: bool,\n artifacts_path: Optional[Union[Path, str]],\n options: PictureDescriptionApiOptions,\n accelerator_options: AcceleratorOptions,\n ):\n self.enabled = enabled\n self.options = options\n self.concurrency = 2\n self.expansion_factor = 0.05\n self.table_cell_expansion_factor = 0.0 # do not modify table cell size\n self.elements_batch_size = 4\n self._accelerator_options = accelerator_options\n self._artifacts_path = (\n Path(artifacts_path) if isinstance(artifacts_path, str) else artifacts_path\n )\n\n if self.enabled and not enable_remote_services:\n raise OperationNotAllowed(\n \"Enable remote services by setting pipeline_options.enable_remote_services=True.\"\n )\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return self.enabled\n\n def _annotate_images(self, images: Iterable[Image.Image]) -> Iterable[str]:\n def _api_request(image: Image.Image) -> str:\n res = api_image_request(\n image=image,\n prompt=self.options.prompt,\n url=self.options.url,\n # timeout=self.options.timeout,\n timeout=30,\n headers=self.options.headers,\n **self.options.params,\n )\n return res[0]\n\n with ThreadPoolExecutor(max_workers=self.concurrency) as executor:\n yield from executor.map(_api_request, images)\n\n def __call__(\n self,\n doc: DoclingDocument,\n element_batch: Iterable[ItemAndImageEnrichmentElement],\n ) -> Iterable[NodeItem]:\n if not self.enabled:\n for element in element_batch:\n yield element.item\n return\n\n elements: list[TextItem] = []\n images: list[Image.Image] = []\n img_ind_per_element: list[int] = []\n\n for element_stack in element_batch:\n for element in element_stack:\n allowed = (DocItem, TableCell, RichTableCell, GraphCell)\n assert isinstance(element.item, allowed)\n for ind, img in enumerate(element.image):\n elements.append(element.item)\n images.append(img)\n # images.append(element.image)\n img_ind_per_element.append(ind)\n\n if not images:\n return\n\n outputs = list(self._annotate_images(images))\n\n for item, output, img_ind in zip(elements, outputs, img_ind_per_element):\n # Sometimes model can return html tags, which are not strictly needed in our, so it's better to clean them\n def clean_html_tags(text):\n for tag in [\n \"<table>\",\n \"<tr>\",\n \"<td>\",\n \"<strong>\",\n \"</table>\",\n \"</tr>\",\n \"</td>\",\n \"</strong>\",\n \"<th>\",\n \"</th>\",\n \"<tbody>\",\n \"<tbody>\",\n \"<thead>\",\n \"</thead>\",\n ]:\n text = text.replace(tag, \"\")\n return text\n\n output = 
clean_html_tags(output).strip()\n output = remove_break_lines(output)\n # The last measure against hallucinations\n # Detect hallucinated string...\n if output.startswith(\"The first of these\"):\n output = \"\"\n\n if no_long_repeats(output, 50):\n if VERBOSE:\n if isinstance(item, (TextItem)):\n print(f\"\\nOLD TEXT: {item.text}\")\n\n # Re-populate text\n if isinstance(item, (TextItem, GraphCell)):\n if img_ind > 0:\n # Concat texts across several provenances\n item.text += \" \" + output\n # item.orig += \" \" + output\n else:\n item.text = output\n # item.orig = output\n elif isinstance(item, (TableCell, RichTableCell)):\n item.text = output\n elif isinstance(item, PictureItem):\n pass\n else:\n raise ValueError(f\"Unknown item type: {type(item)}\")\n\n if VERBOSE:\n if isinstance(item, (TextItem)):\n print(f\"NEW TEXT: {item.text}\")\n\n # Take care of charspans for relevant types\n if isinstance(item, GraphCell):\n item.prov.charspan = (0, len(item.text))\n elif isinstance(item, TextItem):\n item.prov[0].charspan = (0, len(item.text))\n\n yield item\n class PostOcrApiEnrichmentModel( GenericEnrichmentModel[PostOcrEnrichmentElement], BaseModelWithOptions ): expansion_factor: float = 0.001 def prepare_element( self, conv_res: ConversionResult, element: NodeItem ) -> Optional[list[PostOcrEnrichmentElement]]: if not self.is_processable(doc=conv_res.document, element=element): return None allowed = (DocItem, TableItem, GraphCell) assert isinstance(element, allowed) if isinstance(element, (KeyValueItem, FormItem)): # Yield from the graphCells inside here. result = [] for c in element.graph.cells: element_prov = c.prov # Key / Value have only one provenance! bbox = element_prov.bbox page_ix = element_prov.page_no bbox = bbox.scale_to_size( old_size=conv_res.document.pages[page_ix].size, new_size=conv_res.document.pages[page_ix].image.size, ) expanded_bbox = bbox.expand_by_scale( x_scale=self.expansion_factor, y_scale=self.expansion_factor ).to_top_left_origin( page_height=conv_res.document.pages[page_ix].image.size.height ) good_bbox = True if ( expanded_bbox.l > expanded_bbox.r or expanded_bbox.t > expanded_bbox.b ): good_bbox = False if good_bbox: cropped_image = conv_res.document.pages[ page_ix ].image.pil_image.crop(expanded_bbox.as_tuple()) is_empty, rem_frac, debug = is_empty_fast_with_lines_pil( cropped_image ) if is_empty: if SHOW_EMPTY_CROPS: try: cropped_image.show() except Exception as e: print(f\"Error with image: {e}\") print( f\"Detected empty form item image crop: {rem_frac} - {debug}\" ) else: result.append( PostOcrEnrichmentElement(item=c, image=[cropped_image]) ) return result elif isinstance(element, TableItem): element_prov = element.prov[0] page_ix = element_prov.page_no result = [] for i, row in enumerate(element.data.grid): for j, cell in enumerate(row): if hasattr(cell, \"bbox\"): if cell.bbox: bbox = cell.bbox bbox = bbox.scale_to_size( old_size=conv_res.document.pages[page_ix].size, new_size=conv_res.document.pages[page_ix].image.size, ) expanded_bbox = bbox.expand_by_scale( x_scale=self.table_cell_expansion_factor, y_scale=self.table_cell_expansion_factor, ).to_top_left_origin( page_height=conv_res.document.pages[ page_ix ].image.size.height ) good_bbox = True if ( expanded_bbox.l > expanded_bbox.r or expanded_bbox.t > expanded_bbox.b ): good_bbox = False if good_bbox: cropped_image = conv_res.document.pages[ page_ix ].image.pil_image.crop(expanded_bbox.as_tuple()) is_empty, rem_frac, debug = ( is_empty_fast_with_lines_pil(cropped_image) ) if is_empty: if 
SHOW_EMPTY_CROPS: try: cropped_image.show() except Exception as e: print(f\"Error with image: {e}\") print( f\"Detected empty table cell image crop: {rem_frac} - {debug}\" ) else: if SHOW_NONEMPTY_CROPS: cropped_image.show() result.append( PostOcrEnrichmentElement( item=cell, image=[cropped_image] ) ) return result else: multiple_crops = [] # Crop the image form the page for element_prov in element.prov: # Iterate over provenances bbox = element_prov.bbox page_ix = element_prov.page_no bbox = bbox.scale_to_size( old_size=conv_res.document.pages[page_ix].size, new_size=conv_res.document.pages[page_ix].image.size, ) expanded_bbox = bbox.expand_by_scale( x_scale=self.expansion_factor, y_scale=self.expansion_factor ).to_top_left_origin( page_height=conv_res.document.pages[page_ix].image.size.height ) good_bbox = True if ( expanded_bbox.l > expanded_bbox.r or expanded_bbox.t > expanded_bbox.b ): good_bbox = False if hasattr(element, \"text\"): if good_bbox: cropped_image = conv_res.document.pages[ page_ix ].image.pil_image.crop(expanded_bbox.as_tuple()) is_empty, rem_frac, debug = is_empty_fast_with_lines_pil( cropped_image ) if is_empty: if SHOW_EMPTY_CROPS: try: cropped_image.show() except Exception as e: print(f\"Error with image: {e}\") print(f\"Detected empty text crop: {rem_frac} - {debug}\") else: multiple_crops.append(cropped_image) if hasattr(element, \"text\"): print(f\"\\nOLD TEXT: {element.text}\") else: print(\"Not a text element\") if len(multiple_crops) > 0: # good crops return [PostOcrEnrichmentElement(item=element, image=multiple_crops)] else: # nothing return [] @classmethod def get_options_type(cls) -> type[PictureDescriptionApiOptions]: return PictureDescriptionApiOptions def __init__( self, *, enabled: bool, enable_remote_services: bool, artifacts_path: Optional[Union[Path, str]], options: PictureDescriptionApiOptions, accelerator_options: AcceleratorOptions, ): self.enabled = enabled self.options = options self.concurrency = 2 self.expansion_factor = 0.05 self.table_cell_expansion_factor = 0.0 # do not modify table cell size self.elements_batch_size = 4 self._accelerator_options = accelerator_options self._artifacts_path = ( Path(artifacts_path) if isinstance(artifacts_path, str) else artifacts_path ) if self.enabled and not enable_remote_services: raise OperationNotAllowed( \"Enable remote services by setting pipeline_options.enable_remote_services=True.\" ) def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return self.enabled def _annotate_images(self, images: Iterable[Image.Image]) -> Iterable[str]: def _api_request(image: Image.Image) -> str: res = api_image_request( image=image, prompt=self.options.prompt, url=self.options.url, # timeout=self.options.timeout, timeout=30, headers=self.options.headers, **self.options.params, ) return res[0] with ThreadPoolExecutor(max_workers=self.concurrency) as executor: yield from executor.map(_api_request, images) def __call__( self, doc: DoclingDocument, element_batch: Iterable[ItemAndImageEnrichmentElement], ) -> Iterable[NodeItem]: if not self.enabled: for element in element_batch: yield element.item return elements: list[TextItem] = [] images: list[Image.Image] = [] img_ind_per_element: list[int] = [] for element_stack in element_batch: for element in element_stack: allowed = (DocItem, TableCell, RichTableCell, GraphCell) assert isinstance(element.item, allowed) for ind, img in enumerate(element.image): elements.append(element.item) images.append(img) # images.append(element.image) 
img_ind_per_element.append(ind) if not images: return outputs = list(self._annotate_images(images)) for item, output, img_ind in zip(elements, outputs, img_ind_per_element): # Sometimes model can return html tags, which are not strictly needed in our, so it's better to clean them def clean_html_tags(text): for tag in [ \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", ]: text = text.replace(tag, \"\") return text output = clean_html_tags(output).strip() output = remove_break_lines(output) # The last measure against hallucinations # Detect hallucinated string... if output.startswith(\"The first of these\"): output = \"\" if no_long_repeats(output, 50): if VERBOSE: if isinstance(item, (TextItem)): print(f\"\\nOLD TEXT: {item.text}\") # Re-populate text if isinstance(item, (TextItem, GraphCell)): if img_ind > 0: # Concat texts across several provenances item.text += \" \" + output # item.orig += \" \" + output else: item.text = output # item.orig = output elif isinstance(item, (TableCell, RichTableCell)): item.text = output elif isinstance(item, PictureItem): pass else: raise ValueError(f\"Unknown item type: {type(item)}\") if VERBOSE: if isinstance(item, (TextItem)): print(f\"NEW TEXT: {item.text}\") # Take care of charspans for relevant types if isinstance(item, GraphCell): item.prov.charspan = (0, len(item.text)) elif isinstance(item, TextItem): item.prov[0].charspan = (0, len(item.text)) yield item In\u00a0[\u00a0]: Copied! def convert_pdf(pdf_path: Path, out_intermediate_json: Path):\n # Let's prepare a Docling document json with embedded page images\n pipeline_options = PdfPipelineOptions()\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n # pipeline_options.images_scale = 4.0\n pipeline_options.images_scale = 2.0\n\n doc_converter = (\n DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[InputFormat.PDF],\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=StandardPdfPipeline, pipeline_options=pipeline_options\n )\n },\n )\n )\n\n if VERBOSE:\n print(\n \"Converting PDF to get a Docling document json with embedded page images...\"\n )\n conv_result = doc_converter.convert(pdf_path)\n conv_result.document.save_as_json(\n filename=out_intermediate_json, image_mode=ImageRefMode.EMBEDDED\n )\n if PRINT_RESULT_MARKDOWN:\n md1 = conv_result.document.export_to_markdown()\n print(\"*** ORIGINAL MARKDOWN ***\")\n print(md1)\n def convert_pdf(pdf_path: Path, out_intermediate_json: Path): # Let's prepare a Docling document json with embedded page images pipeline_options = PdfPipelineOptions() pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True # pipeline_options.images_scale = 4.0 pipeline_options.images_scale = 2.0 doc_converter = ( DocumentConverter( # all of the below is optional, has internal defaults. allowed_formats=[InputFormat.PDF], format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=StandardPdfPipeline, pipeline_options=pipeline_options ) }, ) ) if VERBOSE: print( \"Converting PDF to get a Docling document json with embedded page images...\" ) conv_result = doc_converter.convert(pdf_path) conv_result.document.save_as_json( filename=out_intermediate_json, image_mode=ImageRefMode.EMBEDDED ) if PRINT_RESULT_MARKDOWN: md1 = conv_result.document.export_to_markdown() print(\"*** ORIGINAL MARKDOWN ***\") print(md1) In\u00a0[\u00a0]: Copied! 
def post_process_json(in_json: Path, out_final_json: Path):\n # Post-Process OCR on top of existing Docling document, per element's bounding box:\n print(f\"Post-process all bounding boxes with OCR... {os.path.basename(in_json)}\")\n pipeline_options = PostOcrEnrichmentPipelineOptions(\n api_options=PictureDescriptionApiOptions(\n url=LM_STUDIO_URL,\n prompt=DEFAULT_PROMPT,\n provenance=\"lm-studio-ocr\",\n batch_size=4,\n concurrency=2,\n scale=2.0,\n params={\"model\": LM_STUDIO_MODEL},\n )\n )\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.JSON_DOCLING: FormatOption(\n pipeline_cls=PostOcrEnrichmentPipeline,\n pipeline_options=pipeline_options,\n backend=DoclingJSONBackend,\n )\n }\n )\n result = doc_converter.convert(in_json)\n if SHOW_IMAGE:\n result.document.pages[1].image.pil_image.show()\n result.document.save_as_json(out_final_json)\n if PRINT_RESULT_MARKDOWN:\n md = result.document.export_to_markdown()\n print(\"*** MARKDOWN ***\")\n print(md)\n def post_process_json(in_json: Path, out_final_json: Path): # Post-Process OCR on top of existing Docling document, per element's bounding box: print(f\"Post-process all bounding boxes with OCR... {os.path.basename(in_json)}\") pipeline_options = PostOcrEnrichmentPipelineOptions( api_options=PictureDescriptionApiOptions( url=LM_STUDIO_URL, prompt=DEFAULT_PROMPT, provenance=\"lm-studio-ocr\", batch_size=4, concurrency=2, scale=2.0, params={\"model\": LM_STUDIO_MODEL}, ) ) doc_converter = DocumentConverter( format_options={ InputFormat.JSON_DOCLING: FormatOption( pipeline_cls=PostOcrEnrichmentPipeline, pipeline_options=pipeline_options, backend=DoclingJSONBackend, ) } ) result = doc_converter.convert(in_json) if SHOW_IMAGE: result.document.pages[1].image.pil_image.show() result.document.save_as_json(out_final_json) if PRINT_RESULT_MARKDOWN: md = result.document.export_to_markdown() print(\"*** MARKDOWN ***\") print(md) In\u00a0[\u00a0]: Copied! def process_pdf(pdf_path: Path, scratch_dir: Path, out_dir: Path):\n inter_json = scratch_dir / (pdf_path.stem + \".json\")\n final_json = out_dir / (pdf_path.stem + \".json\")\n inter_json.parent.mkdir(parents=True, exist_ok=True)\n final_json.parent.mkdir(parents=True, exist_ok=True)\n if final_json.exists() and final_json.stat().st_size > 0:\n print(f\"Result already found here: '{final_json}', aborting...\")\n return # already done\n convert_pdf(pdf_path, inter_json)\n post_process_json(inter_json, final_json)\n def process_pdf(pdf_path: Path, scratch_dir: Path, out_dir: Path): inter_json = scratch_dir / (pdf_path.stem + \".json\") final_json = out_dir / (pdf_path.stem + \".json\") inter_json.parent.mkdir(parents=True, exist_ok=True) final_json.parent.mkdir(parents=True, exist_ok=True) if final_json.exists() and final_json.stat().st_size > 0: print(f\"Result already found here: '{final_json}', aborting...\") return # already done convert_pdf(pdf_path, inter_json) post_process_json(inter_json, final_json) In\u00a0[\u00a0]: Copied! 
def process_json(json_path: Path, out_dir: Path):\n final_json = out_dir / (json_path.stem + \".json\")\n final_json.parent.mkdir(parents=True, exist_ok=True)\n if final_json.exists() and final_json.stat().st_size > 0:\n return # already done\n post_process_json(json_path, final_json)\ndef process_json(json_path: Path, out_dir: Path): final_json = out_dir / (json_path.stem + \".json\") final_json.parent.mkdir(parents=True, exist_ok=True) if final_json.exists() and final_json.stat().st_size > 0: return # already done post_process_json(json_path, final_json) In\u00a0[\u00a0]: Copied!
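For reference, a minimal usage sketch of the helpers defined above, run directly instead of via the CLI in main() further below; the sample path is the notebook's default and can be replaced with your own files:

from pathlib import Path

# Stage 1: convert the PDF into an intermediate Docling JSON with embedded page images;
# Stage 2: post-process every element's bounding box with OCR via the enrichment pipeline.
process_pdf(
    pdf_path=Path("tests/data/pdf/2305.03393v1-pg9.pdf"),
    scratch_dir=Path("scratch") / "temp",
    out_dir=Path("scratch"),
)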
def filter_jsons_by_ocr_list(jsons, folder):\n \"\"\"\n jsons: list[Path] - JSON files\n folder: Path - folder containing ocr_documents.txt\n \"\"\"\n ocr_file = folder / \"ocr_documents.txt\"\n\n # If the file doesn't exist, return the list unchanged\n if not ocr_file.exists():\n return jsons\n\n # Read file names (strip whitespace, ignore empty lines)\n with ocr_file.open(\"r\", encoding=\"utf-8\") as f:\n allowed = {line.strip() for line in f if line.strip()}\n\n # Keep only JSONs whose stem is in allowed list\n filtered = [p for p in jsons if p.stem in allowed]\n return filtered\n def filter_jsons_by_ocr_list(jsons, folder): \"\"\" jsons: list[Path] - JSON files folder: Path - folder containing ocr_documents.txt \"\"\" ocr_file = folder / \"ocr_documents.txt\" # If the file doesn't exist, return the list unchanged if not ocr_file.exists(): return jsons # Read file names (strip whitespace, ignore empty lines) with ocr_file.open(\"r\", encoding=\"utf-8\") as f: allowed = {line.strip() for line in f if line.strip()} # Keep only JSONs whose stem is in allowed list filtered = [p for p in jsons if p.stem in allowed] return filtered In\u00a0[\u00a0]: Copied! def run_jsons(in_path: Path, out_dir: Path):\n if in_path.is_dir():\n jsons = sorted(in_path.glob(\"*.json\"))\n if not jsons:\n raise SystemExit(\"Folder mode expects one or more .json files\")\n # Look for ocr_documents.txt, in case found, respect only the jsons\n filtered_jsons = filter_jsons_by_ocr_list(jsons, in_path)\n for j in tqdm(filtered_jsons):\n print(\"\")\n print(\"Processing file...\")\n print(j)\n process_json(j, out_dir)\n else:\n raise SystemExit(\"Invalid --in path\")\ndef run_jsons(in_path: Path, out_dir: Path): if in_path.is_dir(): jsons = sorted(in_path.glob(\"*.json\")) if not jsons: raise SystemExit(\"Folder mode expects one or more .json files\") # Look for ocr_documents.txt, in case found, respect only the jsons filtered_jsons = filter_jsons_by_ocr_list(jsons, in_path) for j in tqdm(filtered_jsons): print(\"\") print(\"Processing file...\") print(j) process_json(j, out_dir) else: raise SystemExit(\"Invalid --in path\") In\u00a0[\u00a0]: Copied!
def main():\n logging.getLogger().setLevel(logging.ERROR)\n p = argparse.ArgumentParser(description=\"PDF/JSON -> final JSON pipeline\")\n p.add_argument(\n \"--in\",\n dest=\"in_path\",\n default=\"tests/data/pdf/2305.03393v1-pg9.pdf\",\n required=False,\n help=\"Path to a PDF/JSON file or a folder of JSONs\",\n )\n p.add_argument(\n \"--out\",\n dest=\"out_dir\",\n default=\"scratch/\",\n required=False,\n help=\"Folder for final JSONs (scratch goes inside)\",\n )\n args = p.parse_args()\n\n in_path = Path(args.in_path).expanduser().resolve()\n out_dir = Path(args.out_dir).expanduser().resolve()\n print(f\"in_path: {in_path}\")\n print(f\"out_dir: {out_dir}\")\n scratch_dir = out_dir / \"temp\"\n\n if not in_path.exists():\n raise SystemExit(f\"Input not found: {in_path}\")\n\n if in_path.is_file():\n if in_path.suffix.lower() == \".pdf\":\n process_pdf(in_path, scratch_dir, out_dir)\n elif in_path.suffix.lower() == \".json\":\n process_json(in_path, out_dir)\n else:\n raise SystemExit(\"Single-file mode expects a .pdf or .json\")\n else:\n run_jsons(in_path, out_dir)\n def main(): logging.getLogger().setLevel(logging.ERROR) p = argparse.ArgumentParser(description=\"PDF/JSON -> final JSON pipeline\") p.add_argument( \"--in\", dest=\"in_path\", default=\"tests/data/pdf/2305.03393v1-pg9.pdf\", required=False, help=\"Path to a PDF/JSON file or a folder of JSONs\", ) p.add_argument( \"--out\", dest=\"out_dir\", default=\"scratch/\", required=False, help=\"Folder for final JSONs (scratch goes inside)\", ) args = p.parse_args() in_path = Path(args.in_path).expanduser().resolve() out_dir = Path(args.out_dir).expanduser().resolve() print(f\"in_path: {in_path}\") print(f\"out_dir: {out_dir}\") scratch_dir = out_dir / \"temp\" if not in_path.exists(): raise SystemExit(f\"Input not found: {in_path}\") if in_path.is_file(): if in_path.suffix.lower() == \".pdf\": process_pdf(in_path, scratch_dir, out_dir) elif in_path.suffix.lower() == \".json\": process_json(in_path, out_dir) else: raise SystemExit(\"Single-file mode expects a .pdf or .json\") else: run_jsons(in_path, out_dir) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/rag_azuresearch/","title":"RAG with Azure AI Search","text":"Step Tech Execution Embedding Azure OpenAI \ud83c\udf10 Remote Vector Store Azure AI Search \ud83c\udf10 Remote Gen AI Azure OpenAI \ud83c\udf10 Remote In\u00a0[\u00a0]: Copied!
# If running in a fresh environment (like Google Colab), run this single command:\n%pip install \"docling~=2.12\" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv\n# If running in a fresh environment (like Google Colab), run this single command: %pip install \"docling~=2.12\" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv In\u00a0[1]: Copied!
import os\n\nfrom dotenv import load_dotenv\n\nload_dotenv()\n\n\ndef _get_env(key, default=None):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key, default)\n\n\nAZURE_SEARCH_ENDPOINT = _get_env(\"AZURE_SEARCH_ENDPOINT\")\nAZURE_SEARCH_KEY = _get_env(\"AZURE_SEARCH_KEY\") # Ensure this is your Admin Key\nAZURE_SEARCH_INDEX_NAME = _get_env(\"AZURE_SEARCH_INDEX_NAME\", \"docling-rag-sample\")\nAZURE_OPENAI_ENDPOINT = _get_env(\"AZURE_OPENAI_ENDPOINT\")\nAZURE_OPENAI_API_KEY = _get_env(\"AZURE_OPENAI_API_KEY\")\nAZURE_OPENAI_API_VERSION = _get_env(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\")\nAZURE_OPENAI_CHAT_MODEL = _get_env(\n \"AZURE_OPENAI_CHAT_MODEL\"\n) # Using a deployed model named \"gpt-4o\"\nAZURE_OPENAI_EMBEDDINGS = _get_env(\n \"AZURE_OPENAI_EMBEDDINGS\", \"text-embedding-3-small\"\n) # Using a deployed model named \"text-embeddings-3-small\"\nimport os from dotenv import load_dotenv load_dotenv() def _get_env(key, default=None): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key, default) AZURE_SEARCH_ENDPOINT = _get_env(\"AZURE_SEARCH_ENDPOINT\") AZURE_SEARCH_KEY = _get_env(\"AZURE_SEARCH_KEY\") # Ensure this is your Admin Key AZURE_SEARCH_INDEX_NAME = _get_env(\"AZURE_SEARCH_INDEX_NAME\", \"docling-rag-sample\") AZURE_OPENAI_ENDPOINT = _get_env(\"AZURE_OPENAI_ENDPOINT\") AZURE_OPENAI_API_KEY = _get_env(\"AZURE_OPENAI_API_KEY\") AZURE_OPENAI_API_VERSION = _get_env(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\") AZURE_OPENAI_CHAT_MODEL = _get_env( \"AZURE_OPENAI_CHAT_MODEL\" ) # Using a deployed model named \"gpt-4o\" AZURE_OPENAI_EMBEDDINGS = _get_env( \"AZURE_OPENAI_EMBEDDINGS\", \"text-embedding-3-small\" ) # Using a deployed model named \"text-embeddings-3-small\" In\u00a0[11]: Copied!
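As an optional sanity check, a small sketch (not part of the original notebook) that fails early if any of the settings above are missing:

required = {
    "AZURE_SEARCH_ENDPOINT": AZURE_SEARCH_ENDPOINT,
    "AZURE_SEARCH_KEY": AZURE_SEARCH_KEY,
    "AZURE_OPENAI_ENDPOINT": AZURE_OPENAI_ENDPOINT,
    "AZURE_OPENAI_API_KEY": AZURE_OPENAI_API_KEY,
    "AZURE_OPENAI_CHAT_MODEL": AZURE_OPENAI_CHAT_MODEL,
}
# Report every unset value at once rather than failing later, one call at a time
missing = [name for name, value in required.items() if not value]
if missing:
    raise RuntimeError(f"Missing required settings: {', '.join(missing)}")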
from rich.console import Console\nfrom rich.panel import Panel\n\nfrom docling.document_converter import DocumentConverter\n\nconsole = Console()\n\n# This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages\nsource_url = \"https://arxiv.org/pdf/2404.16130\"\n\nconsole.print(\n \"[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]\"\n)\nconverter = DocumentConverter()\nresult = converter.convert(source_url)\n\n# Optional: preview the parsed Markdown\nmd_preview = result.document.export_to_markdown()\nconsole.print(Panel(md_preview[:500] + \"...\", title=\"Docling Markdown Preview\"))\nfrom rich.console import Console from rich.panel import Panel from docling.document_converter import DocumentConverter console = Console() # This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages source_url = \"https://arxiv.org/pdf/2404.16130\" console.print( \"[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]\" ) converter = DocumentConverter() result = converter.convert(source_url) # Optional: preview the parsed Markdown md_preview = result.document.export_to_markdown() console.print(Panel(md_preview[:500] + \"...\", title=\"Docling Markdown Preview\"))
Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Docling Markdown Preview \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 ## From Local to Global: A Graph RAG Approach to Query-Focused Summarization \u2502\n\u2502 \u2502\n\u2502 Darren Edge 1\u2020 \u2502\n\u2502 \u2502\n\u2502 Ha Trinh 1\u2020 \u2502\n\u2502 \u2502\n\u2502 Newman Cheng 2 \u2502\n\u2502 \u2502\n\u2502 Joshua Bradley 2 \u2502\n\u2502 \u2502\n\u2502 Alex Chao 3 \u2502\n\u2502 \u2502\n\u2502 Apurva Mody 3 \u2502\n\u2502 \u2502\n\u2502 Steven Truitt 2 \u2502\n\u2502 \u2502\n\u2502 ## Jonathan Larson 1 \u2502\n\u2502 \u2502\n\u2502 1 Microsoft Research 2 Microsoft Strategic Missions and Technologies 3 Microsoft Office of the CTO \u2502\n\u2502 \u2502\n\u2502 { daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso } @microsoft.com \u2502\n\u2502 \u2502\n\u2502 \u2020 These authors contributed equally to this work \u2502\n\u2502 \u2502\n\u2502 ## Abstract \u2502\n\u2502 \u2502\n\u2502 The use of retrieval-augmented gen... \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n In\u00a0[22]: Copied! from docling.chunking import HierarchicalChunker\n\nchunker = HierarchicalChunker()\ndoc_chunks = list(chunker.chunk(result.document))\n\nall_chunks = []\nfor idx, c in enumerate(doc_chunks):\n chunk_text = c.text\n all_chunks.append((f\"chunk_{idx}\", chunk_text))\n\nconsole.print(f\"Total chunks from PDF: {len(all_chunks)}\")\n from docling.chunking import HierarchicalChunker chunker = HierarchicalChunker() doc_chunks = list(chunker.chunk(result.document)) all_chunks = [] for idx, c in enumerate(doc_chunks): chunk_text = c.text all_chunks.append((f\"chunk_{idx}\", chunk_text)) console.print(f\"Total chunks from PDF: {len(all_chunks)}\") Total chunks from PDF: 106\nIn\u00a0[\u00a0]: Copied!
from azure.core.credentials import AzureKeyCredential\nfrom azure.search.documents.indexes import SearchIndexClient\nfrom azure.search.documents.indexes.models import (\n AzureOpenAIVectorizer,\n AzureOpenAIVectorizerParameters,\n HnswAlgorithmConfiguration,\n SearchableField,\n SearchField,\n SearchFieldDataType,\n SearchIndex,\n SimpleField,\n VectorSearch,\n VectorSearchProfile,\n)\nfrom rich.console import Console\n\nconsole = Console()\n\nVECTOR_DIM = 1536 # Adjust based on your chosen embeddings model\n\nindex_client = SearchIndexClient(\n AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY)\n)\n\n\ndef create_search_index(index_name: str):\n # Define fields\n fields = [\n SimpleField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True),\n SearchableField(name=\"content\", type=SearchFieldDataType.String),\n SearchField(\n name=\"content_vector\",\n type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n searchable=True,\n filterable=False,\n sortable=False,\n facetable=False,\n vector_search_dimensions=VECTOR_DIM,\n vector_search_profile_name=\"default\",\n ),\n ]\n # Vector search config with an AzureOpenAIVectorizer\n vector_search = VectorSearch(\n algorithms=[HnswAlgorithmConfiguration(name=\"default\")],\n profiles=[\n VectorSearchProfile(\n name=\"default\",\n algorithm_configuration_name=\"default\",\n vectorizer_name=\"default\",\n )\n ],\n vectorizers=[\n AzureOpenAIVectorizer(\n vectorizer_name=\"default\",\n parameters=AzureOpenAIVectorizerParameters(\n resource_url=AZURE_OPENAI_ENDPOINT,\n deployment_name=AZURE_OPENAI_EMBEDDINGS,\n model_name=\"text-embedding-3-small\",\n api_key=AZURE_OPENAI_API_KEY,\n ),\n )\n ],\n )\n\n # Create or update the index\n new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)\n try:\n index_client.delete_index(index_name)\n except Exception:\n pass\n\n index_client.create_or_update_index(new_index)\n console.print(f\"Index '{index_name}' created.\")\n\n\ncreate_search_index(AZURE_SEARCH_INDEX_NAME)\n from azure.core.credentials import AzureKeyCredential from azure.search.documents.indexes import SearchIndexClient from azure.search.documents.indexes.models import ( AzureOpenAIVectorizer, AzureOpenAIVectorizerParameters, HnswAlgorithmConfiguration, SearchableField, SearchField, SearchFieldDataType, SearchIndex, SimpleField, VectorSearch, VectorSearchProfile, ) from rich.console import Console console = Console() VECTOR_DIM = 1536 # Adjust based on your chosen embeddings model index_client = SearchIndexClient( AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY) ) def create_search_index(index_name: str): # Define fields fields = [ SimpleField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True), SearchableField(name=\"content\", type=SearchFieldDataType.String), SearchField( name=\"content_vector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), searchable=True, filterable=False, sortable=False, facetable=False, vector_search_dimensions=VECTOR_DIM, vector_search_profile_name=\"default\", ), ] # Vector search config with an AzureOpenAIVectorizer vector_search = VectorSearch( algorithms=[HnswAlgorithmConfiguration(name=\"default\")], profiles=[ VectorSearchProfile( name=\"default\", algorithm_configuration_name=\"default\", vectorizer_name=\"default\", ) ], vectorizers=[ AzureOpenAIVectorizer( vectorizer_name=\"default\", parameters=AzureOpenAIVectorizerParameters( resource_url=AZURE_OPENAI_ENDPOINT, deployment_name=AZURE_OPENAI_EMBEDDINGS, 
model_name=\"text-embedding-3-small\", api_key=AZURE_OPENAI_API_KEY, ), ) ], ) # Create or update the index new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search) try: index_client.delete_index(index_name) except Exception: pass index_client.create_or_update_index(new_index) console.print(f\"Index '{index_name}' created.\") create_search_index(AZURE_SEARCH_INDEX_NAME) Index 'docling-rag-sample-2' created.\nIn\u00a0[28]: Copied!
from azure.search.documents import SearchClient\nfrom openai import AzureOpenAI\n\nsearch_client = SearchClient(\n AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY)\n)\nopenai_client = AzureOpenAI(\n api_key=AZURE_OPENAI_API_KEY,\n api_version=AZURE_OPENAI_API_VERSION,\n azure_endpoint=AZURE_OPENAI_ENDPOINT,\n)\n\n\ndef embed_text(text: str):\n \"\"\"\n Helper to generate embeddings with Azure OpenAI.\n \"\"\"\n response = openai_client.embeddings.create(\n input=text, model=AZURE_OPENAI_EMBEDDINGS\n )\n return response.data[0].embedding\n\n\nupload_docs = []\nfor chunk_id, chunk_text in all_chunks:\n embedding_vector = embed_text(chunk_text)\n upload_docs.append(\n {\n \"chunk_id\": chunk_id,\n \"content\": chunk_text,\n \"content_vector\": embedding_vector,\n }\n )\n\n\nBATCH_SIZE = 50\nfor i in range(0, len(upload_docs), BATCH_SIZE):\n subset = upload_docs[i : i + BATCH_SIZE]\n resp = search_client.upload_documents(documents=subset)\n\n all_succeeded = all(r.succeeded for r in resp)\n console.print(\n f\"Uploaded batch {i} -> {i + len(subset)}; all_succeeded: {all_succeeded}, \"\n f\"first_doc_status_code: {resp[0].status_code}\"\n )\n\nconsole.print(\"All chunks uploaded to Azure Search.\")\n from azure.search.documents import SearchClient from openai import AzureOpenAI search_client = SearchClient( AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY) ) openai_client = AzureOpenAI( api_key=AZURE_OPENAI_API_KEY, api_version=AZURE_OPENAI_API_VERSION, azure_endpoint=AZURE_OPENAI_ENDPOINT, ) def embed_text(text: str): \"\"\" Helper to generate embeddings with Azure OpenAI. \"\"\" response = openai_client.embeddings.create( input=text, model=AZURE_OPENAI_EMBEDDINGS ) return response.data[0].embedding upload_docs = [] for chunk_id, chunk_text in all_chunks: embedding_vector = embed_text(chunk_text) upload_docs.append( { \"chunk_id\": chunk_id, \"content\": chunk_text, \"content_vector\": embedding_vector, } ) BATCH_SIZE = 50 for i in range(0, len(upload_docs), BATCH_SIZE): subset = upload_docs[i : i + BATCH_SIZE] resp = search_client.upload_documents(documents=subset) all_succeeded = all(r.succeeded for r in resp) console.print( f\"Uploaded batch {i} -> {i + len(subset)}; all_succeeded: {all_succeeded}, \" f\"first_doc_status_code: {resp[0].status_code}\" ) console.print(\"All chunks uploaded to Azure Search.\") Uploaded batch 0 -> 50; all_succeeded: True, first_doc_status_code: 201\n
Uploaded batch 50 -> 100; all_succeeded: True, first_doc_status_code: 201\n
Uploaded batch 100 -> 106; all_succeeded: True, first_doc_status_code: 201\n
All chunks uploaded to Azure Search.\nIn\u00a0[29]: Copied!
from typing import Optional\n\nfrom azure.search.documents.models import VectorizableTextQuery\n\n\ndef generate_chat_response(prompt: str, system_message: Optional[str] = None):\n \"\"\"\n Generates a single-turn chat response using Azure OpenAI Chat.\n If you need multi-turn conversation or follow-up queries, you'll have to\n maintain the messages list externally.\n \"\"\"\n messages = []\n if system_message:\n messages.append({\"role\": \"system\", \"content\": system_message})\n messages.append({\"role\": \"user\", \"content\": prompt})\n\n completion = openai_client.chat.completions.create(\n model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7\n )\n return completion.choices[0].message.content\n\n\nuser_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\"\nuser_embed = embed_text(user_query)\n\nvector_query = VectorizableTextQuery(\n text=user_query, # passing in text for a hybrid search\n k_nearest_neighbors=5,\n fields=\"content_vector\",\n)\n\nsearch_results = search_client.search(\n search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10\n)\n\nretrieved_chunks = []\nfor result in search_results:\n snippet = result[\"content\"]\n retrieved_chunks.append(snippet)\n\ncontext_str = \"\\n---\\n\".join(retrieved_chunks)\nrag_prompt = f\"\"\"\nYou are an AI assistant helping answering questions about Microsoft GraphRAG.\nUse ONLY the text below to answer the user's question.\nIf the answer isn't in the text, say you don't know.\n\nContext:\n{context_str}\n\nQuestion: {user_query}\nAnswer:\n\"\"\"\n\nfinal_answer = generate_chat_response(rag_prompt)\n\nconsole.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\"))\nconsole.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\"))\n from typing import Optional from azure.search.documents.models import VectorizableTextQuery def generate_chat_response(prompt: str, system_message: Optional[str] = None): \"\"\" Generates a single-turn chat response using Azure OpenAI Chat. If you need multi-turn conversation or follow-up queries, you'll have to maintain the messages list externally. \"\"\" messages = [] if system_message: messages.append({\"role\": \"system\", \"content\": system_message}) messages.append({\"role\": \"user\", \"content\": prompt}) completion = openai_client.chat.completions.create( model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7 ) return completion.choices[0].message.content user_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\" user_embed = embed_text(user_query) vector_query = VectorizableTextQuery( text=user_query, # passing in text for a hybrid search k_nearest_neighbors=5, fields=\"content_vector\", ) search_results = search_client.search( search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10 ) retrieved_chunks = [] for result in search_results: snippet = result[\"content\"] retrieved_chunks.append(snippet) context_str = \"\\n---\\n\".join(retrieved_chunks) rag_prompt = f\"\"\" You are an AI assistant helping answering questions about Microsoft GraphRAG. Use ONLY the text below to answer the user's question. If the answer isn't in the text, say you don't know. 
Context: {context_str} Question: {user_query} Answer: \"\"\" final_answer = generate_chat_response(rag_prompt) console.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\")) console.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\")) \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 RAG Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 \u2502\n\u2502 You are an AI assistant helping answering questions about Microsoft GraphRAG. \u2502\n\u2502 Use ONLY the text below to answer the user's question. \u2502\n\u2502 If the answer isn't in the text, say you don't know. \u2502\n\u2502 \u2502\n\u2502 Context: \u2502\n\u2502 Community summaries vs. source texts. When comparing community summaries to source texts using Graph RAG, \u2502\n\u2502 community summaries generally provided a small but consistent improvement in answer comprehensiveness and \u2502\n\u2502 diversity, except for root-level summaries. Intermediate-level summaries in the Podcast dataset and low-level \u2502\n\u2502 community summaries in the News dataset achieved comprehensiveness win rates of 57% and 64%, respectively. \u2502\n\u2502 Diversity win rates were 57% for Podcast intermediate-level summaries and 60% for News low-level community \u2502\n\u2502 summaries. Table 3 also illustrates the scalability advantages of Graph RAG compared to source text \u2502\n\u2502 summarization: for low-level community summaries ( C3 ), Graph RAG required 26-33% fewer context tokens, while \u2502\n\u2502 for root-level community summaries ( C0 ), it required over 97% fewer tokens. For a modest drop in performance \u2502\n\u2502 compared with other global methods, root-level Graph RAG offers a highly efficient method for the iterative \u2502\n\u2502 question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness \u2502\n\u2502 (72% win rate) and diversity (62% win rate) over na\u00a8\u0131ve RAG. \u2502\n\u2502 --- \u2502\n\u2502 We have presented a global approach to Graph RAG, combining knowledge graph generation, retrieval-augmented \u2502\n\u2502 generation (RAG), and query-focused summarization (QFS) to support human sensemaking over entire text corpora. \u2502\n\u2502 Initial evaluations show substantial improvements over a na\u00a8\u0131ve RAG baseline for both the comprehensiveness and \u2502\n\u2502 diversity of answers, as well as favorable comparisons to a global but graph-free approach using map-reduce \u2502\n\u2502 source text summarization. For situations requiring many global queries over the same dataset, summaries of \u2502\n\u2502 root-level communities in the entity-based graph index provide a data index that is both superior to na\u00a8\u0131ve RAG \u2502\n\u2502 and achieves competitive performance to other global methods at a fraction of the token cost. \u2502\n\u2502 --- \u2502\n\u2502 Trade-offs of building a graph index . 
We consistently observed Graph RAG achieve the best headto-head results \u2502\n\u2502 against other methods, but in many cases the graph-free approach to global summarization of source texts \u2502\n\u2502 performed competitively. The real-world decision about whether to invest in building a graph index depends on \u2502\n\u2502 multiple factors, including the compute budget, expected number of lifetime queries per dataset, and value \u2502\n\u2502 obtained from other aspects of the graph index (including the generic community summaries and the use of other \u2502\n\u2502 graph-related RAG approaches). \u2502\n\u2502 --- \u2502\n\u2502 Future work . The graph index, rich text annotations, and hierarchical community structure supporting the \u2502\n\u2502 current Graph RAG approach offer many possibilities for refinement and adaptation. This includes RAG approaches \u2502\n\u2502 that operate in a more local manner, via embedding-based matching of user queries and graph annotations, as \u2502\n\u2502 well as the possibility of hybrid RAG schemes that combine embedding-based matching against community reports \u2502\n\u2502 before employing our map-reduce summarization mechanisms. This 'roll-up' operation could also be extended \u2502\n\u2502 across more levels of the community hierarchy, as well as implemented as a more exploratory 'drill down' \u2502\n\u2502 mechanism that follows the information scent contained in higher-level community summaries. \u2502\n\u2502 --- \u2502\n\u2502 Advanced RAG systems include pre-retrieval, retrieval, post-retrieval strategies designed to overcome the \u2502\n\u2502 drawbacks of Na\u00a8\u0131ve RAG, while Modular RAG systems include patterns for iterative and dynamic cycles of \u2502\n\u2502 interleaved retrieval and generation (Gao et al., 2023). Our implementation of Graph RAG incorporates multiple \u2502\n\u2502 concepts related to other systems. For example, our community summaries are a kind of self-memory (Selfmem, \u2502\n\u2502 Cheng et al., 2024) for generation-augmented retrieval (GAR, Mao et al., 2020) that facilitates future \u2502\n\u2502 generation cycles, while our parallel generation of community answers from these summaries is a kind of \u2502\n\u2502 iterative (Iter-RetGen, Shao et al., 2023) or federated (FeB4RAG, Wang et al., 2024) retrieval-generation \u2502\n\u2502 strategy. Other systems have also combined these concepts for multi-document summarization (CAiRE-COVID, Su et \u2502\n\u2502 al., 2020) and multi-hop question answering (ITRG, Feng et al., 2023; IR-CoT, Trivedi et al., 2022; DSP, \u2502\n\u2502 Khattab et al., 2022). Our use of a hierarchical index and summarization also bears resemblance to further \u2502\n\u2502 approaches, such as generating a hierarchical index of text chunks by clustering the vectors of text embeddings \u2502\n\u2502 (RAPTOR, Sarthi et al., 2024) or generating a 'tree of clarifications' to answer multiple interpretations of \u2502\n\u2502 ambiguous questions (Kim et al., 2023). However, none of these iterative or hierarchical approaches use the \u2502\n\u2502 kind of self-generated graph index that enables Graph RAG. \u2502\n\u2502 --- \u2502\n\u2502 The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge \u2502\n\u2502 source enables large language models (LLMs) to answer questions over private and/or previously unseen document \u2502\n\u2502 collections. 
However, RAG fails on global questions directed at an entire text corpus, such as 'What are the \u2502\n\u2502 main themes in the dataset?', since this is inherently a queryfocused summarization (QFS) task, rather than an \u2502\n\u2502 explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by \u2502\n\u2502 typical RAGsystems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to \u2502\n\u2502 question answering over private text corpora that scales with both the generality of user questions and the \u2502\n\u2502 quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two \u2502\n\u2502 stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community \u2502\n\u2502 summaries for all groups of closely-related entities. Given a question, each community summary is used to \u2502\n\u2502 generate a partial response, before all partial responses are again summarized in a final response to the user. \u2502\n\u2502 For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG \u2502\n\u2502 leads to substantial improvements over a na\u00a8\u0131ve RAG baseline for both the comprehensiveness and diversity of \u2502\n\u2502 generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is \u2502\n\u2502 forthcoming at https://aka . ms/graphrag . \u2502\n\u2502 --- \u2502\n\u2502 Given the multi-stage nature of our Graph RAG mechanism, the multiple conditions we wanted to compare, and the \u2502\n\u2502 lack of gold standard answers to our activity-based sensemaking questions, we decided to adopt a head-to-head \u2502\n\u2502 comparison approach using an LLM evaluator. We selected three target metrics capturing qualities that are \u2502\n\u2502 desirable for sensemaking activities, as well as a control metric (directness) used as a indicator of validity. \u2502\n\u2502 Since directness is effectively in opposition to comprehensiveness and diversity, we would not expect any \u2502\n\u2502 method to win across all four metrics. \u2502\n\u2502 --- \u2502\n\u2502 Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This index spans nodes \u2502\n\u2502 (e.g., entities), edges (e.g., relationships), and covariates (e.g., claims) that have been detected, \u2502\n\u2502 extracted, and summarized by LLM prompts tailored to the domain of the dataset. Community detection (e.g., \u2502\n\u2502 Leiden, Traag et al., 2019) is used to partition the graph index into groups of elements (nodes, edges, \u2502\n\u2502 covariates) that the LLM can summarize in parallel at both indexing time and query time. The 'global answer' to \u2502\n\u2502 a given query is produced using a final round of query-focused summarization over all community summaries \u2502\n\u2502 reporting relevance to that query. \u2502\n\u2502 --- \u2502\n\u2502 Retrieval-augmented generation (RAG, Lewis et al., 2020) is an established approach to answering user questions \u2502\n\u2502 over entire datasets, but it is designed for situations where these answers are contained locally within \u2502\n\u2502 regions of text whose retrieval provides sufficient grounding for the generation task. 
Instead, a more \u2502\n\u2502 appropriate task framing is query-focused summarization (QFS, Dang, 2006), and in particular, query-focused \u2502\n\u2502 abstractive summarization that generates natural language summaries and not just concatenated excerpts (Baumel \u2502\n\u2502 et al., 2018; Laskar et al., 2020; Yao et al., 2017) . In recent years, however, such distinctions between \u2502\n\u2502 summarization tasks that are abstractive versus extractive, generic versus query-focused, and single-document \u2502\n\u2502 versus multi-document, have become less relevant. While early applications of the transformer architecture \u2502\n\u2502 showed substantial improvements on the state-of-the-art for all such summarization tasks (Goodwin et al., 2020; \u2502\n\u2502 Laskar et al., 2022; Liu and Lapata, 2019), these tasks are now trivialized by modern LLMs, including the GPT \u2502\n\u2502 (Achiam et al., 2023; Brown et al., 2020), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) series, \u2502\n\u2502 all of which can use in-context learning to summarize any content provided in their context window. \u2502\n\u2502 --- \u2502\n\u2502 community descriptions provide complete coverage of the underlying graph index and the input documents it \u2502\n\u2502 represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: \u2502\n\u2502 first using each community summary to answer the query independently and in parallel, then summarizing all \u2502\n\u2502 relevant partial answers into a final global answer. \u2502\n\u2502 \u2502\n\u2502 Question: What are the main advantages of using the Graph RAG approach for query-focused summarization compared \u2502\n\u2502 to traditional RAG methods? \u2502\n\u2502 Answer: \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 RAG Response \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 The main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG \u2502\n\u2502 methods include: \u2502\n\u2502 \u2502\n\u2502 1. **Improved Comprehensiveness and Diversity**: Graph RAG shows substantial improvements over a na\u00efve RAG \u2502\n\u2502 baseline in terms of the comprehensiveness and diversity of answers. This is particularly beneficial for global \u2502\n\u2502 sensemaking questions over large datasets. \u2502\n\u2502 \u2502\n\u2502 2. **Scalability**: Graph RAG provides scalability advantages, achieving efficient summarization with \u2502\n\u2502 significantly fewer context tokens required. For instance, it requires 26-33% fewer tokens for low-level \u2502\n\u2502 community summaries and over 97% fewer tokens for root-level summaries compared to source text summarization. \u2502\n\u2502 \u2502\n\u2502 3. **Efficiency in Iterative Question Answering**: Root-level Graph RAG offers a highly efficient method for \u2502\n\u2502 iterative question answering, which is crucial for sensemaking activities, with only a modest drop in \u2502\n\u2502 performance compared to other global methods. \u2502\n\u2502 \u2502\n\u2502 4. **Global Query Handling**: It supports handling global queries effectively, as it combines knowledge graph \u2502\n\u2502 generation, retrieval-augmented generation, and query-focused summarization, making it suitable for sensemaking \u2502\n\u2502 over entire text corpora. \u2502\n\u2502 \u2502\n\u2502 5. **Hierarchical Indexing and Summarization**: The use of a hierarchical index and summarization allows for \u2502\n\u2502 efficient processing and summarizing of community summaries into a final global answer, facilitating a \u2502\n\u2502 comprehensive coverage of the underlying graph index and input documents. \u2502\n\u2502 \u2502\n\u2502 6. **Reduced Token Cost**: For situations requiring many global queries over the same dataset, Graph RAG \u2502\n\u2502 achieves competitive performance to other global methods at a fraction of the token cost. 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/rag_azuresearch/#rag-with-azure-ai-search","title":"RAG with Azure AI Search\u00b6","text":""},{"location":"examples/rag_azuresearch/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using Docling for document parsing, Azure AI Search for vector indexing and retrieval, and Azure OpenAI for embeddings and chat completion.
This sample demonstrates how to parse a PDF with Docling, chunk it hierarchically, generate embeddings and index them in Azure AI Search, and perform RAG with Azure OpenAI. To run it, you will need:
Azure AI Search resource
Azure OpenAI resource with a deployed embedding and chat completion model (e.g. text-embedding-3-small and gpt-4o)
Docling 2.12+ installed in a Python 3.8+ environment (docling_core is installed automatically)
A GPU-enabled environment is preferred for faster parsing; Docling 2.12 automatically detects a GPU if one is present.
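If you prefer to select the accelerator explicitly rather than rely on auto-detection, a minimal sketch follows (the import path of the accelerator options may vary slightly between Docling versions):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
# AUTO picks CUDA or MPS when available and falls back to CPU otherwise
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.AUTO
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)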
We\u2019ll parse the Microsoft GraphRAG Research Paper (~15 pages). Parsing should be relatively quick, even on CPU, but it will be faster on a GPU or MPS device if available.
(If you prefer a different document, simply provide a different URL or local file path.)
"},{"location":"examples/rag_azuresearch/#part-2-hierarchical-chunking","title":"Part 2: Hierarchical Chunking\u00b6","text":"We convert the Document into smaller chunks for embedding and indexing. The built-in HierarchicalChunker preserves structure.
We\u2019ll define a vector index in Azure AI Search, then embed each chunk using Azure OpenAI and upload in batches.
"},{"location":"examples/rag_azuresearch/#generate-embeddings-and-upload-to-azure-ai-search","title":"Generate Embeddings and Upload to Azure AI Search\u00b6","text":""},{"location":"examples/rag_azuresearch/#part-4-perform-rag-over-pdf","title":"Part 4: Perform RAG over PDF\u00b6","text":"Combine retrieval from Azure AI Search with Azure OpenAI Chat Completions (aka. grounding your LLM)
"},{"location":"examples/rag_haystack/","title":"RAG with Haystack","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 RemoteThis example leverages the Haystack Docling extension, along with Milvus-based document store and retriever instances, as well as sentence-transformers embeddings.
The presented DoclingConverter component enables you to use various document types in your LLM applications with ease and speed, and to leverage Docling's rich document structure for advanced, document-native grounding.
DoclingConverter supports two different export modes:
ExportType.MARKDOWN: if you want to capture each input document as a separate Haystack document, or ExportType.DOC_CHUNKS (default): if you want each input document to be chunked and each individual chunk captured as a separate Haystack document downstream. The example lets you explore both modes via the EXPORT_TYPE parameter; depending on the value set, the ingestion and RAG pipelines are then set up accordingly.
The generation step uses the Hugging Face Inference API; provide your access token via the environment variable HF_TOKEN. Requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling \"pymilvus[milvus-lite]\" milvus-haystack sentence-transformers python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling \"pymilvus[milvus-lite]\" milvus-haystack sentence-transformers python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom docling_haystack.converter import ExportType\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nPATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nEXPORT_TYPE = ExportType.DOC_CHUNKS\nQUESTION = \"Which are the main AI models in Docling?\"\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nimport os from pathlib import Path from tempfile import mkdtemp from docling_haystack.converter import ExportType from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") PATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" EXPORT_TYPE = ExportType.DOC_CHUNKS QUESTION = \"Which are the main AI models in Docling?\" TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied!
from docling_haystack.converter import DoclingConverter\nfrom haystack import Pipeline\nfrom haystack.components.embedders import (\n SentenceTransformersDocumentEmbedder,\n SentenceTransformersTextEmbedder,\n)\nfrom haystack.components.preprocessors import DocumentSplitter\nfrom haystack.components.writers import DocumentWriter\nfrom milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever\n\nfrom docling.chunking import HybridChunker\n\ndocument_store = MilvusDocumentStore(\n connection_args={\"uri\": MILVUS_URI},\n drop_old=True,\n text_field=\"txt\", # set for preventing conflict with same-name metadata field\n)\n\nidx_pipe = Pipeline()\nidx_pipe.add_component(\n \"converter\",\n DoclingConverter(\n export_type=EXPORT_TYPE,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n ),\n)\nidx_pipe.add_component(\n \"embedder\",\n SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),\n)\nidx_pipe.add_component(\"writer\", DocumentWriter(document_store=document_store))\nif EXPORT_TYPE == ExportType.DOC_CHUNKS:\n idx_pipe.connect(\"converter\", \"embedder\")\nelif EXPORT_TYPE == ExportType.MARKDOWN:\n idx_pipe.add_component(\n \"splitter\",\n DocumentSplitter(split_by=\"sentence\", split_length=1),\n )\n idx_pipe.connect(\"converter.documents\", \"splitter.documents\")\n idx_pipe.connect(\"splitter.documents\", \"embedder.documents\")\nelse:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\nidx_pipe.connect(\"embedder\", \"writer\")\nidx_pipe.run({\"converter\": {\"paths\": PATHS}})\n from docling_haystack.converter import DoclingConverter from haystack import Pipeline from haystack.components.embedders import ( SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder, ) from haystack.components.preprocessors import DocumentSplitter from haystack.components.writers import DocumentWriter from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever from docling.chunking import HybridChunker document_store = MilvusDocumentStore( connection_args={\"uri\": MILVUS_URI}, drop_old=True, text_field=\"txt\", # set for preventing conflict with same-name metadata field ) idx_pipe = Pipeline() idx_pipe.add_component( \"converter\", DoclingConverter( export_type=EXPORT_TYPE, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ), ) idx_pipe.add_component( \"embedder\", SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID), ) idx_pipe.add_component(\"writer\", DocumentWriter(document_store=document_store)) if EXPORT_TYPE == ExportType.DOC_CHUNKS: idx_pipe.connect(\"converter\", \"embedder\") elif EXPORT_TYPE == ExportType.MARKDOWN: idx_pipe.add_component( \"splitter\", DocumentSplitter(split_by=\"sentence\", split_length=1), ) idx_pipe.connect(\"converter.documents\", \"splitter.documents\") idx_pipe.connect(\"splitter.documents\", \"embedder.documents\") else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\") idx_pipe.connect(\"embedder\", \"writer\") idx_pipe.run({\"converter\": {\"paths\": PATHS}}) Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n
Batches: 0%| | 0/2 [00:00<?, ?it/s]Out[3]:
{'writer': {'documents_written': 54}} In\u00a0[4]: Copied! from haystack.components.builders import AnswerBuilder\nfrom haystack.components.builders.prompt_builder import PromptBuilder\nfrom haystack.components.generators import HuggingFaceAPIGenerator\nfrom haystack.utils import Secret\n\nprompt_template = \"\"\"\n Given these documents, answer the question.\n Documents:\n {% for doc in documents %}\n {{ doc.content }}\n {% endfor %}\n Question: {{query}}\n Answer:\n \"\"\"\n\nrag_pipe = Pipeline()\nrag_pipe.add_component(\n \"embedder\",\n SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),\n)\nrag_pipe.add_component(\n \"retriever\",\n MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),\n)\nrag_pipe.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template))\nrag_pipe.add_component(\n \"llm\",\n HuggingFaceAPIGenerator(\n api_type=\"serverless_inference_api\",\n api_params={\"model\": GENERATION_MODEL_ID},\n token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,\n ),\n)\nrag_pipe.add_component(\"answer_builder\", AnswerBuilder())\nrag_pipe.connect(\"embedder.embedding\", \"retriever\")\nrag_pipe.connect(\"retriever\", \"prompt_builder.documents\")\nrag_pipe.connect(\"prompt_builder\", \"llm\")\nrag_pipe.connect(\"llm.replies\", \"answer_builder.replies\")\nrag_pipe.connect(\"llm.meta\", \"answer_builder.meta\")\nrag_pipe.connect(\"retriever\", \"answer_builder.documents\")\nrag_res = rag_pipe.run(\n {\n \"embedder\": {\"text\": QUESTION},\n \"prompt_builder\": {\"query\": QUESTION},\n \"answer_builder\": {\"query\": QUESTION},\n }\n)\n from haystack.components.builders import AnswerBuilder from haystack.components.builders.prompt_builder import PromptBuilder from haystack.components.generators import HuggingFaceAPIGenerator from haystack.utils import Secret prompt_template = \"\"\" Given these documents, answer the question. Documents: {% for doc in documents %} {{ doc.content }} {% endfor %} Question: {{query}} Answer: \"\"\" rag_pipe = Pipeline() rag_pipe.add_component( \"embedder\", SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID), ) rag_pipe.add_component( \"retriever\", MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K), ) rag_pipe.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template)) rag_pipe.add_component( \"llm\", HuggingFaceAPIGenerator( api_type=\"serverless_inference_api\", api_params={\"model\": GENERATION_MODEL_ID}, token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None, ), ) rag_pipe.add_component(\"answer_builder\", AnswerBuilder()) rag_pipe.connect(\"embedder.embedding\", \"retriever\") rag_pipe.connect(\"retriever\", \"prompt_builder.documents\") rag_pipe.connect(\"prompt_builder\", \"llm\") rag_pipe.connect(\"llm.replies\", \"answer_builder.replies\") rag_pipe.connect(\"llm.meta\", \"answer_builder.meta\") rag_pipe.connect(\"retriever\", \"answer_builder.documents\") rag_res = rag_pipe.run( { \"embedder\": {\"text\": QUESTION}, \"prompt_builder\": {\"query\": QUESTION}, \"answer_builder\": {\"query\": QUESTION}, } ) Batches: 0%| | 0/1 [00:00<?, ?it/s]
/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n warnings.warn(\n
Below we print out the RAG results. If you have used ExportType.DOC_CHUNKS, notice how the sources contain document-level grounding (e.g. page number or bounding box information):
from docling.chunking import DocChunk\n\nprint(f\"Question:\\n{QUESTION}\\n\")\nprint(f\"Answer:\\n{rag_res['answer_builder']['answers'][0].data.strip()}\\n\")\nprint(\"Sources:\")\nsources = rag_res[\"answer_builder\"][\"answers\"][0].documents\nfor source in sources:\n if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n doc_chunk = DocChunk.model_validate(source.meta[\"dl_meta\"])\n print(f\"- text: {doc_chunk.text!r}\")\n if doc_chunk.meta.origin:\n print(f\" file: {doc_chunk.meta.origin.filename}\")\n if doc_chunk.meta.headings:\n print(f\" section: {' / '.join(doc_chunk.meta.headings)}\")\n bbox = doc_chunk.meta.doc_items[0].prov[0].bbox\n print(\n f\" page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, \"\n f\"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]\"\n )\n elif EXPORT_TYPE == ExportType.MARKDOWN:\n print(repr(source.content))\n else:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\n from docling.chunking import DocChunk print(f\"Question:\\n{QUESTION}\\n\") print(f\"Answer:\\n{rag_res['answer_builder']['answers'][0].data.strip()}\\n\") print(\"Sources:\") sources = rag_res[\"answer_builder\"][\"answers\"][0].documents for source in sources: if EXPORT_TYPE == ExportType.DOC_CHUNKS: doc_chunk = DocChunk.model_validate(source.meta[\"dl_meta\"]) print(f\"- text: {doc_chunk.text!r}\") if doc_chunk.meta.origin: print(f\" file: {doc_chunk.meta.origin.filename}\") if doc_chunk.meta.headings: print(f\" section: {' / '.join(doc_chunk.meta.headings)}\") bbox = doc_chunk.meta.doc_items[0].prov[0].bbox print( f\" page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, \" f\"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]\" ) elif EXPORT_TYPE == ExportType.MARKDOWN: print(repr(source.content)) else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\") Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nThe main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, while TableFormer is a state-of-the-art table structure recognition model. These models are provided with pre-trained weights and a separate package for the inference code as docling-ibm-models. They are also used in the open-access deepsearch-experience, a cloud-native service for knowledge exploration tasks. Additionally, Docling plans to extend its model library with a figure-classifier model, an equation-recognition model, a code-recognition model, and more in the future.\n\nSources:\n- text: 'As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.'\n file: 2408.09869v5.pdf\n section: 3.2 AI models\n page: 3, bounding box: [107, 406, 504, 330]\n- text: 'Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). 
Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.'\n file: 2408.09869v5.pdf\n section: 3 Processing pipeline\n page: 2, bounding box: [107, 273, 504, 176]\n- text: 'Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.\\nWe encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.'\n section: 6 Future work and contributions\n page: 5, bounding box: [106, 323, 504, 258]\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/rag_haystack/#rag-with-haystack","title":"RAG with Haystack\u00b6","text":""},{"location":"examples/rag_haystack/#overview","title":"Overview\u00b6","text":""},{"location":"examples/rag_haystack/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_haystack/#indexing-pipeline","title":"Indexing pipeline\u00b6","text":""},{"location":"examples/rag_haystack/#rag-pipeline","title":"RAG pipeline\u00b6","text":""},{"location":"examples/rag_langchain/","title":"RAG with LangChain","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote
This example leverages the LangChain Docling integration, along with a Milvus vector store and sentence-transformers embeddings.
The presented DoclingLoader component enables you to:
DoclingLoader supports two different export modes:
ExportType.MARKDOWN: if you want to capture each input document as a separate LangChain document, or ExportType.DOC_CHUNKS (default): if you want to have each input document chunked and then capture each individual chunk as a separate LangChain document downstream. The example allows exploring both modes via the EXPORT_TYPE parameter; depending on the value set, the example pipeline is then set up accordingly.
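For orientation, here is a minimal sketch contrasting the two modes; it reuses the DoclingLoader and ExportType imports shown later in this example, and the file path is just a placeholder:

from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

# One LangChain document per Docling chunk (default behavior):
chunk_loader = DoclingLoader(
    file_path="https://arxiv.org/pdf/2408.09869",
    export_type=ExportType.DOC_CHUNKS,
)

# One LangChain document per input file, exported as Markdown
# (to be split downstream, e.g. with a Markdown header splitter):
markdown_loader = DoclingLoader(
    file_path="https://arxiv.org/pdf/2408.09869",
    export_type=ExportType.MARKDOWN,
)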
HF_TOKEN. Requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom dotenv import load_dotenv\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_docling.loader import ExportType\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nFILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nEXPORT_TYPE = ExportType.DOC_CHUNKS\nQUESTION = \"Which are the main AI models in Docling?\"\nPROMPT = PromptTemplate.from_template(\n \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n)\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n import os from pathlib import Path from tempfile import mkdtemp from dotenv import load_dotenv from langchain_core.prompts import PromptTemplate from langchain_docling.loader import ExportType def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") FILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" EXPORT_TYPE = ExportType.DOC_CHUNKS QUESTION = \"Which are the main AI models in Docling?\" PROMPT = PromptTemplate.from_template( \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\", ) TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied! from langchain_docling import DoclingLoader\n\nfrom docling.chunking import HybridChunker\n\nloader = DoclingLoader(\n file_path=FILE_PATH,\n export_type=EXPORT_TYPE,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n)\n\ndocs = loader.load()\nfrom langchain_docling import DoclingLoader from docling.chunking import HybridChunker loader = DoclingLoader( file_path=FILE_PATH, export_type=EXPORT_TYPE, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ) docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n
Note: a message saying \"Token indices sequence length is longer than the specified maximum sequence length...\" can be ignored in this case \u2014 details here.
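If you would rather avoid the warning entirely, one option is to cap the chunk size when constructing the chunker. The following is a minimal sketch, assuming your installed docling-core version exposes a max_tokens argument on HybridChunker:

from docling.chunking import HybridChunker

# Cap chunks at the embedding model's sequence length (512 for all-MiniLM-L6-v2).
# NOTE: `max_tokens` is an assumption about the installed docling-core version.
chunker = HybridChunker(tokenizer=EMBED_MODEL_ID, max_tokens=512)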
Determining the splits:
In\u00a0[4]: Copied!if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n splits = docs\nelif EXPORT_TYPE == ExportType.MARKDOWN:\n from langchain_text_splitters import MarkdownHeaderTextSplitter\n\n splitter = MarkdownHeaderTextSplitter(\n headers_to_split_on=[\n (\"#\", \"Header_1\"),\n (\"##\", \"Header_2\"),\n (\"###\", \"Header_3\"),\n ],\n )\n splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]\nelse:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\n if EXPORT_TYPE == ExportType.DOC_CHUNKS: splits = docs elif EXPORT_TYPE == ExportType.MARKDOWN: from langchain_text_splitters import MarkdownHeaderTextSplitter splitter = MarkdownHeaderTextSplitter( headers_to_split_on=[ (\"#\", \"Header_1\"), (\"##\", \"Header_2\"), (\"###\", \"Header_3\"), ], ) splits = [split for doc in docs for split in splitter.split_text(doc.page_content)] else: raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\") Inspecting some sample splits:
In\u00a0[5]: Copied!for d in splits[:3]:\n print(f\"- {d.page_content=}\")\nprint(\"...\")\n for d in splits[:3]: print(f\"- {d.page_content=}\") print(\"...\") - d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'\n- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R\u00a8uschlikon, Switzerland'\n- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n...\nIn\u00a0[6]: Copied!
import json\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom langchain_huggingface.embeddings import HuggingFaceEmbeddings\nfrom langchain_milvus import Milvus\n\nembedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n\n\nmilvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\nvectorstore = Milvus.from_documents(\n documents=splits,\n embedding=embedding,\n collection_name=\"docling_demo\",\n connection_args={\"uri\": milvus_uri},\n index_params={\"index_type\": \"FLAT\"},\n drop_old=True,\n)\n import json from pathlib import Path from tempfile import mkdtemp from langchain_huggingface.embeddings import HuggingFaceEmbeddings from langchain_milvus import Milvus embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID) milvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed vectorstore = Milvus.from_documents( documents=splits, embedding=embedding, collection_name=\"docling_demo\", connection_args={\"uri\": milvus_uri}, index_params={\"index_type\": \"FLAT\"}, drop_old=True, ) In\u00a0[7]: Copied! from langchain.chains import create_retrieval_chain\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain_huggingface import HuggingFaceEndpoint\n\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\nllm = HuggingFaceEndpoint(\n repo_id=GEN_MODEL_ID,\n huggingfacehub_api_token=HF_TOKEN,\n)\n\n\ndef clip_text(text, threshold=100):\n return f\"{text[:threshold]}...\" if len(text) > threshold else text\n from langchain.chains import create_retrieval_chain from langchain.chains.combine_documents import create_stuff_documents_chain from langchain_huggingface import HuggingFaceEndpoint retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K}) llm = HuggingFaceEndpoint( repo_id=GEN_MODEL_ID, huggingfacehub_api_token=HF_TOKEN, ) def clip_text(text, threshold=100): return f\"{text[:threshold]}...\" if len(text) > threshold else text Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\nIn\u00a0[8]: Copied!
question_answer_chain = create_stuff_documents_chain(llm, PROMPT)\nrag_chain = create_retrieval_chain(retriever, question_answer_chain)\nresp_dict = rag_chain.invoke({\"input\": QUESTION})\n\nclipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\nprint(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\nfor i, doc in enumerate(resp_dict[\"context\"]):\n print()\n print(f\"Source {i + 1}:\")\n print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n for key in doc.metadata:\n if key != \"pk\":\n val = doc.metadata.get(key)\n clipped_val = clip_text(val) if isinstance(val, str) else val\n print(f\" {key}: {clipped_val}\")\n question_answer_chain = create_stuff_documents_chain(llm, PROMPT) rag_chain = create_retrieval_chain(retriever, question_answer_chain) resp_dict = rag_chain.invoke({\"input\": QUESTION}) clipped_answer = clip_text(resp_dict[\"answer\"], threshold=200) print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\") for i, doc in enumerate(resp_dict[\"context\"]): print() print(f\"Source {i + 1}:\") print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\") for key in doc.metadata: if key != \"pk\": val = doc.metadata.get(key) clipped_val = clip_text(val) if isinstance(val, str) else val print(f\" {key}: {clipped_val}\") Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nDocling initially releases two AI models, a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art tab...\n\nSource 1:\n text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n\nSource 2:\n text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). 
Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n\nSource 3:\n text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/rag_langchain/#rag-with-langchain","title":"RAG with LangChain\u00b6","text":""},{"location":"examples/rag_langchain/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_langchain/#document-loading","title":"Document loading\u00b6","text":"
Now we can instantiate our loader and load documents.
"},{"location":"examples/rag_langchain/#ingestion","title":"Ingestion\u00b6","text":""},{"location":"examples/rag_langchain/#rag","title":"RAG\u00b6","text":""},{"location":"examples/rag_llamaindex/","title":"RAG with LlamaIndex","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 RemoteThis example leverages the official LlamaIndex Docling extension.
The presented extensions DoclingReader and DoclingNodeParser enable you to:
HF_TOKEN. Requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\nfrom warnings import filterwarnings\n\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\nfilterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\nfilterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\nimport os from pathlib import Path from tempfile import mkdtemp from warnings import filterwarnings from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\") filterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\") # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"
We can now define the main parameters:
In\u00a0[3]: Copied!from llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\nEMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nGEN_MODEL = HuggingFaceInferenceAPI(\n token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n)\nSOURCE = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report\nQUERY = \"Which are the main AI models in Docling?\"\n\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI EMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\") MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") GEN_MODEL = HuggingFaceInferenceAPI( token=_get_env_from_colab_or_os(\"HF_TOKEN\"), model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", ) SOURCE = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report QUERY = \"Which are the main AI models in Docling?\" embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))
To create a simple RAG pipeline, we can:
DoclingReader, which by default exports to Markdown, andMarkdownNodeParserfrom llama_index.core import StorageContext, VectorStoreIndex\nfrom llama_index.core.node_parser import MarkdownNodeParser\nfrom llama_index.readers.docling import DoclingReader\nfrom llama_index.vector_stores.milvus import MilvusVectorStore\n\nreader = DoclingReader()\nnode_parser = MarkdownNodeParser()\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n from llama_index.core import StorageContext, VectorStoreIndex from llama_index.core.node_parser import MarkdownNodeParser from llama_index.readers.docling import DoclingReader from llama_index.vector_stores.milvus import MilvusVectorStore reader = DoclingReader() node_parser = MarkdownNodeParser() vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) Q: Which are the main AI models in Docling?\nA: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.\n\nSources:\n
[('3.2 AI models\\n\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'Header_2': '3.2 AI models'}),\n (\"5 Applications\\n\\nThanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.\",\n {'Header_2': '5 Applications'})] To leverage Docling's rich native format, we:
create a DoclingReader with JSON export type, and employ a DoclingNodeParser in order to appropriately parse that Docling format. Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):
In\u00a0[5]: Copied!from llama_index.node_parser.docling import DoclingNodeParser\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\nnode_parser = DoclingNodeParser()\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n from llama_index.node_parser.docling import DoclingNodeParser reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) node_parser = DoclingNodeParser() vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) Q: Which are the main AI models in Docling?\nA: The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table structure recognition model.\n\nSources:\n
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/34',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 3,\n 'bbox': {'l': 107.07593536376953,\n 't': 406.1695251464844,\n 'r': 504.1148681640625,\n 'b': 330.2677307128906,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 608]}]}],\n 'headings': ['3.2 AI models'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869v3.pdf'}}),\n ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/9',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 1,\n 'bbox': {'l': 107.0031967163086,\n 't': 136.7283935546875,\n 'r': 504.04998779296875,\n 'b': 83.30133056640625,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 488]}]}],\n 'headings': ['1 Introduction'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869v3.pdf'}})] To demonstrate this usage pattern, we first set up a test document directory.
In\u00a0[6]: Copied!from pathlib import Path\nfrom tempfile import mkdtemp\n\nimport requests\n\ntmp_dir_path = Path(mkdtemp())\nr = requests.get(SOURCE)\nwith open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n out_file.write(r.content)\n from pathlib import Path from tempfile import mkdtemp import requests tmp_dir_path = Path(mkdtemp()) r = requests.get(SOURCE) with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file: out_file.write(r.content) Using the reader and node_parser definitions from any of the above variants, usage with SimpleDirectoryReader then looks as follows:
from llama_index.core import SimpleDirectoryReader\n\ndir_reader = SimpleDirectoryReader(\n input_dir=tmp_dir_path,\n file_extractor={\".pdf\": reader},\n)\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=dir_reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n from llama_index.core import SimpleDirectoryReader dir_reader = SimpleDirectoryReader( input_dir=tmp_dir_path, file_extractor={\".pdf\": reader}, ) vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=dir_reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) Loading files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:11<00:00, 11.27s/file]\n
Q: Which are the main AI models in Docling?\nA: 1. A layout analysis model, an accurate object-detector for page elements. 2. TableFormer, a state-of-the-art table structure recognition model.\n\nSources:\n
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n 'file_name': '2408.09869.pdf',\n 'file_type': 'application/pdf',\n 'file_size': 5566574,\n 'creation_date': '2024-10-28',\n 'last_modified_date': '2024-10-28',\n 'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/34',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 3,\n 'bbox': {'l': 107.07593536376953,\n 't': 406.1695251464844,\n 'r': 504.1148681640625,\n 'b': 330.2677307128906,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 608]}]}],\n 'headings': ['3.2 AI models'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869.pdf'}}),\n ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n 'file_name': '2408.09869.pdf',\n 'file_type': 'application/pdf',\n 'file_size': 5566574,\n 'creation_date': '2024-10-28',\n 'last_modified_date': '2024-10-28',\n 'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/9',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 1,\n 'bbox': {'l': 107.0031967163086,\n 't': 136.7283935546875,\n 'r': 504.04998779296875,\n 'b': 83.30133056640625,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 488]}]}],\n 'headings': ['1 Introduction'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869.pdf'}})] In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/rag_llamaindex/#rag-with-llamaindex","title":"RAG with LlamaIndex\u00b6","text":""},{"location":"examples/rag_llamaindex/#overview","title":"Overview\u00b6","text":""},{"location":"examples/rag_llamaindex/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_llamaindex/#using-markdown-export","title":"Using Markdown export\u00b6","text":""},{"location":"examples/rag_llamaindex/#using-docling-format","title":"Using Docling format\u00b6","text":""},{"location":"examples/rag_llamaindex/#with-simple-directory-reader","title":"With Simple Directory Reader\u00b6","text":""},{"location":"examples/rag_milvus/","title":"RAG with Milvus","text":"In\u00a0[\u00a0]: Copied!
! pip install --upgrade \"pymilvus[milvus-lite]\" docling openai torch\n! pip install --upgrade \"pymilvus[milvus-lite]\" docling openai torch
If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click on the \"Runtime\" menu at the top of the screen and select \"Restart session\" from the dropdown menu).
Part of what makes Docling so remarkable is the fact that it can run on commodity hardware, which means this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with an Apple silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
In\u00a0[1]: Copied!import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n import torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) MPS GPU is enabled.\nIn\u00a0[2]: Copied!
import os\n\nos.environ[\"OPENAI_API_KEY\"] = \"sk-***********\"\nimport os os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\" In\u00a0[3]: Copied!
from openai import OpenAI\n\nopenai_client = OpenAI()\nfrom openai import OpenAI openai_client = OpenAI()
Define a function to generate text embeddings using the OpenAI client. We use the text-embedding-3-small model as an example.
In\u00a0[4]: Copied!def emb_text(text):\n return (\n openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\")\n .data[0]\n .embedding\n )\ndef emb_text(text): return ( openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\") .data[0] .embedding )
Generate a test embedding and print its dimension and first few elements.
In\u00a0[5]: Copied!test_embedding = emb_text(\"This is a test\")\nembedding_dim = len(test_embedding)\nprint(embedding_dim)\nprint(test_embedding[:10])\ntest_embedding = emb_text(\"This is a test\") embedding_dim = len(test_embedding) print(embedding_dim) print(test_embedding[:10])
1536\n[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]\n
In this tutorial, we will use a Markdown file (source) as the input. We will process the document using a HierarchicalChunker provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks.
In\u00a0[6]: Copied!from docling_core.transforms.chunker import HierarchicalChunker\n\nfrom docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\nchunker = HierarchicalChunker()\n\n# Convert the input file to Docling Document\nsource = \"https://milvus.io/docs/overview.md\"\ndoc = converter.convert(source).document\n\n# Perform hierarchical chunking\ntexts = [chunk.text for chunk in chunker.chunk(doc)]\nfrom docling_core.transforms.chunker import HierarchicalChunker from docling.document_converter import DocumentConverter converter = DocumentConverter() chunker = HierarchicalChunker() # Convert the input file to Docling Document source = \"https://milvus.io/docs/overview.md\" doc = converter.convert(source).document # Perform hierarchical chunking texts = [chunk.text for chunk in chunker.chunk(doc)] In\u00a0[7]: Copied!
from pymilvus import MilvusClient\n\nmilvus_client = MilvusClient(uri=\"./milvus_demo.db\")\ncollection_name = \"my_rag_collection\"\nfrom pymilvus import MilvusClient milvus_client = MilvusClient(uri=\"./milvus_demo.db\") collection_name = \"my_rag_collection\"
As for the argument of MilvusClient:
uri as a local file, e.g../milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.http://localhost:19530, as your uri.uri and token, which correspond to the Public Endpoint and Api key in Zilliz Cloud.Check if the collection already exists and drop it if it does.
In\u00a0[8]: Copied!if milvus_client.has_collection(collection_name):\n milvus_client.drop_collection(collection_name)\nif milvus_client.has_collection(collection_name): milvus_client.drop_collection(collection_name)
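As referenced above, here is a minimal sketch of the three connection styles; the server address, endpoint, and token are placeholders:

from pymilvus import MilvusClient

# Milvus Lite: store everything in a local file (what this notebook uses)
client_lite = MilvusClient(uri="./milvus_demo.db")

# Self-hosted Milvus server (Docker / Kubernetes): point the uri at the server
client_server = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud: use the Public Endpoint as uri and the API key as token
client_cloud = MilvusClient(uri="https://<public-endpoint>", token="<api-key>")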
Create a new collection with specified parameters.
If we don\u2019t specify any field information, Milvus will automatically create a default id field for the primary key, and a vector field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.
milvus_client.create_collection(\n collection_name=collection_name,\n dimension=embedding_dim,\n metric_type=\"IP\", # Inner product distance\n consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.\n)\nmilvus_client.create_collection( collection_name=collection_name, dimension=embedding_dim, metric_type=\"IP\", # Inner product distance consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details. ) In\u00a0[10]: Copied!
from tqdm import tqdm\n\ndata = []\n\nfor i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")):\n embedding = emb_text(chunk)\n data.append({\"id\": i, \"vector\": embedding, \"text\": chunk})\n\nmilvus_client.insert(collection_name=collection_name, data=data)\n from tqdm import tqdm data = [] for i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")): embedding = emb_text(chunk) data.append({\"id\": i, \"vector\": embedding, \"text\": chunk}) milvus_client.insert(collection_name=collection_name, data=data) Processing chunks: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 38/38 [00:14<00:00, 2.59it/s]\nOut[10]:
{'insert_count': 38, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], 'cost': 0} In\u00a0[11]: Copied! question = (\n \"What are the three deployment modes of Milvus, and what are their differences?\"\n)\nquestion = ( \"What are the three deployment modes of Milvus, and what are their differences?\" )
Search for the question in the collection and retrieve the semantic top-3 matches.
In\u00a0[12]: Copied!search_res = milvus_client.search(\n collection_name=collection_name,\n data=[emb_text(question)],\n limit=3,\n search_params={\"metric_type\": \"IP\", \"params\": {}},\n output_fields=[\"text\"],\n)\n search_res = milvus_client.search( collection_name=collection_name, data=[emb_text(question)], limit=3, search_params={\"metric_type\": \"IP\", \"params\": {}}, output_fields=[\"text\"], ) Let\u2019s take a look at the search results of the query
In\u00a0[13]: Copied!import json\n\nretrieved_lines_with_distances = [\n (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n]\nprint(json.dumps(retrieved_lines_with_distances, indent=4))\nimport json retrieved_lines_with_distances = [ (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0] ] print(json.dumps(retrieved_lines_with_distances, indent=4))
[\n [\n \"Milvus offers three deployment modes, covering a wide range of data scales\\u2014from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors:\",\n 0.6503315567970276\n ],\n [\n \"Milvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it\\u2019s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more.\",\n 0.6281915903091431\n ],\n [\n \"What is Milvus?\\nUnstructured Data, Embeddings, and Milvus\\nWhat Makes Milvus so Fast\\uff1f\\nWhat Makes Milvus so Scalable\\nTypes of Searches Supported by Milvus\\nComprehensive Feature Set\",\n 0.6117826700210571\n ]\n]\nIn\u00a0[14]: Copied!
context = \"\\n\".join(\n [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n)\ncontext = \"\\n\".join( [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances] )
Define system and user prompts for the Language Model. This prompt is assembled with the documents retrieved from Milvus.
In\u00a0[16]: Copied!SYSTEM_PROMPT = \"\"\"\nHuman: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n\"\"\"\nUSER_PROMPT = f\"\"\"\nUse the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n<context>\n{context}\n</context>\n<question>\n{question}\n</question>\n\"\"\"\n SYSTEM_PROMPT = \"\"\" Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided. \"\"\" USER_PROMPT = f\"\"\" Use the following pieces of information enclosed in tags to provide an answer to the question enclosed in tags. {context} {question} \"\"\" Use OpenAI ChatGPT to generate a response based on the prompts.
In\u00a0[17]: Copied!response = openai_client.chat.completions.create(\n model=\"gpt-4o\",\n messages=[\n {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n {\"role\": \"user\", \"content\": USER_PROMPT},\n ],\n)\nprint(response.choices[0].message.content)\n response = openai_client.chat.completions.create( model=\"gpt-4o\", messages=[ {\"role\": \"system\", \"content\": SYSTEM_PROMPT}, {\"role\": \"user\", \"content\": USER_PROMPT}, ], ) print(response.choices[0].message.content) The three deployment modes of Milvus are:\n\n1. **Milvus Lite**: This is a Python library that integrates easily into your applications. It's a lightweight version ideal for quick prototyping in Jupyter Notebooks or for running on edge devices with limited resources.\n\n2. **Milvus Standalone**: This mode is a single-machine server deployment where all components are bundled into a single Docker image, making it convenient to deploy.\n\n3. **Milvus Distributed**: This mode is designed for deployment on Kubernetes clusters. It features a cloud-native architecture suited for managing scenarios at a billion-scale or larger, ensuring redundancy in critical components.\n"},{"location":"examples/rag_milvus/#rag-with-milvus","title":"RAG with Milvus\u00b6","text":"Step Tech Execution Embedding OpenAI (text-embedding-3-small) \ud83c\udf10 Remote Vector store Milvus \ud83d\udcbb Local Gen AI OpenAI (gpt-4o) \ud83c\udf10 Remote"},{"location":"examples/rag_milvus/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"
This is a code recipe that uses Milvus, the world's most advanced open-source vector database, to perform RAG over documents parsed by Docling.
In this notebook, we accomplish the following:
Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running this notebook:
To start, install the required dependencies by running the following command:
"},{"location":"examples/rag_milvus/#gpu-checking","title":"GPU Checking\u00b6","text":""},{"location":"examples/rag_milvus/#setting-up-api-keys","title":"Setting Up API Keys\u00b6","text":"We will use OpenAI as the LLM in this example. You should prepare the OPENAI_API_KEY as an environment variable.
"},{"location":"examples/rag_milvus/#prepare-the-llm-and-embedding-model","title":"Prepare the LLM and Embedding Model\u00b6","text":"We initialize the OpenAI client to prepare the embedding model.
"},{"location":"examples/rag_milvus/#process-data-using-docling","title":"Process Data Using Docling\u00b6","text":"Docling can parse various document formats into a unified representation (Docling Document), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to the official documentation.
"},{"location":"examples/rag_milvus/#load-data-into-milvus","title":"Load Data into Milvus\u00b6","text":""},{"location":"examples/rag_milvus/#create-the-collection","title":"Create the collection\u00b6","text":"With data in hand, we can create a MilvusClient instance and insert the data into a Milvus collection.
Let\u2019s specify a query question about the website we just scraped.
"},{"location":"examples/rag_milvus/#use-llm-to-get-a-rag-response","title":"Use LLM to get a RAG response\u00b6","text":"Convert the retrieved documents into a string format.
"},{"location":"examples/rag_mongodb/","title":"RAG with MongoDB + VoyageAI","text":"Step Tech Execution Embedding Voyage AI \ud83c\udf10 Remote Vector store MongoDB \ud83c\udf10 Remote Gen AI Azure Open AI \ud83c\udf10 Remote In\u00a0[124]: Copied!%%capture\n%pip install docling~=\"2.7.0\"\n%pip install pymongo[srv]\n%pip install voyageai\n%pip install openai\n\nimport logging\nimport warnings\n\nwarnings.filterwarnings(\"ignore\")\nlogging.getLogger(\"pymongo\").setLevel(logging.ERROR)\n%%capture %pip install docling~=\"2.7.0\" %pip install pymongo[srv] %pip install voyageai %pip install openai import logging import warnings warnings.filterwarnings(\"ignore\") logging.getLogger(\"pymongo\").setLevel(logging.ERROR) In\u00a0[125]: Copied!
import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n import torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) MPS GPU is enabled.\nIn\u00a0[126]: Copied!
# Influential machine learning papers\nsource_urls = [\n \"https://arxiv.org/pdf/1706.03762\" # Attention is All You Need\n]\n# Influential machine learning papers source_urls = [ \"https://arxiv.org/pdf/1706.03762\" # Attention is All You Need ] In\u00a0[127]: Copied!
from pprint import pprint\n\nfrom docling.document_converter import DocumentConverter\n\n# Instantiate the doc converter\ndoc_converter = DocumentConverter()\n\n# Since we want to use a single document, we will convert just the first URL. For multiple documents, you can use convert_all() method and then iterate through the list of converted documents.\npdf_doc = source_urls[0]\nconverted_doc = doc_converter.convert(pdf_doc).document\nfrom pprint import pprint from docling.document_converter import DocumentConverter # Instantiate the doc converter doc_converter = DocumentConverter() # Since we want to use a single document, we will convert just the first URL. For multiple documents, you can use convert_all() method and then iterate through the list of converted documents. pdf_doc = source_urls[0] converted_doc = doc_converter.convert(pdf_doc).document
Fetching 9 files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 9/9 [00:00<00:00, 73728.00it/s]\nIn\u00a0[137]: Copied!
from docling_core.transforms.chunker import HierarchicalChunker\n\n# Initialize the chunker\nchunker = HierarchicalChunker()\n\n# Perform hierarchical chunking on the converted document and get text from chunks\nchunks = list(chunker.chunk(converted_doc))\nchunk_texts = [chunk.text for chunk in chunks]\nchunk_texts[:20] # Display a few chunk texts\nfrom docling_core.transforms.chunker import HierarchicalChunker # Initialize the chunker chunker = HierarchicalChunker() # Perform hierarchical chunking on the converted document and get text from chunks chunks = list(chunker.chunk(converted_doc)) chunk_texts = [chunk.text for chunk in chunks] chunk_texts[:20] # Display a few chunk texts Out[137]:
['arXiv:1706.03762v7 [cs.CL] 2 Aug 2023',\n 'Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.',\n 'Ashish Vaswani \u2217 Google Brain avaswani@google.com',\n 'Noam Shazeer \u2217 Google Brain noam@google.com',\n 'Niki Parmar \u2217 Google Research nikip@google.com',\n 'Jakob Uszkoreit \u2217 Google Research usz@google.com',\n 'Llion Jones \u2217 Google Research llion@google.com',\n 'Aidan N. Gomez \u2217 \u2020 University of Toronto aidan@cs.toronto.edu',\n '\u0141ukasz Kaiser \u2217 Google Brain lukaszkaiser@google.com',\n 'Illia Polosukhin \u2217 \u2021',\n 'illia.polosukhin@gmail.com',\n 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.',\n '$^{\u2217}$Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.',\n '$^{\u2020}$Work performed while at Google Brain.',\n '$^{\u2021}$Work performed while at Google Research.',\n '31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.',\n 'Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].',\n 'Recurrent models typically factor computation along the symbol positions of the input and output sequences. 
Aligning the positions to steps in computation time, they generate a sequence of hidden states h$_{t}$ , as a function of the previous hidden state h$_{t}$$_{-}$$_{1}$ and the input for position t . This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.',\n 'Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.',\n 'In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.'] We will be using VoyageAI embedding model for converting the above chunks to embeddings, thereafter pushing them to MongoDB for further consumption.
VoyageAI offers a range of embedding models; we will be using voyage-context-3 for best results in this case. It is a contextualized chunk embedding model, where each chunk embedding encodes not only the chunk\u2019s own content but also captures contextual information from the full document.
You can go through the blogpost to understand how it performs in comparison to other embedding models.
Create an account on Voyage and get your API key.
In\u00a0[\u00a0]: Copied!import voyageai\n\n# Voyage API key\nVOYAGE_API_KEY = \"**********************\"\n\n# Initialize the VoyageAI client\nvo = voyageai.Client(VOYAGE_API_KEY)\nresult = vo.contextualized_embed(inputs=[chunk_texts], model=\"voyage-context-3\")\ncontextualized_chunk_embds = [emb for r in result.results for emb in r.embeddings]\nimport voyageai # Voyage API key VOYAGE_API_KEY = \"**********************\" # Initialize the VoyageAI client vo = voyageai.Client(VOYAGE_API_KEY) result = vo.contextualized_embed(inputs=[chunk_texts], model=\"voyage-context-3\") contextualized_chunk_embds = [emb for r in result.results for emb in r.embeddings] In\u00a0[121]: Copied!
# Check lengths to ensure they match\nprint(\"Chunk Texts Length:\", len(chunk_texts))\nprint(\"Contextualized Chunk Embeddings Length:\", len(contextualized_chunk_embds))\n# Check lengths to ensure they match print(\"Chunk Texts Length:\", len(chunk_texts)) print(\"Contextualized Chunk Embeddings Length:\", len(contextualized_chunk_embds))
Chunk Texts Length: 118\nContextualized Chunk Embeddings Length: 118\nIn\u00a0[115]: Copied!
# Combine chunks with their embeddings\nchunk_data = [\n {\"text\": text, \"embedding\": emb}\n for text, emb in zip(chunk_texts, contextualized_chunk_embds)\n]\n # Combine chunks with their embeddings chunk_data = [ {\"text\": text, \"embedding\": emb} for text, emb in zip(chunk_texts, contextualized_chunk_embds) ] In\u00a0[\u00a0]: Copied! # Insert to MongoDB\nfrom pymongo import MongoClient\n\nclient = MongoClient(\n \"mongodb+srv://*******.mongodb.net/\"\n) # Replace with your MongoDB connection string\ndb = client[\"rag_db\"] # Database name\ncollection = db[\"documents\"] # Collection name\n\n# Insert chunk data into MongoDB\nresponse = collection.insert_many(chunk_data)\nprint(f\"Inserted {len(response.inserted_ids)} documents into MongoDB.\")\n # Insert to MongoDB from pymongo import MongoClient client = MongoClient( \"mongodb+srv://*******.mongodb.net/\" ) # Replace with your MongoDB connection string db = client[\"rag_db\"] # Database name collection = db[\"documents\"] # Collection name # Insert chunk data into MongoDB response = collection.insert_many(chunk_data) print(f\"Inserted {len(response.inserted_ids)} documents into MongoDB.\") Inserted 118 documents into MongoDB.\nIn\u00a0[117]: Copied!
from pymongo.operations import SearchIndexModel\n\n# Create your index model, then create the search index\nsearch_index_model = SearchIndexModel(\n definition={\n \"fields\": [\n {\n \"type\": \"vector\",\n \"path\": \"embedding\",\n \"numDimensions\": 1024,\n \"similarity\": \"dotProduct\",\n }\n ]\n },\n name=\"vector_index\",\n type=\"vectorSearch\",\n)\nresult = collection.create_search_index(model=search_index_model)\nprint(\"New search index named \" + result + \" is building.\")\n from pymongo.operations import SearchIndexModel # Create your index model, then create the search index search_index_model = SearchIndexModel( definition={ \"fields\": [ { \"type\": \"vector\", \"path\": \"embedding\", \"numDimensions\": 1024, \"similarity\": \"dotProduct\", } ] }, name=\"vector_index\", type=\"vectorSearch\", ) result = collection.create_search_index(model=search_index_model) print(\"New search index named \" + result + \" is building.\") New search index named vector_index is building.\nIn\u00a0[\u00a0]: Copied!
import os\n\nfrom openai import AzureOpenAI\nfrom rich.console import Console\nfrom rich.panel import Panel\n\n# Create MongoDB vector search query for \"Attention is All You Need\"\n# (prompt already defined above, reuse if present; else keep this definition)\nprompt = \"Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context.\"\n\n# Generate embedding for the query using VoyageAI (vo already initialized earlier)\nquery_embd_context = (\n vo.contextualized_embed(\n inputs=[[prompt]], model=\"voyage-context-3\", input_type=\"query\"\n )\n .results[0]\n .embeddings[0]\n)\n\n# Vector search pipeline\nsearch_pipeline = [\n {\n \"$vectorSearch\": {\n \"index\": \"vector_index\",\n \"path\": \"embedding\",\n \"queryVector\": query_embd_context,\n \"numCandidates\": 10,\n \"limit\": 10,\n }\n },\n {\"$project\": {\"text\": 1, \"_id\": 0, \"score\": {\"$meta\": \"vectorSearchScore\"}}},\n]\n\nresults = list(collection.aggregate(search_pipeline))\nif not results:\n raise ValueError(\n \"No vector search results returned. Verify the index is built before querying.\"\n )\n\ncontext_texts = [doc[\"text\"] for doc in results]\ncombined_context = \"\\n\\n\".join(context_texts)\n\n# Expect these environment variables to be set (do NOT hardcode secrets):\n# AZURE_OPENAI_API_KEY\n# AZURE_OPENAI_ENDPOINT -> e.g. https://your-resource-name.openai.azure.com/\n# AZURE_OPENAI_API_VERSION (optional, else fallback)\nAZURE_OPENAI_API_KEY = \"**********************\"\nAZURE_OPENAI_ENDPOINT = \"**********************\"\nAZURE_OPENAI_API_VERSION = \"**********************\"\n\n# Initialize Azure OpenAI client (endpoint must NOT include path segments)\nclient = AzureOpenAI(\n api_key=AZURE_OPENAI_API_KEY,\n azure_endpoint=AZURE_OPENAI_ENDPOINT.rstrip(\"/\"),\n api_version=AZURE_OPENAI_API_VERSION,\n)\n\n# Chat completion using retrieved context\nresponse = client.chat.completions.create(\n model=\"gpt-4o-mini\", # Azure deployment name\n messages=[\n {\n \"role\": \"system\",\n \"content\": \"You are a helpful assistant. Use only the provided context to answer questions. 
If the context is insufficient, say so.\",\n },\n {\n \"role\": \"user\",\n \"content\": f\"Context:\\n{combined_context}\\n\\nQuestion: {prompt}\",\n },\n ],\n temperature=0.2,\n)\n\nresponse_text = response.choices[0].message.content\n\nconsole = Console()\nconsole.print(Panel(f\"{prompt}\", title=\"Prompt\", border_style=\"bold red\"))\nconsole.print(\n Panel(response_text, title=\"Generated Content\", border_style=\"bold green\")\n)\n import os from openai import AzureOpenAI from rich.console import Console from rich.panel import Panel # Create MongoDB vector search query for \"Attention is All You Need\" # (prompt already defined above, reuse if present; else keep this definition) prompt = \"Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context.\" # Generate embedding for the query using VoyageAI (vo already initialized earlier) query_embd_context = ( vo.contextualized_embed( inputs=[[prompt]], model=\"voyage-context-3\", input_type=\"query\" ) .results[0] .embeddings[0] ) # Vector search pipeline search_pipeline = [ { \"$vectorSearch\": { \"index\": \"vector_index\", \"path\": \"embedding\", \"queryVector\": query_embd_context, \"numCandidates\": 10, \"limit\": 10, } }, {\"$project\": {\"text\": 1, \"_id\": 0, \"score\": {\"$meta\": \"vectorSearchScore\"}}}, ] results = list(collection.aggregate(search_pipeline)) if not results: raise ValueError( \"No vector search results returned. Verify the index is built before querying.\" ) context_texts = [doc[\"text\"] for doc in results] combined_context = \"\\n\\n\".join(context_texts) # Expect these environment variables to be set (do NOT hardcode secrets): # AZURE_OPENAI_API_KEY # AZURE_OPENAI_ENDPOINT -> e.g. https://your-resource-name.openai.azure.com/ # AZURE_OPENAI_API_VERSION (optional, else fallback) AZURE_OPENAI_API_KEY = \"**********************\" AZURE_OPENAI_ENDPOINT = \"**********************\" AZURE_OPENAI_API_VERSION = \"**********************\" # Initialize Azure OpenAI client (endpoint must NOT include path segments) client = AzureOpenAI( api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT.rstrip(\"/\"), api_version=AZURE_OPENAI_API_VERSION, ) # Chat completion using retrieved context response = client.chat.completions.create( model=\"gpt-4o-mini\", # Azure deployment name messages=[ { \"role\": \"system\", \"content\": \"You are a helpful assistant. Use only the provided context to answer questions. 
If the context is insufficient, say so.\", }, { \"role\": \"user\", \"content\": f\"Context:\\n{combined_context}\\n\\nQuestion: {prompt}\", }, ], temperature=0.2, ) response_text = response.choices[0].message.content console = Console() console.print(Panel(f\"{prompt}\", title=\"Prompt\", border_style=\"bold red\")) console.print( Panel(response_text, title=\"Generated Content\", border_style=\"bold green\") ) \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 1. **Introduction of the Transformer Architecture**: The Transformer model is a novel architecture that relies \u2502\n\u2502 entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This allows for \u2502\n\u2502 significantly more parallelization during training and leads to superior performance in tasks such as machine \u2502\n\u2502 translation. \u2502\n\u2502 \u2502\n\u2502 2. **Performance and Efficiency**: The Transformer achieves state-of-the-art results on machine translation \u2502\n\u2502 tasks, such as a BLEU score of 28.4 on the WMT 2014 English-to-German task and 41.8 on the English-to-French \u2502\n\u2502 task, while requiring much less training time (3.5 days on eight GPUs) compared to previous models. This \u2502\n\u2502 demonstrates the efficiency and effectiveness of the architecture. \u2502\n\u2502 \u2502\n\u2502 3. **Self-Attention Mechanism**: The self-attention layers in both the encoder and decoder allow for each \u2502\n\u2502 position to attend to all other positions in the sequence, enabling the model to capture global dependencies. \u2502\n\u2502 This mechanism is more computationally efficient than recurrent layers, which require sequential operations, \u2502\n\u2502 thus improving the model's speed and scalability. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
This notebook demonstrated a powerful RAG pipeline using MongoDB, VoyageAI, and Azure OpenAI. By combining MongoDB's vector search capabilities with VoyageAI's embeddings and Azure OpenAI's language models, we created an intelligent document retrieval system.
"},{"location":"examples/rag_mongodb/#rag-with-mongodb-voyageai","title":"RAG with MongoDB + VoyageAI\u00b6","text":""},{"location":"examples/rag_mongodb/#how-to-cook","title":"How to cook\u00b6","text":"This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) pipeline using MongoDB as a vector store and Voyage AI embedding models for semantic search. The workflow involves extracting and chunking text from documents, generating embeddings with Voyage AI, storing vectors in MongoDB, and leveraging OpenAI for generative responses.
By combining these technologies, you can build scalable, production-ready RAG systems for advanced document understanding and question answering.
"},{"location":"examples/rag_mongodb/#setting-up-your-environment","title":"Setting Up Your Environment\u00b6","text":"First, we'll install the necessary libraries and configure our environment. These packages enable document processing, database connections, embedding generation, and AI model interaction. We're using Docling for document handling, PyMongo for MongoDB integration, VoyageAI for embeddings, and OpenAI client for generation capabilities.
"},{"location":"examples/rag_mongodb/#part-1-setting-up-docling","title":"Part 1: Setting up Docling\u00b6","text":"Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
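A minimal sketch of such a check, following the same pattern used in the OpenSearch recipe later in this document:

```python
import torch

# Prefer CUDA, fall back to Apple MPS, otherwise raise so the environment can be fixed.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    raise OSError(
        "No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
    )
```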
"},{"location":"examples/rag_mongodb/#single-document-rag-baseline","title":"Single-Document RAG Baseline\u00b6","text":"To begin, we will focus on a single seminal paper and treat it as the entire knowledge base. Building a Retrieval-Augmented Generation (RAG) pipeline on just one document serves as a clear, controlled baseline before scaling to multiple sources. This helps validate each stage of the workflow (parsing, chunking, embedding, retrieval, generation) without confounding factors introduced by inter-document noise.
"},{"location":"examples/rag_mongodb/#convert-source-documents-to-markdown","title":"Convert Source Documents to Markdown\u00b6","text":"Convert each source URL to Markdown with Docling, reusing any already-converted document to avoid redundant downloads/parsing. Produces a dict mapping URLs to their Markdown content.
There are other export methods that can be used as well, such as JSON or HTML, if Markdown is not the desired output format.
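A minimal sketch of the conversion-and-caching step described above, assuming a SOURCES list of URLs and an in-memory converted_docs cache (both names are illustrative, not necessarily the notebook's):

```python
from docling.document_converter import DocumentConverter

# Illustrative inputs; the notebook's actual variable names may differ.
SOURCES = ["https://arxiv.org/pdf/1706.03762"]  # "Attention Is All You Need"
converted_docs: dict[str, str] = {}

converter = DocumentConverter()
for url in SOURCES:
    if url in converted_docs:
        continue  # reuse the already-converted document to avoid redundant downloads/parsing
    result = converter.convert(url)
    converted_docs[url] = result.document.export_to_markdown()

print({url: len(md) for url, md in converted_docs.items()})
```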
"},{"location":"examples/rag_mongodb/#post-process-extracted-document-data","title":"Post-process extracted document data\u00b6","text":"We use Docling's HierarchicalChunker() to perform hierarchy-aware chunking of our list of documents. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.
With the generated embeddings prepared, we now insert them into MongoDB so they can be leveraged in the RAG pipeline.
MongoDB is a natural fit as a vector store for RAG applications: Atlas Vector Search provides similarity queries over embeddings, while the flexible document model lets us store each chunk's text, embedding, and metadata together in a single document.
The chunks with their embeddings will be stored in a MongoDB collection, allowing us to perform similarity searches when responding to user queries.
"},{"location":"examples/rag_mongodb/#creating-atlas-vector-search-index","title":"Creating Atlas Vector search index\u00b6","text":"Using pymongo we can create a vector index, that will help us search through our vectors and respond to user queries. This index is crucial for efficient similarity searches between user questions and our document chunks. MongoDB Atlas Vector Search provides fast and accurate retrieval of semantically related content, which forms the foundation of our RAG pipeline.
To perform a query on the vectorized data stored in MongoDB, we can use the $vectorSearch aggregation stage. This powerful feature of MongoDB Atlas enables semantic search capabilities by finding documents based on vector similarity.
When executing a vector search query, the user's question is first embedded with the same Voyage model used for the document chunks; the $vectorSearch stage then compares that query vector against the stored embeddings and returns the closest matches, ranked by their similarity score.
This enables us to find semantically related content rather than relying on exact keyword matches. The similarity metric we're using (dot product) is equivalent to cosine similarity for embeddings normalized to unit length, allowing us to identify content that is conceptually similar even if it uses different terminology.
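As a quick numerical illustration of why dot product and cosine similarity agree for unit-length vectors (a standalone sanity check using numpy, not part of the pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors normalized to unit length, mimicking normalized embeddings.
a = rng.normal(size=1024)
a /= np.linalg.norm(a)
b = rng.normal(size=1024)
b /= np.linalg.norm(b)

dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(abs(dot - cosine) < 1e-12)  # True: identical for unit-norm vectors
```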
For RAG applications, this vector search capability is crucial as it allows us to retrieve the most relevant context from our document collection based on the semantic meaning of a user's query, providing the foundation for generating accurate and contextually appropriate responses.
"},{"location":"examples/rag_mongodb/#part-4-perform-rag-on-parsed-articles","title":"Part 4: Perform RAG on parsed articles\u00b6","text":"We specify a prompt that includes the field we want to search through in the database (in this case it's text), a query that includes our search term, and the number of retrieved results to use in the generation.
Start building your own intelligent document retrieval system today!
"},{"location":"examples/rag_opensearch/","title":"RAG with OpenSearch","text":"In\u00a0[1]: Copied!import os\n\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\n! uv pip install -q --no-progress notebook ipywidgets docling llama-index-readers-file llama-index-readers-docling llama-index-readers-elasticsearch llama-index-node-parser-docling llama-index-vector-stores-opensearch llama-index-embeddings-huggingface llama-index-llms-ollama\nimport os os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" ! uv pip install -q --no-progress notebook ipywidgets docling llama-index-readers-file llama-index-readers-docling llama-index-readers-elasticsearch llama-index-node-parser-docling llama-index-vector-stores-opensearch llama-index-embeddings-huggingface llama-index-llms-ollama
We now import all the necessary modules for this notebook:
In\u00a0[2]: Copied!import logging\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nimport requests\nimport torch\nfrom docling_core.transforms.chunker import HierarchicalChunker\nfrom docling_core.transforms.chunker.hierarchical_chunker import (\n ChunkingDocSerializer,\n ChunkingSerializerProvider,\n)\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom docling_core.transforms.serializer.markdown import MarkdownTableSerializer\nfrom llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex\nfrom llama_index.core.data_structs import Node\nfrom llama_index.core.response_synthesizers import get_response_synthesizer\nfrom llama_index.core.schema import NodeWithScore, TransformComponent\nfrom llama_index.core.vector_stores import MetadataFilter, MetadataFilters\nfrom llama_index.core.vector_stores.types import VectorStoreQueryMode\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.ollama import Ollama\nfrom llama_index.node_parser.docling import DoclingNodeParser\nfrom llama_index.readers.docling import DoclingReader\nfrom llama_index.readers.elasticsearch import ElasticsearchReader\nfrom llama_index.vector_stores.opensearch import (\n OpensearchVectorClient,\n OpensearchVectorStore,\n)\nfrom rich.console import Console\nfrom rich.pretty import pprint\nfrom transformers import AutoTokenizer\n\nfrom docling.chunking import HybridChunker\n\nlogging.getLogger().setLevel(logging.WARNING)\nimport logging from pathlib import Path from tempfile import mkdtemp import requests import torch from docling_core.transforms.chunker import HierarchicalChunker from docling_core.transforms.chunker.hierarchical_chunker import ( ChunkingDocSerializer, ChunkingSerializerProvider, ) from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from docling_core.transforms.serializer.markdown import MarkdownTableSerializer from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex from llama_index.core.data_structs import Node from llama_index.core.response_synthesizers import get_response_synthesizer from llama_index.core.schema import NodeWithScore, TransformComponent from llama_index.core.vector_stores import MetadataFilter, MetadataFilters from llama_index.core.vector_stores.types import VectorStoreQueryMode from llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.ollama import Ollama from llama_index.node_parser.docling import DoclingNodeParser from llama_index.readers.docling import DoclingReader from llama_index.readers.elasticsearch import ElasticsearchReader from llama_index.vector_stores.opensearch import ( OpensearchVectorClient, OpensearchVectorStore, ) from rich.console import Console from rich.pretty import pprint from transformers import AutoTokenizer from docling.chunking import HybridChunker logging.getLogger().setLevel(logging.WARNING)
/Users/ceb/git/docling/.venv/lib/python3.12/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'validate_default' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'validate_default' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.\n warnings.warn(\n
Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks if a GPU is available, either via CUDA or MPS.
In\u00a0[3]: Copied!# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) MPS GPU is enabled.\nIn\u00a0[4]: Copied!
response = requests.get(\"http://localhost:9200\")\nprint(response.text)\nresponse = requests.get(\"http://localhost:9200\") print(response.text)
{\n \"name\" : \"b20d8368e745\",\n \"cluster_name\" : \"docker-cluster\",\n \"cluster_uuid\" : \"0gEZCJQwRHabS_E-n_3i9g\",\n \"version\" : {\n \"distribution\" : \"opensearch\",\n \"number\" : \"3.0.0\",\n \"build_type\" : \"tar\",\n \"build_hash\" : \"dc4efa821904cc2d7ea7ef61c0f577d3fc0d8be9\",\n \"build_date\" : \"2025-05-03T06:23:50.311109522Z\",\n \"build_snapshot\" : false,\n \"lucene_version\" : \"10.1.0\",\n \"minimum_wire_compatibility_version\" : \"2.19.0\",\n \"minimum_index_compatibility_version\" : \"2.0.0\"\n },\n \"tagline\" : \"The OpenSearch Project: https://opensearch.org/\"\n}\n\n In\u00a0[5]: Copied! # http endpoint for your cluster\nOPENSEARCH_ENDPOINT = \"http://localhost:9200\"\n# index to store the Docling document vectors\nOPENSEARCH_INDEX = \"docling-index\"\n# the embedding model\nEMBED_MODEL = HuggingFaceEmbedding(\n model_name=\"ibm-granite/granite-embedding-30m-english\"\n)\n# maximum chunk size in tokens\nEMBED_MAX_TOKENS = 200\n# the generation model\nGEN_MODEL = Ollama(\n model=\"granite4:tiny-h\",\n request_timeout=120.0,\n # Manually set the context window to limit memory usage\n context_window=8000,\n # Set temperature to 0 for reproducibility of the results\n temperature=0.0,\n)\n# a sample document\nSOURCE = \"https://arxiv.org/pdf/2408.09869\"\n\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\nprint(f\"The embedding dimension is {embed_dim}.\")\n # http endpoint for your cluster OPENSEARCH_ENDPOINT = \"http://localhost:9200\" # index to store the Docling document vectors OPENSEARCH_INDEX = \"docling-index\" # the embedding model EMBED_MODEL = HuggingFaceEmbedding( model_name=\"ibm-granite/granite-embedding-30m-english\" ) # maximum chunk size in tokens EMBED_MAX_TOKENS = 200 # the generation model GEN_MODEL = Ollama( model=\"granite4:tiny-h\", request_timeout=120.0, # Manually set the context window to limit memory usage context_window=8000, # Set temperature to 0 for reproducibility of the results temperature=0.0, ) # a sample document SOURCE = \"https://arxiv.org/pdf/2408.09869\" embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\")) print(f\"The embedding dimension is {embed_dim}.\") The embedding dimension is 384.\n
In this recipe, we will use a single PDF file, the Docling Technical Report. We will process it using the Hybrid Chunker provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks.
In\u00a0[6]: Copied!tmp_dir_path = Path(mkdtemp())\nreq = requests.get(SOURCE)\nwith open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n out_file.write(req.content)\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\ndir_reader = SimpleDirectoryReader(\n input_dir=tmp_dir_path,\n file_extractor={\".pdf\": reader},\n)\n\n# load the PDF files\ndocuments = dir_reader.load_data()\n tmp_dir_path = Path(mkdtemp()) req = requests.get(SOURCE) with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file: out_file.write(req.content) reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) dir_reader = SimpleDirectoryReader( input_dir=tmp_dir_path, file_extractor={\".pdf\": reader}, ) # load the PDF files documents = dir_reader.load_data() In\u00a0[7]: Copied! # create the hybrid chunker\ntokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL.model_name),\n max_tokens=EMBED_MAX_TOKENS,\n)\nchunker = HybridChunker(tokenizer=tokenizer)\n\n# create a Docling node parser\nnode_parser = DoclingNodeParser(chunker=chunker)\n\n\n# create a custom transformation to avoid out-of-range integers\nclass MetadataTransform(TransformComponent):\n def __call__(self, nodes, **kwargs):\n for node in nodes:\n binary_hash = node.metadata.get(\"origin\", {}).get(\"binary_hash\", None)\n if binary_hash is not None:\n node.metadata[\"origin\"][\"binary_hash\"] = str(binary_hash)\n return nodes\n # create the hybrid chunker tokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL.model_name), max_tokens=EMBED_MAX_TOKENS, ) chunker = HybridChunker(tokenizer=tokenizer) # create a Docling node parser node_parser = DoclingNodeParser(chunker=chunker) # create a custom transformation to avoid out-of-range integers class MetadataTransform(TransformComponent): def __call__(self, nodes, **kwargs): for node in nodes: binary_hash = node.metadata.get(\"origin\", {}).get(\"binary_hash\", None) if binary_hash is not None: node.metadata[\"origin\"][\"binary_hash\"] = str(binary_hash) return nodes In\u00a0[8]: Copied! # OpensearchVectorClient stores text in this field by default\ntext_field = \"content\"\n# OpensearchVectorClient stores embeddings in this field by default\nembed_field = \"embedding\"\n\nclient = OpensearchVectorClient(\n endpoint=OPENSEARCH_ENDPOINT,\n index=OPENSEARCH_INDEX,\n dim=embed_dim,\n engine=\"faiss\",\n embedding_field=embed_field,\n text_field=text_field,\n)\n\nvector_store = OpensearchVectorStore(client)\nstorage_context = StorageContext.from_defaults(vector_store=vector_store)\n\nindex = VectorStoreIndex.from_documents(\n documents=documents,\n transformations=[node_parser, MetadataTransform()],\n storage_context=storage_context,\n embed_model=EMBED_MODEL,\n)\n# OpensearchVectorClient stores text in this field by default text_field = \"content\" # OpensearchVectorClient stores embeddings in this field by default embed_field = \"embedding\" client = OpensearchVectorClient( endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX, dim=embed_dim, engine=\"faiss\", embedding_field=embed_field, text_field=text_field, ) vector_store = OpensearchVectorStore(client) storage_context = StorageContext.from_defaults(vector_store=vector_store) index = VectorStoreIndex.from_documents( documents=documents, transformations=[node_parser, MetadataTransform()], storage_context=storage_context, embed_model=EMBED_MODEL, )
2025-10-24 15:05:49,841 - WARNING - GET http://localhost:9200/docling-index [status:404 request:0.006s]\nIn\u00a0[9]: Copied!
console = Console(width=88)\n\nQUERY = \"Which are the main AI models in Docling?\"\nquery_engine = index.as_query_engine(llm=GEN_MODEL)\nres = query_engine.query(QUERY)\n\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n console = Console(width=88) QUERY = \"Which are the main AI models in Docling?\" query_engine = index.as_query_engine(llm=GEN_MODEL) res = query_engine.query(QUERY) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: Which are the main AI models in Docling?\n\ud83e\udd16: The two main AI models used in Docling are:\n\n1. A layout analysis model, an accurate object-detector for page elements \n2. TableFormer, a state-of-the-art table structure recognition model\n\nThese models were initially released as part of the open-source Docling package to help \nwith document understanding tasks.\nIn\u00a0[10]: Copied!
QUERY = \"What is the time to solution with the native backend on Intel?\"\nquery_engine = index.as_query_engine(llm=GEN_MODEL)\nres = query_engine.query(QUERY)\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n QUERY = \"What is the time to solution with the native backend on Intel?\" query_engine = index.as_query_engine(llm=GEN_MODEL) res = query_engine.query(QUERY) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: What is the time to solution with the native backend on Intel?\n\ud83e\udd16: The time to solution (TTS) for the native backend on Intel is:\n- For Apple M3 Max (16 cores): 375 seconds \n- For Intel(R) Xeon E5-2690, native backend: 244 seconds\n\nSo the TTS with the native backend on Intel ranges from approximately 244 to 375 seconds\ndepending on the specific configuration.\n
The result above was generated with the table serialized in a triplet format. Language models may perform better on complex tables if the structure is represented in a format that is widely adopted, like markdown.
For this purpose, we can leverage a custom serializer that transforms tables into Markdown format:
In\u00a0[11]: Copied!class MDTableSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n # configuring a different table serializer\n table_serializer=MarkdownTableSerializer(),\n )\n\n\n# clear the database from the previous chunks\nclient.clear()\nvector_store.clear()\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n max_tokens=EMBED_MAX_TOKENS,\n serializer_provider=MDTableSerializerProvider(),\n)\nnode_parser = DoclingNodeParser(chunker=chunker)\nindex = VectorStoreIndex.from_documents(\n documents=documents,\n transformations=[node_parser, MetadataTransform()],\n storage_context=storage_context,\n embed_model=EMBED_MODEL,\n)\nclass MDTableSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, # configuring a different table serializer table_serializer=MarkdownTableSerializer(), ) # clear the database from the previous chunks client.clear() vector_store.clear() chunker = HybridChunker( tokenizer=tokenizer, max_tokens=EMBED_MAX_TOKENS, serializer_provider=MDTableSerializerProvider(), ) node_parser = DoclingNodeParser(chunker=chunker) index = VectorStoreIndex.from_documents( documents=documents, transformations=[node_parser, MetadataTransform()], storage_context=storage_context, embed_model=EMBED_MODEL, )
Token indices sequence length is longer than the specified maximum sequence length for this model (538 > 512). Running this sequence through the model will result in indexing errors\nIn\u00a0[12]: Copied!
query_engine = index.as_query_engine(llm=GEN_MODEL)\nres = query_engine.query(QUERY)\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n query_engine = index.as_query_engine(llm=GEN_MODEL) res = query_engine.query(QUERY) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: What is the time to solution with the native backend on Intel?\n\ud83e\udd16: The table shows that for the native backend on Intel systems, the time-to-solution \n(TTS) ranges from 239 seconds to 375 seconds. Specifically:\n- With 4 threads, the TTS is 239 seconds.\n- With 16 threads, the TTS is 244 seconds.\n\nSo the time to solution with the native backend on Intel varies between approximately \n239 and 375 seconds depending on the thread budget used.\n
Observe that the generated response is now more accurate. Refer to the Advanced chunking & serialization example for more details on serialization strategies.
In\u00a0[13]: Copied!def display_nodes(nodes):\n res = []\n for idx, item in enumerate(nodes):\n doc_res = {\"k\": idx + 1, \"score\": item.score, \"text\": item.text, \"items\": []}\n doc_items = item.metadata[\"doc_items\"]\n for doc in doc_items:\n doc_res[\"items\"].append({\"ref\": doc[\"self_ref\"], \"label\": doc[\"label\"]})\n res.append(doc_res)\n pprint(res, max_string=200)\n def display_nodes(nodes): res = [] for idx, item in enumerate(nodes): doc_res = {\"k\": idx + 1, \"score\": item.score, \"text\": item.text, \"items\": []} doc_items = item.metadata[\"doc_items\"] for doc in doc_items: doc_res[\"items\"].append({\"ref\": doc[\"self_ref\"], \"label\": doc[\"label\"]}) res.append(doc_res) pprint(res, max_string=200) In\u00a0[14]: Copied! retriever = index.as_retriever(similarity_top_k=1)\n\nQUERY = \"How does pypdfium perform?\"\nnodes = retriever.retrieve(QUERY)\n\nprint(QUERY)\ndisplay_nodes(nodes)\nretriever = index.as_retriever(similarity_top_k=1) QUERY = \"How does pypdfium perform?\" nodes = retriever.retrieve(QUERY) print(QUERY) display_nodes(nodes)
How does pypdfium perform?\n
[\n\u2502 {\n\u2502 \u2502 'k': 1,\n\u2502 \u2502 'score': 0.694972,\n\u2502 \u2502 'text': '- [13] B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. Doclaynet: a large humanannotated dataset for document-layout segmentation. pages 3743-3751, 2022.\\n- [14] pypdf Maintainers. pypdf: '+314,\n\u2502 \u2502 'items': [\n\u2502 \u2502 \u2502 {'ref': '#/texts/93', 'label': 'list_item'},\n\u2502 \u2502 \u2502 {'ref': '#/texts/94', 'label': 'list_item'},\n\u2502 \u2502 \u2502 {'ref': '#/texts/95', 'label': 'list_item'},\n\u2502 \u2502 \u2502 {'ref': '#/texts/96', 'label': 'list_item'}\n\u2502 \u2502 ]\n\u2502 }\n]\n We may want to restrict the retrieval to only those chunks containing tabular data, expecting to retrieve more quantitative information for our type of question:
In\u00a0[15]: Copied!filters = MetadataFilters(\n filters=[MetadataFilter(key=\"doc_items.label\", value=\"table\")]\n)\n\ntable_retriever = index.as_retriever(filters=filters, similarity_top_k=1)\nnodes = table_retriever.retrieve(QUERY)\n\nprint(QUERY)\ndisplay_nodes(nodes)\nfilters = MetadataFilters( filters=[MetadataFilter(key=\"doc_items.label\", value=\"table\")] ) table_retriever = index.as_retriever(filters=filters, similarity_top_k=1) nodes = table_retriever.retrieve(QUERY) print(QUERY) display_nodes(nodes)
How does pypdfium perform?\n
[\n\u2502 {\n\u2502 \u2502 'k': 1,\n\u2502 \u2502 'score': 0.6238112,\n\u2502 \u2502 'text': 'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TT'+515,\n\u2502 \u2502 'items': [{'ref': '#/tables/0', 'label': 'table'}, {'ref': '#/tables/0', 'label': 'table'}]\n\u2502 }\n]\n In\u00a0[16]: Copied! url = f\"{OPENSEARCH_ENDPOINT}/_search/pipeline/rrf-pipeline\"\nheaders = {\"Content-Type\": \"application/json\"}\nbody = {\n \"description\": \"Post processor for hybrid RRF search\",\n \"phase_results_processors\": [\n {\"score-ranker-processor\": {\"combination\": {\"technique\": \"rrf\"}}}\n ],\n}\n\nresponse = requests.put(url, json=body, headers=headers)\nprint(response.text)\n url = f\"{OPENSEARCH_ENDPOINT}/_search/pipeline/rrf-pipeline\" headers = {\"Content-Type\": \"application/json\"} body = { \"description\": \"Post processor for hybrid RRF search\", \"phase_results_processors\": [ {\"score-ranker-processor\": {\"combination\": {\"technique\": \"rrf\"}}} ], } response = requests.put(url, json=body, headers=headers) print(response.text) {\"acknowledged\":true}\n We can then repeat the previous steps to get a VectorStoreIndex object, leveraging the search pipeline that we just created:
client_rrf = OpensearchVectorClient(\n endpoint=OPENSEARCH_ENDPOINT,\n index=f\"{OPENSEARCH_INDEX}-rrf\",\n dim=embed_dim,\n engine=\"faiss\",\n embedding_field=embed_field,\n text_field=text_field,\n search_pipeline=\"rrf-pipeline\",\n)\n\nvector_store_rrf = OpensearchVectorStore(client_rrf)\nstorage_context_rrf = StorageContext.from_defaults(vector_store=vector_store_rrf)\nindex_hybrid = VectorStoreIndex.from_documents(\n documents=documents,\n transformations=[node_parser, MetadataTransform()],\n storage_context=storage_context_rrf,\n embed_model=EMBED_MODEL,\n)\n client_rrf = OpensearchVectorClient( endpoint=OPENSEARCH_ENDPOINT, index=f\"{OPENSEARCH_INDEX}-rrf\", dim=embed_dim, engine=\"faiss\", embedding_field=embed_field, text_field=text_field, search_pipeline=\"rrf-pipeline\", ) vector_store_rrf = OpensearchVectorStore(client_rrf) storage_context_rrf = StorageContext.from_defaults(vector_store=vector_store_rrf) index_hybrid = VectorStoreIndex.from_documents( documents=documents, transformations=[node_parser, MetadataTransform()], storage_context=storage_context_rrf, embed_model=EMBED_MODEL, ) 2025-10-24 15:06:05,175 - WARNING - GET http://localhost:9200/docling-index-rrf [status:404 request:0.001s]\n
The first retriever, which relies entirely on semantic (vector) search, fails to surface the supporting chunk for the given question in the top 1 position. Note that we highlight a few expected keywords for illustration purposes.
In\u00a0[18]: Copied!QUERY = \"Does Docling project provide a Dockerfile?\"\nretriever = index.as_retriever(similarity_top_k=3)\nnodes = retriever.retrieve(QUERY)\nexp = \"Docling also provides a Dockerfile\"\nstart = \"[bold yellow]\"\nend = \"[/]\"\nfor idx, item in enumerate(nodes):\n console.print(\n f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n )\n QUERY = \"Does Docling project provide a Dockerfile?\" retriever = index.as_retriever(similarity_top_k=3) nodes = retriever.retrieve(QUERY) exp = \"Docling also provides a Dockerfile\" start = \"[bold yellow]\" end = \"[/]\" for idx, item in enumerate(nodes): console.print( f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\" ) *** k=1 ***\nDocling is designed to allow easy extension of the model library and pipelines. In the \nfuture, we plan to extend Docling with several more models, such as a figure-classifier \nmodel, an equationrecognition model, a code-recognition model and more. This will help \nimprove the quality of conversion for specific types of content, as well as augment \nextracted document metadata with additional information. Further investment into testing\nand optimizing GPU acceleration as well as improving the Docling-native PDF backend are \non our roadmap, too.\nWe encourage everyone to propose or implement additional features and models, and will \ngladly take your inputs and contributions under review . The codebase of Docling is open\nfor use and contribution, under the MIT license agreement and in alignment with our \ncontributing guidelines included in the Docling repository. If you use Docling in your \nprojects, please consider citing this technical report.\n
*** k=2 ***\nIn the final pipeline stage, Docling assembles all prediction results produced on each \npage into a well-defined datatype that encapsulates a converted document, as defined in \nthe auxiliary package docling-core . The generated document object is passed through a \npost-processing model which leverages several algorithms to augment features, such as \ndetection of the document language, correcting the reading order, matching figures with \ncaptions and labelling metadata such as title, authors and references. The final output \ncan then be serialized to JSON or transformed into a Markdown representation at the \nusers request.\n
*** k=3 ***\n```\nsource = \"https://arxiv.org/pdf/2206.01062\" # PDF path or URL converter = \nDocumentConverter() result = converter.convert_single(source) \nprint(result.render_as_markdown()) # output: \"## DocLayNet: A Large Human -Annotated \nDataset for Document -Layout Analysis [...]\"\n```\nOptionally, you can configure custom pipeline features and runtime options, such as \nturning on or off features (e.g. OCR, table structure recognition), enforcing limits on \nthe input document size, and defining the budget of CPU threads. Advanced usage examples\nand options are documented in the README file. Docling also provides a Dockerfile to \ndemonstrate how to install and run it inside a container.\n
However, the retriever with the hybrid search pipeline effectively recognizes the key paragraph in the first position:
In\u00a0[19]: Copied!retriever_rrf = index_hybrid.as_retriever(\n vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3\n)\nnodes = retriever_rrf.retrieve(QUERY)\nfor idx, item in enumerate(nodes):\n console.print(\n f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n )\n retriever_rrf = index_hybrid.as_retriever( vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3 ) nodes = retriever_rrf.retrieve(QUERY) for idx, item in enumerate(nodes): console.print( f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\" ) *** k=1 ***\n```\nsource = \"https://arxiv.org/pdf/2206.01062\" # PDF path or URL converter = \nDocumentConverter() result = converter.convert_single(source) \nprint(result.render_as_markdown()) # output: \"## DocLayNet: A Large Human -Annotated \nDataset for Document -Layout Analysis [...]\"\n```\nOptionally, you can configure custom pipeline features and runtime options, such as \nturning on or off features (e.g. OCR, table structure recognition), enforcing limits on \nthe input document size, and defining the budget of CPU threads. Advanced usage examples\nand options are documented in the README file. Docling also provides a Dockerfile to \ndemonstrate how to install and run it inside a container.\n
*** k=2 ***\nDocling is designed to allow easy extension of the model library and pipelines. In the \nfuture, we plan to extend Docling with several more models, such as a figure-classifier \nmodel, an equationrecognition model, a code-recognition model and more. This will help \nimprove the quality of conversion for specific types of content, as well as augment \nextracted document metadata with additional information. Further investment into testing\nand optimizing GPU acceleration as well as improving the Docling-native PDF backend are \non our roadmap, too.\nWe encourage everyone to propose or implement additional features and models, and will \ngladly take your inputs and contributions under review . The codebase of Docling is open\nfor use and contribution, under the MIT license agreement and in alignment with our \ncontributing guidelines included in the Docling repository. If you use Docling in your \nprojects, please consider citing this technical report.\n
*** k=3 ***\nWe therefore decided to provide multiple backend choices, and additionally open-source a\ncustombuilt PDF parser, which is based on the low-level qpdf [4] library. It is made \navailable in a separate package named docling-parse and powers the default PDF backend \nin Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \nbe a safe backup choice in certain cases, e.g. if issues are seen with particular font \nencodings.\n
In the following example, the generated response is wrong, since the top retrieved chunks do not contain all the information that is required to answer the question.
In\u00a0[20]: Copied!QUERY = \"According to the tests with arXiv and IBM Redbooks, which backend should I use if I have limited resources and complex tables?\"\nquery_rrf = index_hybrid.as_query_engine(\n vector_store_query_mode=VectorStoreQueryMode.HYBRID,\n llm=GEN_MODEL,\n similarity_top_k=3,\n)\nres = query_rrf.query(QUERY)\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n QUERY = \"According to the tests with arXiv and IBM Redbooks, which backend should I use if I have limited resources and complex tables?\" query_rrf = index_hybrid.as_query_engine( vector_store_query_mode=VectorStoreQueryMode.HYBRID, llm=GEN_MODEL, similarity_top_k=3, ) res = query_rrf.query(QUERY) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \nhave limited resources and complex tables?\n\ud83e\udd16: According to the tests in this section using both the MacBook Pro M3 Max and \nbare-metal server running Ubuntu 20.04 LTS on an Intel Xeon E5-2690 CPU with a fixed \nthread budget of 4, Docling achieved faster processing speeds when using the \ncustom-built PDF backend based on the low-level qpdf library (docling-parse) compared to\nthe alternative PDF backend relying on pypdfium.\n\nFurthermore, the context mentions that Docling provides a separate package named \ndocling-ibm-models which includes pre-trained weights and inference code for \nTableFormer, a state-of-the-art table structure recognition model. This suggests that if\nyou have complex tables in your documents, using this specialized table recognition \nmodel could be beneficial.\n\nTherefore, based on the tests with arXiv papers and IBM Redbooks, if you have limited \nresources (likely referring to computational power) and need to process documents \ncontaining complex tables, it would be recommended to use the docling-parse PDF backend \nalong with the TableFormer AI model from docling-ibm-models. This combination should \nprovide a good balance of performance and table recognition capabilities for your \nspecific needs.\nIn\u00a0[21]: Copied!
nodes = retriever_rrf.retrieve(QUERY)\nfor idx, item in enumerate(nodes):\n console.print(\n f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\"\n )\n nodes = retriever_rrf.retrieve(QUERY) for idx, item in enumerate(nodes): console.print( f\"*** k={idx + 1} ***\\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}\" ) *** k=1 ***\nIn this section, we establish some reference numbers for the processing speed of Docling\nand the resource budget it requires. All tests in this section are run with default \noptions on our standard test set distributed with Docling, which consists of three \npapers from arXiv and two IBM Redbooks, with a total of 225 pages. Measurements were \ntaken using both available PDF backends on two different hardware systems: one MacBook \nPro M3 Max, and one bare-metal server running Ubuntu 20.04 LTS on an Intel Xeon E5-2690 \nCPU. For reproducibility, we fixed the thread budget (through setting OMP NUM THREADS \nenvironment variable ) once to 4 (Docling default) and once to 16 (equal to full core \ncount on the test hardware). All results are shown in Table 1.\n
*** k=2 ***\nWe therefore decided to provide multiple backend choices, and additionally open-source a\ncustombuilt PDF parser, which is based on the low-level qpdf [4] library. It is made \navailable in a separate package named docling-parse and powers the default PDF backend \nin Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may \nbe a safe backup choice in certain cases, e.g. if issues are seen with particular font \nencodings.\n
*** k=3 ***\nAs part of Docling, we initially release two highly capable AI models to the open-source\ncommunity, which have been developed and published recently by our team. The first model\nis a layout analysis model, an accurate object-detector for page elements [13]. The \nsecond model is TableFormer [12, 9], a state-of-the-art table structure recognition \nmodel. We provide the pre-trained weights (hosted on huggingface) and a separate package\nfor the inference code as docling-ibm-models . Both models are also powering the \nopen-access deepsearch-experience, our cloud-native service for knowledge exploration \ntasks.\n
Even though the top retrieved chunks are relevant for the question, the key information lies in the paragraph after the first chunk:
If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it will come at the expense of worse quality results, especially in table structure recovery.
We next examine the fragments that immediately precede and follow the top\u2011retrieved chunk, so long as those neighbors remain within the same section, to preserve the semantic integrity of the context. The generated answer is now accurate because it has been grounded in the necessary contextual information.
\ud83d\udca1 In a production setting, it may be preferable to persist the parsed documents (i.e., DoclingDocument objects) as JSON in an object store or database and then fetch them when you need to traverse the document for context\u2011expansion scenarios. In this simplified example, however, we will query the OpenSearch index directly to obtain the required chunks.
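For the persistence option mentioned in the tip, one possible way to round-trip a DoclingDocument through JSON could look like the following sketch (the file path and variable names are illustrative; an object store or database would replace the local file in production):

```python
import json

from docling_core.types.doc import DoclingDocument

# Persist the parsed document; a local file stands in for an object store here.
doc = result.document  # a DoclingDocument from a previous Docling conversion
with open("parsed_doc.json", "w", encoding="utf-8") as f:
    json.dump(doc.export_to_dict(), f)

# Later, fetch and re-hydrate it to traverse the document for context expansion.
with open("parsed_doc.json", encoding="utf-8") as f:
    restored = DoclingDocument.model_validate(json.load(f))
print(restored.name)
```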
top_headings = nodes[0].metadata[\"headings\"]\ntop_text = nodes[0].text\n\nrdr = ElasticsearchReader(endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX)\ndocs = rdr.load_data(\n field=text_field,\n query={\n \"query\": {\n \"terms_set\": {\n \"metadata.headings.keyword\": {\n \"terms\": top_headings,\n \"minimum_should_match_script\": {\"source\": \"params.num_terms\"},\n }\n }\n }\n },\n)\next_nodes = []\nfor idx, item in enumerate(docs):\n if item.text == top_text:\n ext_nodes.append(NodeWithScore(node=Node(text=item.text), score=1.0))\n if idx > 0:\n ext_nodes.append(\n NodeWithScore(node=Node(text=docs[idx - 1].text), score=1.0)\n )\n if idx < len(docs) - 1:\n ext_nodes.append(\n NodeWithScore(node=Node(text=docs[idx + 1].text), score=1.0)\n )\n break\n\nsynthesizer = get_response_synthesizer(llm=GEN_MODEL)\nres = synthesizer.synthesize(query=QUERY, nodes=ext_nodes)\nconsole.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\")\n top_headings = nodes[0].metadata[\"headings\"] top_text = nodes[0].text rdr = ElasticsearchReader(endpoint=OPENSEARCH_ENDPOINT, index=OPENSEARCH_INDEX) docs = rdr.load_data( field=text_field, query={ \"query\": { \"terms_set\": { \"metadata.headings.keyword\": { \"terms\": top_headings, \"minimum_should_match_script\": {\"source\": \"params.num_terms\"}, } } } }, ) ext_nodes = [] for idx, item in enumerate(docs): if item.text == top_text: ext_nodes.append(NodeWithScore(node=Node(text=item.text), score=1.0)) if idx > 0: ext_nodes.append( NodeWithScore(node=Node(text=docs[idx - 1].text), score=1.0) ) if idx < len(docs) - 1: ext_nodes.append( NodeWithScore(node=Node(text=docs[idx + 1].text), score=1.0) ) break synthesizer = get_response_synthesizer(llm=GEN_MODEL) res = synthesizer.synthesize(query=QUERY, nodes=ext_nodes) console.print(f\"\ud83d\udc64: {QUERY}\\n\ud83e\udd16: {res.response.strip()}\") \ud83d\udc64: According to the tests with arXiv and IBM Redbooks, which backend should I use if I \nhave limited resources and complex tables?\n\ud83e\udd16: According to the tests described in the provided context, if you need to run Docling\nin a very low-resource environment and are dealing with complex tables that require \nhigh-quality table structure recovery, you should consider configuring the pypdfium \nbackend. The context mentions that while it is faster and more memory efficient than the\ndefault docling-parse backend, it may come at the expense of worse quality results, \nespecially in table structure recovery. Therefore, for limited resources and complex \ntables where quality is crucial, pypdfium would be a suitable choice despite its \npotential drawbacks compared to the default backend.\n"},{"location":"examples/rag_opensearch/#rag-with-opensearch","title":"RAG with OpenSearch\u00b6","text":"Step Tech Execution Embedding HuggingFace (IBM Granite Embedding 30M) \ud83d\udcbb Local Vector store OpenSearch 3.0.0 \ud83d\udcbb Local Gen AI Ollama (IBM Granite 4.0 Tiny) \ud83d\udcbb Local
This is a code recipe that uses OpenSearch, an open-source search and analytics tool, and the LlamaIndex framework to perform RAG over documents parsed by Docling.
In this notebook, we accomplish the following: parsing a sample PDF with Docling, chunking it with the hybrid chunker, embedding and indexing the chunks in a local OpenSearch instance, and running RAG queries (including hybrid search with RRF) through LlamaIndex with a local Ollama model.
For running this notebook on your machine, you can use applications like Jupyter Notebook or Visual Studio Code.
\ud83d\udca1 For best results, please use GPU acceleration to run this notebook.
"},{"location":"examples/rag_opensearch/#virtual-environment","title":"Virtual environment\u00b6","text":"Before installing dependencies and to avoid conflicts in your environment, it is advisable to use a virtual environment (venv). For instance, uv is a popular tool to manage virtual environments and dependencies. You can install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh\n
Then create the virtual environment and activate it:
uv venv\n source .venv/bin/activate\n
Refer to Installing uv for more details.
"},{"location":"examples/rag_opensearch/#dependencies","title":"Dependencies\u00b6","text":"To start, install the required dependencies by running the following command:
"},{"location":"examples/rag_opensearch/#gpu-checking","title":"GPU Checking\u00b6","text":""},{"location":"examples/rag_opensearch/#local-opensearch-instance","title":"Local OpenSearch instance\u00b6","text":"To run the notebook locally, we can pull an OpenSearch image and run a single node for local development. You can use a container tool like Podman or Docker. In the interest of simplicity, we disable the SSL option for this example.
\ud83d\udca1 The version of the OpenSearch instance needs to be compatible with the version of the OpenSearch Python Client library, since this library is used by the LlamaIndex framework, which we leverage in this notebook.
On your computer terminal run:
podman run \\\n -it \\\n --pull always \\\n -p 9200:9200 \\\n -p 9600:9600 \\\n -e \"discovery.type=single-node\" \\\n -e DISABLE_INSTALL_DEMO_CONFIG=true \\\n -e DISABLE_SECURITY_PLUGIN=true \\\n --name opensearch-node \\\n -d opensearchproject/opensearch:3.0.0\n
Once the instance is running, verify that you can connect to OpenSearch:
"},{"location":"examples/rag_opensearch/#language-models","title":"Language models\u00b6","text":"We will use HuggingFace and Ollama to run language models on your local computer, rather than relying on cloud services.
In this example, the following models are considered: ibm-granite/granite-embedding-30m-english (via HuggingFace) for embeddings, and granite4:tiny-h (via Ollama) for answer generation.
Once Ollama is installed on your computer, you can pull the model above from your terminal:
ollama pull granite4:tiny-h\n"},{"location":"examples/rag_opensearch/#setup","title":"Setup\u00b6","text":"
We setup the main variables for OpenSearch and the embedding and generation models.
"},{"location":"examples/rag_opensearch/#process-data-using-docling","title":"Process Data Using Docling\u00b6","text":"Docling can parse various document formats into a unified representation (DoclingDocument), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to Supported formats section of Docling's documentation.
"},{"location":"examples/rag_opensearch/#run-the-document-conversion-pipeline","title":"Run the document conversion pipeline\u00b6","text":"We will convert the original PDF file into a DoclingDocument format using a DoclingReader object. We specify the JSON export type to retain the document hierarchical structure as an input for the next step (chunking the document).
Before the actual ingestion of data, we need to define the data transformations to apply on the DoclingDocument:
DoclingNodeParser executes the document-based chunking with the hybrid chunker, which leverages the tokenizer of the embedding model to ensure that the resulting chunks fit within the model input text limit. MetadataTransform is a custom transformation to ensure that generated chunk metadata is best formatted for indexing with OpenSearch. \ud83d\udca1 For demonstration purposes, we configure the hybrid chunker to produce chunks capped at 200 tokens. The optimal limit will vary according to the specific requirements of the AI application in question. If this value is omitted, the chunker automatically derives the maximum size from the tokenizer. This safeguard guarantees that each chunk remains within the bounds supported by the underlying embedding model.
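A sketch of this transformation setup under the assumptions above (the notebook's custom MetadataTransform component is only referenced, not reimplemented, here):

from docling.chunking import HybridChunker
from llama_index.node_parser.docling import DoclingNodeParser

# Hybrid chunker bound to the embedding model's tokenizer, capped at 200 tokens for the demo
chunker = HybridChunker(
    tokenizer="ibm-granite/granite-embedding-30m-english",  # assumption: same model as EMBED_MODEL
    max_tokens=200,
)
node_parser = DoclingNodeParser(chunker=chunker)

# The notebook also appends its custom MetadataTransform, e.g.:
# transformations = [node_parser, MetadataTransform()]
transformations = [node_parser]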
"},{"location":"examples/rag_opensearch/#embed-and-insert-the-data","title":"Embed and Insert the Data\u00b6","text":"In this step, we create an OpenSearchVectorClient, which encapsulates the logic for a single OpenSearch index with vector search enabled.
We then initialize the index using our sample data (a single PDF file), the Docling node parser, and the OpenSearch client that we just created.
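A sketch of these two steps, assuming the variables defined above (the field names and the 384-dimension value, which should match the embedding size of the Granite 30M model, are assumptions):

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.opensearch import (
    OpensearchVectorClient,
    OpensearchVectorStore,
)

text_field = "content"
embedding_field = "embedding"

# Client encapsulating a single OpenSearch index with vector search enabled
osearch_client = OpensearchVectorClient(
    endpoint=OPENSEARCH_ENDPOINT,
    index=OPENSEARCH_INDEX,
    dim=384,  # assumption: output dimension of the embedding model
    embedding_field=embedding_field,
    text_field=text_field,
)
vector_store = OpensearchVectorStore(osearch_client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Chunk, embed, and index the converted document
index = VectorStoreIndex.from_documents(
    documents=documents,
    transformations=transformations,
    storage_context=storage_context,
    embed_model=EMBED_MODEL,
)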
\ud83d\udca1 You may get a warning message like:
Token indices sequence length is longer than the specified maximum sequence length for this model
This is a false alarm; you can find more background in Docling's FAQ page.
"},{"location":"examples/rag_opensearch/#build-rag","title":"Build RAG\u00b6","text":"In this section, we will see how to assemble a RAG system, execute a query, and get a generated response.
We will also describe how to leverage Docling capabilities to improve RAG results.
"},{"location":"examples/rag_opensearch/#run-a-query","title":"Run a query\u00b6","text":"With LlamaIndex's query engine, we can simply run a RAG system as follows:
"},{"location":"examples/rag_opensearch/#custom-serializers","title":"Custom serializers\u00b6","text":"Docling can extract the table content and process it for chunking, like other text elements.
In the following example, the response is generated from a retrieved chunk containing a table.
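As a brief sketch, reusing the query engine from above with a hypothetical question that targets tabular content (the question is an assumption, not the one used in the notebook):

table_query = "How do the docling-parse and pypdfium backends compare in speed and memory usage?"  # hypothetical
table_result = query_engine.query(table_query)
print(table_result.response.strip())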
"},{"location":"examples/rag_opensearch/#filter-context-query","title":"Filter-context Query\u00b6","text":"By default, the DoclingNodeParser will keep the hierarchical information of items when creating the chunks. That information will be stored as metadata in the OpenSearch index. Leveraging the document structure is a powerful feature of Docling for improving RAG systems, both for retrieval and for answer generation.
For example, we can use chunk metadata with layout information to run queries in a filter context, for high retrieval accuracy.
Using the previous setup, we can see that the most similar chunk corresponds to a paragraph without enough grounding for the question:
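One way to sketch such a filtered retrieval is with LlamaIndex metadata filters (the metadata key and the heading value below are hypothetical and only for illustration; the notebook's actual filter may differ):

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Only consider chunks whose "headings" metadata matches a given section heading
filtered_retriever = index.as_retriever(
    similarity_top_k=3,
    filters=MetadataFilters(filters=[ExactMatchFilter(key="headings", value="5 Applications")]),
)
nodes = filtered_retriever.retrieve(QUERY)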
"},{"location":"examples/rag_opensearch/#hybrid-search-retrieval-with-rrf","title":"Hybrid Search Retrieval with RRF\u00b6","text":"Hybrid search combines keyword and semantic search to improve search relevance. To avoid relying on traditional score normalization techniques, the reciprocal rank fusion (RRF) feature on hybrid search can significantly improve the relevance of the retrieved chunks in our RAG system.
First, create a search pipeline and specify RRF as technique:
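The pipeline-creation cell is not shown here; as a sketch using the opensearch-py client created earlier (the pipeline name is an assumption, and the processor body follows OpenSearch's documented RRF phase-results processor):

# Register a search pipeline that combines lexical and vector results with RRF
rrf_pipeline = {
    "description": "Hybrid search with reciprocal rank fusion",
    "phase_results_processors": [
        {"score-ranker-processor": {"combination": {"technique": "rrf"}}}
    ],
}
client.transport.perform_request(
    "PUT", "/_search/pipeline/rrf-pipeline", body=rrf_pipeline
)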
"},{"location":"examples/rag_opensearch/#context-expansion","title":"Context expansion\u00b6","text":"Using small chunks can offer several benefits: it increases retrieval precision and it keeps the answer generation tightly focused, which improves accuracy, reduces hallucination, and speeds up inferece. However, your RAG system may overlook contextual information necessary for producing a fully grounded response.
Docling's preservation of document structure enables you to employ various strategies for enriching the context available during answer generation within the RAG pipeline. For example, after identifying the most relevant chunk, you might include adjacent chunks from the same section as additional groudning material before generating the final answer.
"},{"location":"examples/rag_weaviate/","title":"RAG with Weaviate","text":"Step Tech Execution Embedding Open AI \ud83c\udf10 Remote Vector store Weavieate \ud83d\udcbb Local Gen AI Open AI \ud83c\udf10 Remote In\u00a0[\u00a0]: Copied!%%capture\n%pip install docling~=\"2.7.0\"\n%pip install -U weaviate-client~=\"4.9.4\"\n%pip install rich\n%pip install torch\n\nimport logging\nimport warnings\n\nwarnings.filterwarnings(\"ignore\")\n\n# Suppress Weaviate client logs\nlogging.getLogger(\"weaviate\").setLevel(logging.ERROR)\n%%capture %pip install docling~=\"2.7.0\" %pip install -U weaviate-client~=\"4.9.4\" %pip install rich %pip install torch import logging import warnings warnings.filterwarnings(\"ignore\") # Suppress Weaviate client logs logging.getLogger(\"weaviate\").setLevel(logging.ERROR) In\u00a0[2]: Copied!
import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n import torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) MPS GPU is enabled.\n
Here, we've collected 10 influential machine learning papers published as PDFs on arXiv. Because Docling does not yet have title extraction for PDFs, we manually add the titles in a corresponding list.
Note: Converting all 10 papers should take around 8 minutes with a T4 GPU.
In\u00a0[3]: Copied!# Influential machine learning papers\nsource_urls = [\n \"https://arxiv.org/pdf/1706.03762\",\n \"https://arxiv.org/pdf/1810.04805\",\n \"https://arxiv.org/pdf/1406.2661\",\n \"https://arxiv.org/pdf/1409.0473\",\n \"https://arxiv.org/pdf/1412.6980\",\n \"https://arxiv.org/pdf/1312.6114\",\n \"https://arxiv.org/pdf/1312.5602\",\n \"https://arxiv.org/pdf/1512.03385\",\n \"https://arxiv.org/pdf/1409.3215\",\n \"https://arxiv.org/pdf/1301.3781\",\n]\n\n# And their corresponding titles (because Docling doesn't have title extraction yet!)\nsource_titles = [\n \"Attention Is All You Need\",\n \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\",\n \"Generative Adversarial Nets\",\n \"Neural Machine Translation by Jointly Learning to Align and Translate\",\n \"Adam: A Method for Stochastic Optimization\",\n \"Auto-Encoding Variational Bayes\",\n \"Playing Atari with Deep Reinforcement Learning\",\n \"Deep Residual Learning for Image Recognition\",\n \"Sequence to Sequence Learning with Neural Networks\",\n \"A Neural Probabilistic Language Model\",\n]\n# Influential machine learning papers source_urls = [ \"https://arxiv.org/pdf/1706.03762\", \"https://arxiv.org/pdf/1810.04805\", \"https://arxiv.org/pdf/1406.2661\", \"https://arxiv.org/pdf/1409.0473\", \"https://arxiv.org/pdf/1412.6980\", \"https://arxiv.org/pdf/1312.6114\", \"https://arxiv.org/pdf/1312.5602\", \"https://arxiv.org/pdf/1512.03385\", \"https://arxiv.org/pdf/1409.3215\", \"https://arxiv.org/pdf/1301.3781\", ] # And their corresponding titles (because Docling doesn't have title extraction yet!) source_titles = [ \"Attention Is All You Need\", \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\", \"Generative Adversarial Nets\", \"Neural Machine Translation by Jointly Learning to Align and Translate\", \"Adam: A Method for Stochastic Optimization\", \"Auto-Encoding Variational Bayes\", \"Playing Atari with Deep Reinforcement Learning\", \"Deep Residual Learning for Image Recognition\", \"Sequence to Sequence Learning with Neural Networks\", \"A Neural Probabilistic Language Model\", ] In\u00a0[4]: Copied!
from docling.document_converter import DocumentConverter\n\n# Instantiate the doc converter\ndoc_converter = DocumentConverter()\n\n# Directly pass list of files or streams to `convert_all`\nconv_results_iter = doc_converter.convert_all(source_urls) # previously `convert`\n\n# Iterate over the generator to get a list of Docling documents\ndocs = [result.document for result in conv_results_iter]\nfrom docling.document_converter import DocumentConverter # Instantiate the doc converter doc_converter = DocumentConverter() # Directly pass list of files or streams to `convert_all` conv_results_iter = doc_converter.convert_all(source_urls) # previously `convert` # Iterate over the generator to get a list of Docling documents docs = [result.document for result in conv_results_iter]
Fetching 9 files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 9/9 [00:00<00:00, 84072.91it/s]\n
ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS\nIn\u00a0[5]: Copied!
from docling_core.transforms.chunker import HierarchicalChunker\n\n# Initialize lists for text, and titles\ntexts, titles = [], []\n\nchunker = HierarchicalChunker()\n\n# Process each document in the list\nfor doc, title in zip(docs, source_titles): # Pair each document with its title\n chunks = list(\n chunker.chunk(doc)\n ) # Perform hierarchical chunking and get text from chunks\n for chunk in chunks:\n texts.append(chunk.text)\n titles.append(title)\nfrom docling_core.transforms.chunker import HierarchicalChunker # Initialize lists for text, and titles texts, titles = [], [] chunker = HierarchicalChunker() # Process each document in the list for doc, title in zip(docs, source_titles): # Pair each document with its title chunks = list( chunker.chunk(doc) ) # Perform hierarchical chunking and get text from chunks for chunk in chunks: texts.append(chunk.text) titles.append(title)
Because we're splitting the documents into chunks, we'll concatenate the article title to the beginning of each chunk for additional context.
In\u00a0[6]: Copied!# Concatenate title and text\nfor i in range(len(texts)):\n    texts[i] = f\"{titles[i]} {texts[i]}\"\n # Concatenate title and text for i in range(len(texts)): texts[i] = f\"{titles[i]} {texts[i]}\" We'll be using the OpenAI API for both generating the text embeddings and for the generative model in our RAG pipeline. The code below dynamically fetches your API key based on whether you're running this notebook in Google Colab or as a regular Jupyter notebook. All you need to do is replace openai_api_key_var with the name of your environment variable or Colab secret for the API key.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
In\u00a0[7]: Copied!# OpenAI API key variable name\nopenai_api_key_var = \"OPENAI_API_KEY\" # Replace with the name of your secret/env var\n\n# Fetch OpenAI API key\ntry:\n # If running in Colab, fetch API key from Secrets\n import google.colab\n from google.colab import userdata\n\n openai_api_key = userdata.get(openai_api_key_var)\n if not openai_api_key:\n raise ValueError(f\"Secret '{openai_api_key_var}' not found in Colab secrets.\")\nexcept ImportError:\n # If not running in Colab, fetch API key from environment variable\n import os\n\n openai_api_key = os.getenv(openai_api_key_var)\n if not openai_api_key:\n raise OSError(\n f\"Environment variable '{openai_api_key_var}' is not set. \"\n \"Please define it before running this script.\"\n )\n # OpenAI API key variable name openai_api_key_var = \"OPENAI_API_KEY\" # Replace with the name of your secret/env var # Fetch OpenAI API key try: # If running in Colab, fetch API key from Secrets import google.colab from google.colab import userdata openai_api_key = userdata.get(openai_api_key_var) if not openai_api_key: raise ValueError(f\"Secret '{openai_api_key_var}' not found in Colab secrets.\") except ImportError: # If not running in Colab, fetch API key from environment variable import os openai_api_key = os.getenv(openai_api_key_var) if not openai_api_key: raise OSError( f\"Environment variable '{openai_api_key_var}' is not set. \" \"Please define it before running this script.\" ) Embedded Weaviate allows you to spin up a Weaviate instance directly from your application code, without having to use a Docker container. If you're interested in other deployment methods, like using Docker-Compose or Kubernetes, check out this page in the Weaviate docs.
In\u00a0[\u00a0]: Copied!import weaviate\n\n# Connect to Weaviate embedded\nclient = weaviate.connect_to_embedded(headers={\"X-OpenAI-Api-Key\": openai_api_key})\n import weaviate # Connect to Weaviate embedded client = weaviate.connect_to_embedded(headers={\"X-OpenAI-Api-Key\": openai_api_key}) In\u00a0[\u00a0]: Copied! import weaviate.classes.config as wc\n\n# Define the collection name\ncollection_name = \"docling\"\n\n# Delete the collection if it already exists\nif client.collections.exists(collection_name):\n client.collections.delete(collection_name)\n\n# Create the collection\ncollection = client.collections.create(\n name=collection_name,\n vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(\n model=\"text-embedding-3-large\", # Specify your embedding model here\n ),\n # Enable generative model from Cohere\n generative_config=wc.Configure.Generative.openai(\n model=\"gpt-4o\" # Specify your generative model for RAG here\n ),\n # Define properties of metadata\n properties=[\n wc.Property(name=\"text\", data_type=wc.DataType.TEXT),\n wc.Property(name=\"title\", data_type=wc.DataType.TEXT, skip_vectorization=True),\n ],\n)\nimport weaviate.classes.config as wc # Define the collection name collection_name = \"docling\" # Delete the collection if it already exists if client.collections.exists(collection_name): client.collections.delete(collection_name) # Create the collection collection = client.collections.create( name=collection_name, vectorizer_config=wc.Configure.Vectorizer.text2vec_openai( model=\"text-embedding-3-large\", # Specify your embedding model here ), # Enable generative model from Cohere generative_config=wc.Configure.Generative.openai( model=\"gpt-4o\" # Specify your generative model for RAG here ), # Define properties of metadata properties=[ wc.Property(name=\"text\", data_type=wc.DataType.TEXT), wc.Property(name=\"title\", data_type=wc.DataType.TEXT, skip_vectorization=True), ], ) In\u00a0[10]: Copied!
# Initialize the data object\ndata = []\n\n# Create a dictionary for each row by iterating through the corresponding lists\nfor text, title in zip(texts, titles):\n data_point = {\n \"text\": text,\n \"title\": title,\n }\n data.append(data_point)\n # Initialize the data object data = [] # Create a dictionary for each row by iterating through the corresponding lists for text, title in zip(texts, titles): data_point = { \"text\": text, \"title\": title, } data.append(data_point) In\u00a0[\u00a0]: Copied! # Insert text chunks and metadata into vector DB collection\nresponse = collection.data.insert_many(data)\n\nif response.has_errors:\n print(response.errors)\nelse:\n print(\"Insert complete.\")\n# Insert text chunks and metadata into vector DB collection response = collection.data.insert_many(data) if response.has_errors: print(response.errors) else: print(\"Insert complete.\") In\u00a0[12]: Copied!
from weaviate.classes.query import MetadataQuery\n\nresponse = collection.query.near_text(\n query=\"bert\",\n limit=2,\n return_metadata=MetadataQuery(distance=True),\n return_properties=[\"text\", \"title\"],\n)\n\nfor o in response.objects:\n print(o.properties)\n print(o.metadata.distance)\nfrom weaviate.classes.query import MetadataQuery response = collection.query.near_text( query=\"bert\", limit=2, return_metadata=MetadataQuery(distance=True), return_properties=[\"text\", \"title\"], ) for o in response.objects: print(o.properties) print(o.metadata.distance)
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding A distinctive feature of BERT is its unified architecture across different tasks. There is mini-', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}\n0.6578550338745117\n{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding We introduce a new language representation model called BERT , which stands for B idirectional E ncoder R epresentations from T ransformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}\n0.6696287989616394\n In\u00a0[13]: Copied! from rich.console import Console\nfrom rich.panel import Panel\n\n# Create a prompt where context from the Weaviate collection will be injected\nprompt = \"Explain how {text} works, using only the retrieved context.\"\nquery = \"bert\"\n\nresponse = collection.generate.near_text(\n query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"]\n)\n\n# Prettify the output using Rich\nconsole = Console()\n\nconsole.print(\n Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\")\n)\nconsole.print(\n Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\")\n)\n from rich.console import Console from rich.panel import Panel # Create a prompt where context from the Weaviate collection will be injected prompt = \"Explain how {text} works, using only the retrieved context.\" query = \"bert\" response = collection.generate.near_text( query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"] ) # Prettify the output using Rich console = Console() console.print( Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\") ) console.print( Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\") ) \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Explain how bert works, using only the retrieved context. 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation \u2502\n\u2502 model designed to pretrain deep bidirectional representations from unlabeled text. It conditions on both left \u2502\n\u2502 and right context in all layers, unlike traditional left-to-right or right-to-left language models. This \u2502\n\u2502 pre-training involves two unsupervised tasks. The pre-trained BERT model can then be fine-tuned with just one \u2502\n\u2502 additional output layer to create state-of-the-art models for various tasks, such as question answering and \u2502\n\u2502 language inference, without needing substantial task-specific architecture modifications. A distinctive feature \u2502\n\u2502 of BERT is its unified architecture across different tasks. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\nIn\u00a0[14]: Copied!
# Create a prompt where context from the Weaviate collection will be injected\nprompt = \"Explain how {text} works, using only the retrieved context.\"\nquery = \"a generative adversarial net\"\n\nresponse = collection.generate.near_text(\n query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"]\n)\n\n# Prettify the output using Rich\nconsole = Console()\n\nconsole.print(\n Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\")\n)\nconsole.print(\n Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\")\n)\n # Create a prompt where context from the Weaviate collection will be injected prompt = \"Explain how {text} works, using only the retrieved context.\" query = \"a generative adversarial net\" response = collection.generate.near_text( query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"] ) # Prettify the output using Rich console = Console() console.print( Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\") ) console.print( Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\") ) \u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Explain how a generative adversarial net works, using only the retrieved context. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Generative Adversarial Nets (GANs) operate within an adversarial framework where two models are trained \u2502\n\u2502 simultaneously: a generative model (G) and a discriminative model (D). The generative model aims to capture the \u2502\n\u2502 data distribution and generate samples that mimic real data, while the discriminative model's task is to \u2502\n\u2502 distinguish between samples from the real data and those generated by G. This setup is akin to a game where the \u2502\n\u2502 generative model acts like counterfeiters trying to produce indistinguishable fake currency, and the \u2502\n\u2502 discriminative model acts like the police trying to detect these counterfeits. \u2502\n\u2502 \u2502\n\u2502 The training process involves a minimax two-player game where G tries to maximize the probability of D making a \u2502\n\u2502 mistake, while D tries to minimize it. When both models are defined by multilayer perceptrons, they can be \u2502\n\u2502 trained using backpropagation without the need for Markov chains or approximate inference networks. The \u2502\n\u2502 ultimate goal is for G to perfectly replicate the training data distribution, making D's output equal to 1/2 \u2502\n\u2502 everywhere, indicating it cannot distinguish between real and generated data. This framework allows for \u2502\n\u2502 specific training algorithms and optimization techniques, such as backpropagation and dropout, to be \u2502\n\u2502 effectively utilized. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
We can see that our RAG pipeline performs relatively well for simple queries, especially given the small size of the dataset. Scaling this method for converting a larger sample of PDFs would require more compute (GPUs) and a more advanced deployment of Weaviate (like Docker, Kubernetes, or Weaviate Cloud). For more information on available Weaviate configurations, check out the documentation.
"},{"location":"examples/rag_weaviate/#rag-with-weaviate","title":"RAG with Weaviate\u00b6","text":""},{"location":"examples/rag_weaviate/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"This is a code recipe that uses Weaviate to perform RAG over PDF documents parsed by Docling.
In this notebook, we accomplish the following:
To run this notebook, you'll need:
Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running this notebook:
Note: If Colab prompts you to restart the session after running the cell below, click \"restart\" and proceed with running the rest of the notebook.
"},{"location":"examples/rag_weaviate/#part-1-docling","title":"\ud83d\udc25 Part 1: Docling\u00b6","text":"Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
"},{"location":"examples/rag_weaviate/#convert-pdfs-to-docling-documents","title":"Convert PDFs to Docling documents\u00b6","text":"Here we use Docling's .convert_all() to parse a batch of PDFs. The result is a list of Docling documents that we can use for text extraction.
Note: Please ignore the ERR# message.
We use Docling's HierarchicalChunker() to perform hierarchy-aware chunking of our list of documents. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.
Transform our data from lists to a list of dictionaries for insertion into our Weaviate collection.
"},{"location":"examples/rag_weaviate/#insert-data-into-weaviate-and-generate-embeddings","title":"Insert data into Weaviate and generate embeddings\u00b6","text":"Embeddings will be generated upon insertion to our Weaviate collection.
"},{"location":"examples/rag_weaviate/#query-the-data","title":"Query the data\u00b6","text":"Here, we perform a simple similarity search to return the most similar embedded chunks to our search query.
"},{"location":"examples/rag_weaviate/#perform-rag-on-parsed-articles","title":"Perform RAG on parsed articles\u00b6","text":"Weaviate's generate module allows you to perform RAG over your embedded data without having to use a separate framework.
We specify a prompt that includes the field we want to search through in the database (in this case it's text), a query that includes our search term, and the number of retrieved results to use in the generation.
Use RapidOCR with custom ONNX models to OCR a PDF page and print Markdown.
What this example does
RapidOcrOptions with explicit det/rec/cls model paths.Prerequisites
modelscope, and have network access to download models.docling and modelscope.How to run
python docs/examples/rapidocr_with_custom_models.py.Notes
source points to an arXiv PDF URL; replace with a local path if desired.~/.cache/modelscope); set a proxy or pre-download models if running in a restricted network environment.import os\n\nfrom modelscope import snapshot_download\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n\ndef main():\n # Source document to convert\n source = \"https://arxiv.org/pdf/2408.09869v4\"\n\n # Download RapidOCR models from Hugging Face\n print(\"Downloading RapidOCR models\")\n download_path = snapshot_download(repo_id=\"RapidAI/RapidOCR\")\n\n # Setup RapidOcrOptions for English detection\n det_model_path = os.path.join(\n download_path, \"onnx\", \"PP-OCRv5\", \"det\", \"ch_PP-OCRv5_server_det.onnx\"\n )\n rec_model_path = os.path.join(\n download_path, \"onnx\", \"PP-OCRv5\", \"rec\", \"ch_PP-OCRv5_rec_server_infer.onnx\"\n )\n cls_model_path = os.path.join(\n download_path, \"onnx\", \"PP-OCRv4\", \"cls\", \"ch_ppocr_mobile_v2.0_cls_infer.onnx\"\n )\n ocr_options = RapidOcrOptions(\n det_model_path=det_model_path,\n rec_model_path=rec_model_path,\n cls_model_path=cls_model_path,\n )\n\n pipeline_options = PdfPipelineOptions(\n ocr_options=ocr_options,\n )\n\n # Convert the document\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n ),\n },\n )\n\n conversion_result: ConversionResult = converter.convert(source=source)\n doc = conversion_result.document\n md = doc.export_to_markdown()\n print(md)\n\n\nif __name__ == \"__main__\":\n main()\n import os from modelscope import snapshot_download from docling.datamodel.base_models import InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions from docling.document_converter import DocumentConverter, PdfFormatOption def main(): # Source document to convert source = \"https://arxiv.org/pdf/2408.09869v4\" # Download RapidOCR models from Hugging Face print(\"Downloading RapidOCR models\") download_path = snapshot_download(repo_id=\"RapidAI/RapidOCR\") # Setup RapidOcrOptions for English detection det_model_path = os.path.join( download_path, \"onnx\", \"PP-OCRv5\", \"det\", \"ch_PP-OCRv5_server_det.onnx\" ) rec_model_path = os.path.join( download_path, \"onnx\", \"PP-OCRv5\", \"rec\", \"ch_PP-OCRv5_rec_server_infer.onnx\" ) cls_model_path = os.path.join( download_path, \"onnx\", \"PP-OCRv4\", \"cls\", \"ch_ppocr_mobile_v2.0_cls_infer.onnx\" ) ocr_options = RapidOcrOptions( det_model_path=det_model_path, rec_model_path=rec_model_path, cls_model_path=cls_model_path, ) pipeline_options = PdfPipelineOptions( ocr_options=ocr_options, ) # Convert the document converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ), }, ) conversion_result: ConversionResult = converter.convert(source=source) doc = conversion_result.document md = doc.export_to_markdown() print(md) if __name__ == \"__main__\": main()"},{"location":"examples/retrieval_qdrant/","title":"Retrieval with Qdrant","text":"Step Tech Execution Embedding FastEmbed \ud83d\udcbb Local Vector store Qdrant \ud83d\udcbb Local This example demonstrates using Docling with Qdrant to perform a hybrid search across your documents using dense and sparse vectors.
We'll chunk the documents using Docling before adding them to a Qdrant collection. By limiting the length of the chunks, we can preserve the meaning in each vector embedding.
fastembed-gpu package if you've got the hardware to support it.%pip install --no-warn-conflicts -q qdrant-client docling fastembed\n%pip install --no-warn-conflicts -q qdrant-client docling fastembed
Note: you may need to restart the kernel to use updated packages.\n
Let's import all the classes we'll be working with.
In\u00a0[2]: Copied!from qdrant_client import QdrantClient\n\nfrom docling.chunking import HybridChunker\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter\nfrom qdrant_client import QdrantClient from docling.chunking import HybridChunker from docling.datamodel.base_models import InputFormat from docling.document_converter import DocumentConverter
COLLECTION_NAME = \"docling\"\n\ndoc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML])\nclient = QdrantClient(location=\":memory:\")\n# The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI.\n# For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant\n# client = QdrantClient(location=\"http://localhost:6333\")\n\nclient.set_model(\"sentence-transformers/all-MiniLM-L6-v2\")\nclient.set_sparse_model(\"Qdrant/bm25\")\nCOLLECTION_NAME = \"docling\" doc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML]) client = QdrantClient(location=\":memory:\") # The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI. # For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant # client = QdrantClient(location=\"http://localhost:6333\") client.set_model(\"sentence-transformers/all-MiniLM-L6-v2\") client.set_sparse_model(\"Qdrant/bm25\")
/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n warnings.warn(\n
We can now download and chunk the document using Docling. For demonstration, we'll use an article about chunking strategies :)
In\u00a0[4]: Copied!result = doc_converter.convert(\n \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\"\n)\ndocuments, metadatas = [], []\nfor chunk in HybridChunker().chunk(result.document):\n documents.append(chunk.text)\n metadatas.append(chunk.meta.export_json_dict())\nresult = doc_converter.convert( \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\" ) documents, metadatas = [], [] for chunk in HybridChunker().chunk(result.document): documents.append(chunk.text) metadatas.append(chunk.meta.export_json_dict())
Let's now upload the documents to Qdrant.
add() method batches the documents and uses FastEmbed to generate vector embeddings on our machine._ = client.add(\n collection_name=COLLECTION_NAME,\n documents=documents,\n metadata=metadatas,\n batch_size=64,\n)\n_ = client.add( collection_name=COLLECTION_NAME, documents=documents, metadata=metadatas, batch_size=64, ) In\u00a0[6]: Copied!
points = client.query(\n collection_name=COLLECTION_NAME,\n query_text=\"Can I split documents?\",\n limit=10,\n)\npoints = client.query( collection_name=COLLECTION_NAME, query_text=\"Can I split documents?\", limit=10, ) In\u00a0[7]: Copied!
for i, point in enumerate(points):\n print(f\"=== {i} ===\")\n print(point.document)\n print()\n for i, point in enumerate(points): print(f\"=== {i} ===\") print(point.document) print() === 0 ===\nHave you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:\n1. We start at the top of the document, treating the first part as a chunk.\n\u00a0\u00a0\u00a02. We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.\n \u00a0\u00a0\u00a03. We keep this up until we reach the end of the document.\nThe ultimate dream? Having an agent do this for you. But slow down! This approach is still being tested and isn't quite ready for the big leagues due to the time it takes to process multiple LLM calls and the cost of those calls. There's no implementation available in public libraries just yet. However, Greg Kamradt has his version available here.\n\n=== 1 ===\nDocument Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.\nDocument Specific Chunking can handle a variety of document formats, such as:\nMarkdown\nHTML\nPython\netc\nHere we\u2019ll take Markdown as our example and use a modified version of our first sample text:\n\u200d\nThe result is the following:\nYou can see here that with a chunk size of 105, the Markdown structure of the document is taken into account, and the chunks thus preserve the semantics of the text!\n\n=== 2 ===\nAnd there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.\nTo put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.\nBy the way, if you're eager to experiment with your own examples using the chunking visualisation tool featured in this blog, feel free to give it a try! You can access it right here. Enjoy, and happy chunking! \ud83d\ude09\n\n=== 3 ===\nRetrieval Augmented Generation (RAG) has been a hot topic in understanding, interpreting, and generating text with AI for the last few months. 
It's like a wonderful union of retrieval-based and generative models, creating a playground for researchers, data scientists, and natural language processing enthusiasts, like you and me.\nTo truly control the results produced by our RAG, we need to understand chunking strategies and their role in the process of retrieving and generating text. Indeed, each chunking strategy enhances RAG's effectiveness in its unique way.\nThe goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.\nLet's explore some chunking strategies together.\nThe methods mentioned in the article you're about to read usually make use of two key parameters. First, we have [chunk_size]\u2014 which controls the size of your text chunks. Then there's [chunk_overlap], which takes care of how much text overlaps between one chunk and the next.\n\n=== 4 ===\nSemantic Chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach ensures the information's integrity during retrieval, leading to a more accurate and contextually appropriate outcome.\nSemantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.\nBy focusing on the text's meaning and context, Semantic Chunking significantly enhances the quality of retrieval. It's a top-notch choice when maintaining the semantic integrity of the text is vital.\nHowever, this method does require more effort and is notably slower than the previous ones.\nOn our example text, since it is quite short and does not expose varied subjects, this method would only generate a single chunk.\n\n=== 5 ===\nLanguage models used in the rest of your possible RAG pipeline have a token limit, which should not be exceeded. When dividing your text into chunks, it's advisable to count the number of tokens. Plenty of tokenizers are available. To ensure accuracy, use the same tokenizer for counting tokens as the one used in the language model.\nConsequently, there are also splitters available for this purpose.\nFor instance, by using the [SpacyTextSplitter] from LangChain, the following chunks are created:\n\u200d\n\n=== 6 ===\nFirst things first, we have Character Chunking. This strategy divides the text into chunks based on a fixed number of characters. Its simplicity makes it a great starting point, but it can sometimes disrupt the text's flow, breaking sentences or words in unexpected places. Despite its limitations, it's a great stepping stone towards more advanced methods.\nNow let\u2019s see that in action with an example. Imagine a text that reads:\nIf we decide to set our chunk size to 100 and no chunk overlap, we'd end up with the following chunks. 
As you can see, Character Chunking can lead to some intriguing, albeit sometimes nonsensical, results, cutting some of the sentences in their middle.\nBy choosing a smaller chunk size, \u00a0we would obtain more chunks, and by setting a bigger chunk overlap, we could obtain something like this:\n\u200d\nAlso, by default this method creates chunks character by character based on the empty character [\u2019 \u2019]. But you can specify a different one in order to chunk on something else, even a complete word! For instance, by specifying [' '] as the separator, you can avoid cutting words in their middle.\n\n=== 7 ===\nNext, let's take a look at Recursive Character Chunking. Based on the basic concept of Character Chunking, this advanced version takes it up a notch by dividing the text into chunks until a certain condition is met, such as reaching a minimum chunk size. This method ensures that the chunking process aligns with the text's structure, preserving more meaning. Its adaptability makes Recursive Character Chunking great for texts with varied structures.\nAgain, let\u2019s use the same example in order to illustrate this method. With a chunk size of 100, and the default settings for the other parameters, we obtain the following chunks:\n\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/retrieval_qdrant/#retrieval-with-qdrant","title":"Retrieval with Qdrant\u00b6","text":""},{"location":"examples/retrieval_qdrant/#overview","title":"Overview\u00b6","text":""},{"location":"examples/retrieval_qdrant/#setup","title":"Setup\u00b6","text":""},{"location":"examples/retrieval_qdrant/#retrieval","title":"Retrieval\u00b6","text":""},{"location":"examples/run_md/","title":"Run md","text":"In\u00a0[\u00a0]: Copied!
import json\nimport logging\nimport os\nfrom pathlib import Path\nimport json import logging import os from pathlib import Path In\u00a0[\u00a0]: Copied!
import yaml\nimport yaml In\u00a0[\u00a0]: Copied!
from docling.backend.md_backend import MarkdownDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import InputDocument\nfrom docling.backend.md_backend import MarkdownDocumentBackend from docling.datamodel.base_models import InputFormat from docling.datamodel.document import InputDocument In\u00a0[\u00a0]: Copied!
_log = logging.getLogger(__name__)\n_log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied!
def main():\n input_paths = [Path(\"README.md\")]\n\n for path in input_paths:\n in_doc = InputDocument(\n path_or_stream=path,\n format=InputFormat.PDF,\n backend=MarkdownDocumentBackend,\n )\n mdb = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=path)\n document = mdb.convert()\n\n out_path = Path(\"scratch\")\n print(f\"Document {path} converted.\\nSaved markdown output to: {out_path!s}\")\n\n # Export Docling document format to markdowndoc:\n fn = os.path.basename(path)\n\n with (out_path / f\"{fn}.md\").open(\"w\") as fp:\n fp.write(document.export_to_markdown())\n\n with (out_path / f\"{fn}.json\").open(\"w\") as fp:\n fp.write(json.dumps(document.export_to_dict()))\n\n with (out_path / f\"{fn}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(document.export_to_dict()))\n def main(): input_paths = [Path(\"README.md\")] for path in input_paths: in_doc = InputDocument( path_or_stream=path, format=InputFormat.PDF, backend=MarkdownDocumentBackend, ) mdb = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=path) document = mdb.convert() out_path = Path(\"scratch\") print(f\"Document {path} converted.\\nSaved markdown output to: {out_path!s}\") # Export Docling document format to markdowndoc: fn = os.path.basename(path) with (out_path / f\"{fn}.md\").open(\"w\") as fp: fp.write(document.export_to_markdown()) with (out_path / f\"{fn}.json\").open(\"w\") as fp: fp.write(json.dumps(document.export_to_dict())) with (out_path / f\"{fn}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(document.export_to_dict())) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/run_with_accelerator/","title":"Accelerator options","text":"
Run conversion with an explicit accelerator configuration (CPU/MPS/CUDA).
What this example does
How to run
python docs/examples/run_with_accelerator.py.AcceleratorOptions examples to try AUTO/MPS/CUDA.Notes
cuda:N device selection (defaults to cuda:0).settings.debug.profile_pipeline_timings = True prints profiling details.AcceleratorDevice.MPS is macOS-only; CUDA requires a compatible GPU and CUDA-enabled PyTorch build. CPU mode works everywhere.from pathlib import Path\n\nfrom docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n)\nfrom docling.datamodel.settings import settings\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n\ndef main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n # Explicitly set the accelerator\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.AUTO\n # )\n accelerator_options = AcceleratorOptions(\n num_threads=8, device=AcceleratorDevice.CPU\n )\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.MPS\n # )\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.CUDA\n # )\n\n # easyocr doesnt support cuda:N allocation, defaults to cuda:0\n # accelerator_options = AcceleratorOptions(num_threads=8, device=\"cuda:1\")\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.accelerator_options = accelerator_options\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n # Enable the profiling to measure the time spent\n settings.debug.profile_pipeline_timings = True\n\n # Convert the document\n conversion_result = converter.convert(input_doc_path)\n doc = conversion_result.document\n\n # List with total time per document\n doc_conversion_secs = conversion_result.timings[\"pipeline_total\"].times\n\n md = doc.export_to_markdown()\n print(md)\n print(f\"Conversion secs: {doc_conversion_secs}\")\n\n\nif __name__ == \"__main__\":\n main()\n from pathlib import Path from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, ) from docling.datamodel.settings import settings from docling.document_converter import DocumentConverter, PdfFormatOption def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" # Explicitly set the accelerator # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.AUTO # ) accelerator_options = AcceleratorOptions( num_threads=8, device=AcceleratorDevice.CPU ) # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.MPS # ) # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.CUDA # ) # easyocr doesnt support cuda:N allocation, defaults to cuda:0 # accelerator_options = AcceleratorOptions(num_threads=8, device=\"cuda:1\") pipeline_options = PdfPipelineOptions() pipeline_options.accelerator_options = accelerator_options pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( 
pipeline_options=pipeline_options, ) } ) # Enable the profiling to measure the time spent settings.debug.profile_pipeline_timings = True # Convert the document conversion_result = converter.convert(input_doc_path) doc = conversion_result.document # List with total time per document doc_conversion_secs = conversion_result.timings[\"pipeline_total\"].times md = doc.export_to_markdown() print(md) print(f\"Conversion secs: {doc_conversion_secs}\") if __name__ == \"__main__\": main()"},{"location":"examples/run_with_formats/","title":"Multi-format conversion","text":"Run conversion across multiple input formats and customize handling per type.
What this example does
allowed_formats and override format_options per format.scratch/.Prerequisites
docling from your Python environment.PyYAML (pip install pyyaml).How to run
python docs/examples/run_with_formats.py.scratch/ next to where you run the script.scratch/ does not exist, create it before running.Customizing inputs
input_paths to include or remove files on your machine.allowed_formats).Notes
allowed_formats: explicit whitelist of formats that will be processed.format_options: per-format pipeline/backend overrides. Everything is optional; defaults exist.<stem>.md, <stem>.json, and <stem>.yaml in scratch/.import json\nimport logging\nfrom pathlib import Path\n\nimport yaml\n\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import (\n DocumentConverter,\n PdfFormatOption,\n WordFormatOption,\n)\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n\n_log = logging.getLogger(__name__)\n\n\ndef main():\n input_paths = [\n Path(\"README.md\"),\n Path(\"tests/data/html/wiki_duck.html\"),\n Path(\"tests/data/docx/word_sample.docx\"),\n Path(\"tests/data/docx/lorem_ipsum.docx\"),\n Path(\"tests/data/pptx/powerpoint_sample.pptx\"),\n Path(\"tests/data/2305.03393v1-pg9-img.png\"),\n Path(\"tests/data/pdf/2206.01062.pdf\"),\n Path(\"tests/data/asciidoc/test_01.asciidoc\"),\n ]\n\n ## for defaults use:\n # doc_converter = DocumentConverter()\n\n ## to customize use:\n\n # Below we explicitly whitelist formats and override behavior for some of them.\n # You can omit this block and use the defaults (see above) for a quick start.\n doc_converter = DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[\n InputFormat.PDF,\n InputFormat.IMAGE,\n InputFormat.DOCX,\n InputFormat.HTML,\n InputFormat.PPTX,\n InputFormat.ASCIIDOC,\n InputFormat.CSV,\n InputFormat.MD,\n ], # whitelist formats, non-matching files are ignored.\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend\n ),\n InputFormat.DOCX: WordFormatOption(\n pipeline_cls=SimplePipeline # or set a backend, e.g., MsWordDocumentBackend\n # If you change the backend, remember to import it, e.g.:\n # from docling.backend.msword_backend import MsWordDocumentBackend\n ),\n },\n )\n\n conv_results = doc_converter.convert_all(input_paths)\n\n for res in conv_results:\n out_path = Path(\"scratch\") # ensure this directory exists before running\n print(\n f\"Document {res.input.file.name} converted.\"\n f\"\\nSaved markdown output to: {out_path!s}\"\n )\n _log.debug(res.document._export_to_indented_text(max_text_len=16))\n # Export Docling document to Markdown:\n with (out_path / f\"{res.input.file.stem}.md\").open(\"w\") as fp:\n fp.write(res.document.export_to_markdown())\n\n with (out_path / f\"{res.input.file.stem}.json\").open(\"w\") as fp:\n fp.write(json.dumps(res.document.export_to_dict()))\n\n with (out_path / f\"{res.input.file.stem}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(res.document.export_to_dict()))\n\n\nif __name__ == \"__main__\":\n main()\n import json import logging from pathlib import Path import yaml from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend from docling.datamodel.base_models import InputFormat from docling.document_converter import ( DocumentConverter, PdfFormatOption, WordFormatOption, ) from docling.pipeline.simple_pipeline import SimplePipeline from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline _log = logging.getLogger(__name__) def main(): input_paths = [ Path(\"README.md\"), Path(\"tests/data/html/wiki_duck.html\"), Path(\"tests/data/docx/word_sample.docx\"), Path(\"tests/data/docx/lorem_ipsum.docx\"), Path(\"tests/data/pptx/powerpoint_sample.pptx\"), 
Path(\"tests/data/2305.03393v1-pg9-img.png\"), Path(\"tests/data/pdf/2206.01062.pdf\"), Path(\"tests/data/asciidoc/test_01.asciidoc\"), ] ## for defaults use: # doc_converter = DocumentConverter() ## to customize use: # Below we explicitly whitelist formats and override behavior for some of them. # You can omit this block and use the defaults (see above) for a quick start. doc_converter = DocumentConverter( # all of the below is optional, has internal defaults. allowed_formats=[ InputFormat.PDF, InputFormat.IMAGE, InputFormat.DOCX, InputFormat.HTML, InputFormat.PPTX, InputFormat.ASCIIDOC, InputFormat.CSV, InputFormat.MD, ], # whitelist formats, non-matching files are ignored. format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend ), InputFormat.DOCX: WordFormatOption( pipeline_cls=SimplePipeline # or set a backend, e.g., MsWordDocumentBackend # If you change the backend, remember to import it, e.g.: # from docling.backend.msword_backend import MsWordDocumentBackend ), }, ) conv_results = doc_converter.convert_all(input_paths) for res in conv_results: out_path = Path(\"scratch\") # ensure this directory exists before running print( f\"Document {res.input.file.name} converted.\" f\"\\nSaved markdown output to: {out_path!s}\" ) _log.debug(res.document._export_to_indented_text(max_text_len=16)) # Export Docling document to Markdown: with (out_path / f\"{res.input.file.stem}.md\").open(\"w\") as fp: fp.write(res.document.export_to_markdown()) with (out_path / f\"{res.input.file.stem}.json\").open(\"w\") as fp: fp.write(json.dumps(res.document.export_to_dict())) with (out_path / f\"{res.input.file.stem}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(res.document.export_to_dict())) if __name__ == \"__main__\": main()"},{"location":"examples/serialization/","title":"Serialization","text":"In this notebook we showcase the usage of Docling serializers.
In\u00a0[1]: Copied!%pip install -qU pip docling docling-core~=2.29 rich\n%pip install -qU pip docling docling-core~=2.29 rich
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
DOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\"\n\n# we set some start-stop cues for defining an excerpt to print\nstart_cue = \"Copyright \u00a9 2024\"\nstop_cue = \"Application of NLP to ESG\"\nDOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\" # we set some start-stop cues for defining an excerpt to print start_cue = \"Copyright \u00a9 2024\" stop_cue = \"Application of NLP to ESG\" In\u00a0[3]: Copied!
from rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console(width=210) # for preventing Markdown table wrapped rendering\n\n\ndef print_in_console(text):\n console.print(Panel(text))\nfrom rich.console import Console from rich.panel import Panel console = Console(width=210) # for preventing Markdown table wrapped rendering def print_in_console(text): console.print(Panel(text))
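The start_cue and stop_cue defined earlier are used throughout this notebook to slice an excerpt out of the serialized text via str.find. As a small aside (not part of the original notebook), a helper like the following makes that slicing robust when one of the cues is missing, since str.find returns -1 in that case:
def excerpt(text: str, start: str = start_cue, stop: str = stop_cue) -> str:\n    # Hypothetical helper: fall back to the full text if either cue is not found.\n    i, j = text.find(start), text.find(stop)\n    if i == -1 or j == -1:\n        return text\n    return text[i:j]\n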
We first convert the document:
In\u00a0[4]: Copied!from docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\ndoc = converter.convert(source=DOC_SOURCE).document\nfrom docling.document_converter import DocumentConverter converter = DocumentConverter() doc = converter.convert(source=DOC_SOURCE).document
/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n warnings.warn(warn_msg)\n
We can now apply any BaseDocSerializer on the produced document.
\ud83d\udc49 Note that, to keep the shown output brief, we only print an excerpt.
E.g. below we apply an HTMLDocSerializer:
from docling_core.transforms.serializer.html import HTMLDocSerializer\n\nserializer = HTMLDocSerializer(doc=doc)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\n# we here only print an excerpt to keep the output brief:\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.serializer.html import HTMLDocSerializer serializer = HTMLDocSerializer(doc=doc) ser_result = serializer.serialize() ser_text = ser_result.text # we here only print an excerpt to keep the output brief: print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p> \u2502\n\u2502 <table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM \u2502\n\u2502 2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in \u2502\n\u2502 India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con- \u2502\n\u2502 sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate- \u2502\n\u2502 rials?</td><td>According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.</td></tr></tbody></table> \u2502\n\u2502 <p>Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.</p> \u2502\n\u2502 <p>ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response.</p> \u2502\n\u2502 <h2>Related Work</h2> \u2502\n\u2502 <p>The DocQA integrates multiple AI technologies, namely:</p> \u2502\n\u2502 <p>Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric \u2502\n\u2502 layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et \u2502\n\u2502 al. 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-</p> \u2502\n\u2502 <figure><figcaption>Figure 1: System architecture: Simplified sketch of document question-answering pipeline.</figcaption></figure> \u2502\n\u2502 <p>based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).</p> \u2502\n\u2502 <p> \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
In the following example, we use a MarkdownDocSerializer:
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\n\nserializer = MarkdownDocSerializer(doc=doc)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.serializer.markdown import MarkdownDocSerializer serializer = MarkdownDocSerializer(doc=doc) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 | Report | Question | Answer | \u2502\n\u2502 |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| \u2502\n\u2502 | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | \u2502\n\u2502 | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | \u2502\n\u2502 | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | \u2502\n\u2502 | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | \u2502\n\u2502 | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | \u2502\n\u2502 | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 \u2502\n\u2502 <!-- image --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
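As a side note (not part of the original notebook): with default parameters, the convenience method doc.export_to_markdown() is expected to give essentially the same result as the MarkdownDocSerializer used above, so the serializer route becomes interesting mainly once you start customizing it, as in the next section:
# Hypothetical cross-check of the convenience export against the serializer output:\nshortcut_md = doc.export_to_markdown()\nprint_in_console(shortcut_md[shortcut_md.find(start_cue) : shortcut_md.find(stop_cue)])\n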
Let's now assume we would like to reconfigure the Markdown serialization such that: tables are serialized as triplets instead of Markdown tables, and pictures use a custom placeholder text instead of the default <!-- image -->.
Check out the following configuration and notice the serialization differences in the output further below:
In\u00a0[7]: Copied!from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer\nfrom docling_core.transforms.serializer.markdown import MarkdownParams\n\nserializer = MarkdownDocSerializer(\n doc=doc,\n table_serializer=TripletTableSerializer(),\n params=MarkdownParams(\n image_placeholder=\"<!-- demo picture placeholder -->\",\n # ...\n ),\n)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nfrom docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer from docling_core.transforms.serializer.markdown import MarkdownParams serializer = MarkdownDocSerializer( doc=doc, table_serializer=TripletTableSerializer(), params=MarkdownParams( image_placeholder=\"\", # ... ), ) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The \u2502\n\u2502 rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the \u2502\n\u2502 Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption \u2502\n\u2502 in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were \u2502\n\u2502 made from recycled or renewable materials in FY22. \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . 
\u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 \u2502\n\u2502 <!-- demo picture placeholder --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n
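Note that the table serializer and the Markdown parameters are independent knobs: a minimal variation (not from the original notebook) keeps the default Markdown table rendering and only swaps the image placeholder:
serializer = MarkdownDocSerializer(\n    doc=doc,\n    # keep the default table serializer; only customize the picture placeholder\n    params=MarkdownParams(image_placeholder=\"<!-- demo picture placeholder -->\"),\n)\nser_text = serializer.serialize().text\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\n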
In the examples above, we were able to reuse existing implementations for our desired serialization strategy. Let's now assume we want to define custom serialization logic, e.g. we would like picture serialization to include any available picture description (captioning) annotations.
To that end, we first need to revisit our conversion and include all pipeline options needed for picture description enrichment.
In\u00a0[8]: Copied!from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionVlmOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions(\n do_picture_description=True,\n picture_description_options=PictureDescriptionVlmOptions(\n repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\",\n prompt=\"Describe this picture in three to five sentences. Be precise and concise.\",\n ),\n generate_picture_images=True,\n images_scale=2,\n)\n\nconverter = DocumentConverter(\n format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\ndoc = converter.convert(source=DOC_SOURCE).document\n from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PictureDescriptionVlmOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption pipeline_options = PdfPipelineOptions( do_picture_description=True, picture_description_options=PictureDescriptionVlmOptions( repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\", prompt=\"Describe this picture in three to five sentences. Be precise and concise.\", ), generate_picture_images=True, images_scale=2, ) converter = DocumentConverter( format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)} ) doc = converter.convert(source=DOC_SOURCE).document /Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n warnings.warn(warn_msg)\n
We can then define our custom picture serializer:
In\u00a0[9]: Copied!from typing import Any, Optional\n\nfrom docling_core.transforms.serializer.base import (\n BaseDocSerializer,\n SerializationResult,\n)\nfrom docling_core.transforms.serializer.common import create_ser_result\nfrom docling_core.transforms.serializer.markdown import (\n MarkdownParams,\n MarkdownPictureSerializer,\n)\nfrom docling_core.types.doc.document import (\n DoclingDocument,\n ImageRefMode,\n PictureDescriptionData,\n PictureItem,\n)\nfrom typing_extensions import override\n\n\nclass AnnotationPictureSerializer(MarkdownPictureSerializer):\n @override\n def serialize(\n self,\n *,\n item: PictureItem,\n doc_serializer: BaseDocSerializer,\n doc: DoclingDocument,\n separator: Optional[str] = None,\n **kwargs: Any,\n ) -> SerializationResult:\n text_parts: list[str] = []\n\n # reusing the existing result:\n parent_res = super().serialize(\n item=item,\n doc_serializer=doc_serializer,\n doc=doc,\n **kwargs,\n )\n text_parts.append(parent_res.text)\n\n # appending annotations:\n for annotation in item.annotations:\n if isinstance(annotation, PictureDescriptionData):\n text_parts.append(f\"<!-- Picture description: {annotation.text} -->\")\n\n text_res = (separator or \"\\n\").join(text_parts)\n return create_ser_result(text=text_res, span_source=item)\n from typing import Any, Optional from docling_core.transforms.serializer.base import ( BaseDocSerializer, SerializationResult, ) from docling_core.transforms.serializer.common import create_ser_result from docling_core.transforms.serializer.markdown import ( MarkdownParams, MarkdownPictureSerializer, ) from docling_core.types.doc.document import ( DoclingDocument, ImageRefMode, PictureDescriptionData, PictureItem, ) from typing_extensions import override class AnnotationPictureSerializer(MarkdownPictureSerializer): @override def serialize( self, *, item: PictureItem, doc_serializer: BaseDocSerializer, doc: DoclingDocument, separator: Optional[str] = None, **kwargs: Any, ) -> SerializationResult: text_parts: list[str] = [] # reusing the existing result: parent_res = super().serialize( item=item, doc_serializer=doc_serializer, doc=doc, **kwargs, ) text_parts.append(parent_res.text) # appending annotations: for annotation in item.annotations: if isinstance(annotation, PictureDescriptionData): text_parts.append(f\"\") text_res = (separator or \"\\n\").join(text_parts) return create_ser_result(text=text_res, span_source=item) Last but not least, we define a new doc serializer which leverages our custom picture serializer.
Notice the picture description annotations in the output below:
In\u00a0[10]: Copied!serializer = MarkdownDocSerializer(\n doc=doc,\n picture_serializer=AnnotationPictureSerializer(),\n params=MarkdownParams(\n image_mode=ImageRefMode.PLACEHOLDER,\n image_placeholder=\"\",\n ),\n)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\nserializer = MarkdownDocSerializer( doc=doc, picture_serializer=AnnotationPictureSerializer(), params=MarkdownParams( image_mode=ImageRefMode.PLACEHOLDER, image_placeholder=\"\", ), ) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 | Report | Question | Answer | \u2502\n\u2502 |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| \u2502\n\u2502 | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | \u2502\n\u2502 | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | \u2502\n\u2502 | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | \u2502\n\u2502 | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | \u2502\n\u2502 | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | \u2502\n\u2502 | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. 
Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 <!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document \u2502\n\u2502 conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response \u2502\n\u2502 generation step involves generating a response from the information retrieval step. --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n"},{"location":"examples/serialization/#serialization","title":"Serialization\u00b6","text":""},{"location":"examples/serialization/#overview","title":"Overview\u00b6","text":""},{"location":"examples/serialization/#setup","title":"Setup\u00b6","text":""},{"location":"examples/serialization/#basic-usage","title":"Basic usage\u00b6","text":""},{"location":"examples/serialization/#configuring-a-serializer","title":"Configuring a serializer\u00b6","text":""},{"location":"examples/serialization/#creating-a-custom-serializer","title":"Creating a custom serializer\u00b6","text":""},{"location":"examples/suryaocr_with_custom_models/","title":"SuryaOCR with custom OCR models","text":"
Example: Integrating SuryaOCR with Docling for PDF OCR and Markdown Export
Overview: convert a PDF with Docling's PDF pipeline using the external SuryaOCR plugin as the OCR engine, then export the result to Markdown.
Prerequisites:
pip install docling-surya; verify that docling imports successfully. Execution:
Run python docs/examples/suryaocr_with_custom_models.py. Notes:
Surya model files are cached under ~/.cache/huggingface; override the location with the HF_HOME env var. The docling-surya package integrates SuryaOCR, which is licensed under the GNU General Public License (GPL). Using this integration may impose GPL obligations on your project. Review the license terms carefully.# Requires \`pip install docling-surya\`\n# See https://pypi.org/project/docling-surya/\nfrom docling_surya import SuryaOcrOptions\n# Requires \`pip install docling-surya\` # See https://pypi.org/project/docling-surya/ from docling_surya import SuryaOcrOptions In\u00a0[\u00a0]: Copied!
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied!
def main():\n source = \"https://19january2021snapshot.epa.gov/sites/static/files/2016-02/documents/epa_sample_letter_sent_to_commissioners_dated_february_29_2015.pdf\"\n\n pipeline_options = PdfPipelineOptions(\n do_ocr=True,\n ocr_model=\"suryaocr\",\n allow_external_plugins=True,\n ocr_options=SuryaOcrOptions(lang=[\"en\"]),\n )\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),\n InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),\n }\n )\n\n result = converter.convert(source)\n print(result.document.export_to_markdown())\n def main(): source = \"https://19january2021snapshot.epa.gov/sites/static/files/2016-02/documents/epa_sample_letter_sent_to_commissioners_dated_february_29_2015.pdf\" pipeline_options = PdfPipelineOptions( do_ocr=True, ocr_model=\"suryaocr\", allow_external_plugins=True, ocr_options=SuryaOcrOptions(lang=[\"en\"]), ) converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options), InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options), } ) result = converter.convert(source) print(result.document.export_to_markdown()) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"examples/tesseract_lang_detection/","title":"Automatic OCR language detection with tesseract","text":"
Detect language automatically with Tesseract OCR and force full-page OCR.
What this example does
Configures the Tesseract OCR engine with lang=[\"auto\"] so the document language is detected automatically, and forces full-page OCR. How to run
Run python docs/examples/tesseract_lang_detection.py. Notes
TesseractOcrOptions instead of TesseractCliOcrOptions.TESSDATA_PREFIX if Tesseract cannot find language data. Using lang=[\"auto\"] requires traineddata that supports script/language detection on your system.from pathlib import Path\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n TesseractCliOcrOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n\ndef main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n # Set lang=[\"auto\"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions\n # ocr_options = TesseractOcrOptions(lang=[\"auto\"])\n ocr_options = TesseractCliOcrOptions(lang=[\"auto\"])\n\n pipeline_options = PdfPipelineOptions(\n do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options\n )\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n doc = converter.convert(input_doc_path).document\n md = doc.export_to_markdown()\n print(md)\n\n\nif __name__ == \"__main__\":\n main()\n from pathlib import Path from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, TesseractCliOcrOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" # Set lang=[\"auto\"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions # ocr_options = TesseractOcrOptions(lang=[\"auto\"]) ocr_options = TesseractCliOcrOptions(lang=[\"auto\"]) pipeline_options = PdfPipelineOptions( do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options ) converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(input_doc_path).document md = doc.export_to_markdown() print(md) if __name__ == \"__main__\": main()"},{"location":"examples/translate/","title":"Simple translation","text":"Translate extracted text content and regenerate Markdown with embedded images.
What this example does: converts a sample PDF, passes every text item and table cell through translate(), and saves Markdown with embedded images for both the original and the translated text.
Prerequisites
Implement your own translation logic in translate(). How to run
Run python docs/examples/translate.py. Outputs are written to scratch/. Notes
translate() is a placeholder; integrate your preferred translation API/client.import logging\nfrom pathlib import Path\n\nfrom docling_core.types.doc import ImageRefMode, TableItem, TextItem\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n_log = logging.getLogger(__name__)\n\nIMAGE_RESOLUTION_SCALE = 2.0\n\n\n# FIXME: put in your favorite translation code ....\ndef translate(text: str, src: str = \"en\", dest: str = \"de\"):\n _log.warning(\"!!! IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!\")\n # from googletrans import Translator\n\n # Initialize the translator\n # translator = Translator()\n\n # Translate text from English to German\n # text = \"Hello, how are you?\"\n # translated = translator.translate(text, src=\"en\", dest=\"de\")\n\n return text\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\") # ensure this directory exists before saving\n\n # Important: For operating with page images, we must keep them, otherwise the DocumentConverter\n # will destroy them for cleaning up memory.\n # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.\n # scale=1 correspond of a standard 72 DPI image\n # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched\n # with the image field\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n conv_res = doc_converter.convert(input_doc_path)\n conv_doc = conv_res.document\n doc_filename = conv_res.input.file.name\n\n # Save markdown with embedded pictures in original text\n # Tip: create the `scratch/` folder first or adjust `output_dir`.\n md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TextItem):\n element.orig = element.text\n element.text = translate(text=element.text)\n\n elif isinstance(element, TableItem):\n for cell in element.data.table_cells:\n cell.text = translate(text=cell.text)\n\n # Save markdown with embedded pictures in translated text\n md_filename = output_dir / f\"{doc_filename}-with-images-translated.md\"\n conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n\nif __name__ == \"__main__\":\n main()\n import logging from pathlib import Path from docling_core.types.doc import ImageRefMode, TableItem, TextItem from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption _log = logging.getLogger(__name__) IMAGE_RESOLUTION_SCALE = 2.0 # FIXME: put in your favorite translation code .... def translate(text: str, src: str = \"en\", dest: str = \"de\"): _log.warning(\"!!! 
IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!\") # from googletrans import Translator # Initialize the translator # translator = Translator() # Translate text from English to German # text = \"Hello, how are you?\" # translated = translator.translate(text, src=\"en\", dest=\"de\") return text def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # ensure this directory exists before saving # Important: For operating with page images, we must keep them, otherwise the DocumentConverter # will destroy them for cleaning up memory. # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images. # scale=1 correspond of a standard 72 DPI image # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched # with the image field pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) conv_res = doc_converter.convert(input_doc_path) conv_doc = conv_res.document doc_filename = conv_res.input.file.name # Save markdown with embedded pictures in original text # Tip: create the `scratch/` folder first or adjust `output_dir`. md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) for element, _level in conv_res.document.iterate_items(): if isinstance(element, TextItem): element.orig = element.text element.text = translate(text=element.text) elif isinstance(element, TableItem): for cell in element.data.table_cells: cell.text = translate(text=cell.text) # Save markdown with embedded pictures in translated text md_filename = output_dir / f\"{doc_filename}-with-images-translated.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) if __name__ == \"__main__\": main()"},{"location":"examples/visual_grounding/","title":"Visual grounding","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote This example showcases Docling's visual grounding capabilities, which can be combined with any agentic AI / RAG framework.
In this instance, we illustrate these capabilities using the LangChain Docling integration, together with a Milvus vector store and sentence-transformers embeddings.
The generation step uses the Hugging Face Inference API; provide your access token via the environment variable HF_TOKEN. The requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain matplotlib python-dotenv\n%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain matplotlib python-dotenv
Note: you may need to restart the kernel to use updated packages.\nIn\u00a0[2]: Copied!
import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom dotenv import load_dotenv\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_docling.loader import ExportType\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nSOURCES = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nQUESTION = \"Which are the main AI models in Docling?\"\nPROMPT = PromptTemplate.from_template(\n \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n)\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n import os from pathlib import Path from tempfile import mkdtemp from dotenv import load_dotenv from langchain_core.prompts import PromptTemplate from langchain_docling.loader import ExportType def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") SOURCES = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" QUESTION = \"Which are the main AI models in Docling?\" PROMPT = PromptTemplate.from_template( \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\", ) TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied! from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=PdfPipelineOptions(\n generate_page_images=True,\n images_scale=2.0,\n ),\n )\n }\n)\n from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=PdfPipelineOptions( generate_page_images=True, images_scale=2.0, ), ) } ) We set up a simple doc store for keeping converted documents, as that is needed for visual grounding further below.
In\u00a0[4]: Copied!doc_store = {}\ndoc_store_root = Path(mkdtemp())\nfor source in SOURCES:\n dl_doc = converter.convert(source=source).document\n file_path = Path(doc_store_root / f\"{dl_doc.origin.binary_hash}.json\")\n dl_doc.save_as_json(file_path)\n doc_store[dl_doc.origin.binary_hash] = file_path\n doc_store = {} doc_store_root = Path(mkdtemp()) for source in SOURCES: dl_doc = converter.convert(source=source).document file_path = Path(doc_store_root / f\"{dl_doc.origin.binary_hash}.json\") dl_doc.save_as_json(file_path) doc_store[dl_doc.origin.binary_hash] = file_path Now we can instantiate our loader and load documents.
In\u00a0[5]: Copied!from langchain_docling import DoclingLoader\n\nfrom docling.chunking import HybridChunker\n\nloader = DoclingLoader(\n file_path=SOURCES,\n converter=converter,\n export_type=ExportType.DOC_CHUNKS,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n)\n\ndocs = loader.load()\nfrom langchain_docling import DoclingLoader from docling.chunking import HybridChunker loader = DoclingLoader( file_path=SOURCES, converter=converter, export_type=ExportType.DOC_CHUNKS, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ) docs = loader.load()
Token indices sequence length is longer than the specified maximum sequence length for this model (648 > 512). Running this sequence through the model will result in indexing errors\n
\ud83d\udc49 NOTE: As you see above, using the HybridChunker can sometimes lead to a warning from the transformers library; however, this is a \"false alarm\": for details, check the Docling FAQ.
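If you want to convince yourself that the warning is indeed harmless, a rough sanity check (not part of the original notebook, and assuming the transformers tokenizer for EMBED_MODEL_ID) is to count tokens per produced chunk and compare against the model limit; note that the text actually embedded may differ slightly from page_content:
from transformers import AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(EMBED_MODEL_ID)\n# Longest chunk produced by the loader, measured with the embedding model's tokenizer:\nlongest = max(len(tokenizer.tokenize(d.page_content)) for d in docs)\nprint(f\"longest chunk: {longest} tokens (model_max_length: {tokenizer.model_max_length})\")\n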
Inspecting some sample splits:
In\u00a0[6]: Copied!for d in docs[:3]:\n print(f\"- {d.page_content=}\")\nprint(\"...\")\n for d in docs[:3]: print(f\"- {d.page_content=}\") print(\"...\") - d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R\u00a8 uschlikon, Switzerland'\n- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n- d.page_content='1 Introduction\\nConverting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and printing-optimized characteristic, which discards most structural features and metadata. With the advent of LLMs and popular application patterns such as retrieval-augmented generation (RAG), leveraging the rich content embedded in PDFs has become ever more relevant. In the past decade, several powerful document understanding solutions have emerged on the market, most of which are commercial software, cloud offerings [3] and most recently, multi-modal vision-language models. As of today, only a handful of open-source tools cover PDF conversion, leaving a significant feature and quality gap to proprietary solutions.\\nWith Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.\\nHere is what Docling delivers today:\\n\u00b7 Converts PDF documents to JSON or Markdown format, stable and lightning fast\\n\u00b7 Understands detailed page layout, reading order, locates figures and recovers table structures\\n\u00b7 Extracts metadata from the document, such as title, authors, references and language\\n\u00b7 Optionally applies OCR, e.g. for scanned PDFs\\n\u00b7 Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)\\n\u00b7 Can leverage different accelerators (GPU, MPS, etc).'\n...\nIn\u00a0[7]: Copied!
import json\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom langchain_huggingface.embeddings import HuggingFaceEmbeddings\nfrom langchain_milvus import Milvus\n\nembedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n\n\nmilvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\nvectorstore = Milvus.from_documents(\n documents=docs,\n embedding=embedding,\n collection_name=\"docling_demo\",\n connection_args={\"uri\": milvus_uri},\n index_params={\"index_type\": \"FLAT\"},\n drop_old=True,\n)\n import json from pathlib import Path from tempfile import mkdtemp from langchain_huggingface.embeddings import HuggingFaceEmbeddings from langchain_milvus import Milvus embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID) milvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed vectorstore = Milvus.from_documents( documents=docs, embedding=embedding, collection_name=\"docling_demo\", connection_args={\"uri\": milvus_uri}, index_params={\"index_type\": \"FLAT\"}, drop_old=True, ) In\u00a0[8]: Copied! from langchain.chains import create_retrieval_chain\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain_huggingface import HuggingFaceEndpoint\n\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\nllm = HuggingFaceEndpoint(\n repo_id=GEN_MODEL_ID,\n huggingfacehub_api_token=HF_TOKEN,\n)\n\n\ndef clip_text(text, threshold=100):\n return f\"{text[:threshold]}...\" if len(text) > threshold else text\n from langchain.chains import create_retrieval_chain from langchain.chains.combine_documents import create_stuff_documents_chain from langchain_huggingface import HuggingFaceEndpoint retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K}) llm = HuggingFaceEndpoint( repo_id=GEN_MODEL_ID, huggingfacehub_api_token=HF_TOKEN, ) def clip_text(text, threshold=100): return f\"{text[:threshold]}...\" if len(text) > threshold else text Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\nIn\u00a0[9]: Copied!
from docling.chunking import DocMeta\nfrom docling.datamodel.document import DoclingDocument\n\nquestion_answer_chain = create_stuff_documents_chain(llm, PROMPT)\nrag_chain = create_retrieval_chain(retriever, question_answer_chain)\nresp_dict = rag_chain.invoke({\"input\": QUESTION})\n\nclipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\nprint(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\n from docling.chunking import DocMeta from docling.datamodel.document import DoclingDocument question_answer_chain = create_stuff_documents_chain(llm, PROMPT) rag_chain = create_retrieval_chain(retriever, question_answer_chain) resp_dict = rag_chain.invoke({\"input\": QUESTION}) clipped_answer = clip_text(resp_dict[\"answer\"], threshold=200) print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\") /Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_deprecation.py:131: FutureWarning: 'post' (from 'huggingface_hub.inference._client') is deprecated and will be removed from version '0.31.0'. Making direct POST requests to the inference server is not supported anymore. Please use task methods instead (e.g. `InferenceClient.chat_completion`). If your use case is not supported, please open an issue in https://github.com/huggingface/huggingface_hub.\n warnings.warn(warning_message, FutureWarning)\n
Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nThe main AI models in Docling are:\n1. A layout analysis model, an accurate object-detector for page elements.\n2. TableFormer, a state-of-the-art table structure recognition model.\nIn\u00a0[10]: Copied!
import matplotlib.pyplot as plt\nfrom PIL import ImageDraw\n\nfor i, doc in enumerate(resp_dict[\"context\"][:]):\n image_by_page = {}\n print(f\"Source {i + 1}:\")\n print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n meta = DocMeta.model_validate(doc.metadata[\"dl_meta\"])\n\n # loading the full DoclingDocument from the document store:\n dl_doc = DoclingDocument.load_from_json(doc_store.get(meta.origin.binary_hash))\n\n for doc_item in meta.doc_items:\n if doc_item.prov:\n prov = doc_item.prov[0] # here we only consider the first provenence item\n page_no = prov.page_no\n if img := image_by_page.get(page_no):\n pass\n else:\n page = dl_doc.pages[prov.page_no]\n print(f\" page: {prov.page_no}\")\n img = page.image.pil_image\n image_by_page[page_no] = img\n bbox = prov.bbox.to_top_left_origin(page_height=page.size.height)\n bbox = bbox.normalized(page.size)\n thickness = 2\n padding = thickness + 2\n bbox.l = round(bbox.l * img.width - padding)\n bbox.r = round(bbox.r * img.width + padding)\n bbox.t = round(bbox.t * img.height - padding)\n bbox.b = round(bbox.b * img.height + padding)\n draw = ImageDraw.Draw(img)\n draw.rectangle(\n xy=bbox.as_tuple(),\n outline=\"blue\",\n width=thickness,\n )\n for p in image_by_page:\n img = image_by_page[p]\n plt.figure(figsize=[15, 15])\n plt.imshow(img)\n plt.axis(\"off\")\n plt.show()\n import matplotlib.pyplot as plt from PIL import ImageDraw for i, doc in enumerate(resp_dict[\"context\"][:]): image_by_page = {} print(f\"Source {i + 1}:\") print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\") meta = DocMeta.model_validate(doc.metadata[\"dl_meta\"]) # loading the full DoclingDocument from the document store: dl_doc = DoclingDocument.load_from_json(doc_store.get(meta.origin.binary_hash)) for doc_item in meta.doc_items: if doc_item.prov: prov = doc_item.prov[0] # here we only consider the first provenence item page_no = prov.page_no if img := image_by_page.get(page_no): pass else: page = dl_doc.pages[prov.page_no] print(f\" page: {prov.page_no}\") img = page.image.pil_image image_by_page[page_no] = img bbox = prov.bbox.to_top_left_origin(page_height=page.size.height) bbox = bbox.normalized(page.size) thickness = 2 padding = thickness + 2 bbox.l = round(bbox.l * img.width - padding) bbox.r = round(bbox.r * img.width + padding) bbox.t = round(bbox.t * img.height - padding) bbox.b = round(bbox.b * img.height + padding) draw = ImageDraw.Draw(img) draw.rectangle( xy=bbox.as_tuple(), outline=\"blue\", width=thickness, ) for p in image_by_page: img = image_by_page[p] plt.figure(figsize=[15, 15]) plt.imshow(img) plt.axis(\"off\") plt.show() Source 1:\n text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n page: 3\n
Source 2:\n text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n page: 2\n
Source 3:\n text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n page: 5\nIn\u00a0[\u00a0]: Copied!
\n"},{"location":"examples/visual_grounding/#visual-grounding","title":"Visual grounding\u00b6","text":""},{"location":"examples/visual_grounding/#setup","title":"Setup\u00b6","text":""},{"location":"examples/visual_grounding/#document-store-setup","title":"Document store setup\u00b6","text":""},{"location":"examples/visual_grounding/#document-loading","title":"Document loading\u00b6","text":"
We first define our converter, in this case including options for keeping page images (for visual grounding).
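A minimal sketch of such a converter setup (the exact options used in this notebook appear in its code cells; the scale value here is illustrative):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Keep rendered page images in the resulting DoclingDocument so that provenance
# bounding boxes can later be drawn on them for visual grounding.
pipeline_options = PdfPipelineOptions(
    generate_page_images=True,
    images_scale=2.0,  # illustrative; a higher scale gives sharper page images
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)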
"},{"location":"examples/visual_grounding/#ingestion","title":"Ingestion\u00b6","text":""},{"location":"examples/visual_grounding/#rag","title":"RAG\u00b6","text":""},{"location":"examples/visual_grounding/#visual-grounding","title":"Visual grounding\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/","title":"VLM pipeline with remote model","text":"Use the VLM pipeline with remote API models (LM Studio, Ollama, watsonx.ai).
What this example does
Shows how to configure ApiVlmOptions for different VLM providers.
Prerequisites
python-dotenv if using environment files.
requests for HTTP calls and python-dotenv if loading env vars from .env.
How to run
python docs/examples/vlm_pipeline_api_model.py.
Choosing a provider
Pick the desired pipeline_options.vlm_options = ... block below.
Set enable_remote_services=True to permit calling remote APIs.
Notes
http://localhost:1234/v1/chat/completions.http://localhost:11434/v1/chat/completions.WX_API_KEY and WX_PROJECT_ID in env/.env.import json\nimport logging\nimport os\nfrom pathlib import Path\nfrom typing import Optional\n\nimport requests\nfrom docling_core.types.doc.page import SegmentedPage\nfrom dotenv import load_dotenv\n\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\n### Example of ApiVlmOptions definitions\n\n#### Using LM Studio or VLLM (OpenAI-compatible APIs)\n\n\ndef openai_compatible_vlm_options(\n model: str,\n prompt: str,\n format: ResponseFormat,\n hostname_and_port,\n temperature: float = 0.7,\n max_tokens: int = 4096,\n api_key: str = \"\",\n skip_special_tokens=False,\n):\n headers = {}\n if api_key:\n headers[\"Authorization\"] = f\"Bearer {api_key}\"\n\n options = ApiVlmOptions(\n url=f\"http://{hostname_and_port}/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000\n params=dict(\n model=model,\n max_tokens=max_tokens,\n skip_special_tokens=skip_special_tokens, # needed for VLLM\n ),\n headers=headers,\n prompt=prompt,\n timeout=90,\n scale=2.0,\n temperature=temperature,\n response_format=format,\n )\n return options\n\n\n#### Using LM Studio with OlmOcr model\n\n\ndef lms_olmocr_vlm_options(model: str):\n class OlmocrVlmOptions(ApiVlmOptions):\n def build_prompt(self, page: Optional[SegmentedPage]) -> str:\n if page is None:\n return self.prompt.replace(\"#RAW_TEXT#\", \"\")\n\n anchor = [\n f\"Page dimensions: {int(page.dimension.width)}x{int(page.dimension.height)}\"\n ]\n\n for text_cell in page.textline_cells:\n if not text_cell.text.strip():\n continue\n bbox = text_cell.rect.to_bounding_box().to_bottom_left_origin(\n page.dimension.height\n )\n anchor.append(f\"[{int(bbox.l)}x{int(bbox.b)}] {text_cell.text}\")\n\n for image_cell in page.bitmap_resources:\n bbox = image_cell.rect.to_bounding_box().to_bottom_left_origin(\n page.dimension.height\n )\n anchor.append(\n f\"[Image {int(bbox.l)}x{int(bbox.b)} to {int(bbox.r)}x{int(bbox.t)}]\"\n )\n\n if len(anchor) == 1:\n anchor.append(\n f\"[Image 0x0 to {int(page.dimension.width)}x{int(page.dimension.height)}]\"\n )\n\n # Original prompt uses cells sorting. We are skipping it for simplicity.\n\n raw_text = \"\\n\".join(anchor)\n\n return self.prompt.replace(\"#RAW_TEXT#\", raw_text)\n\n def decode_response(self, text: str) -> str:\n # OlmOcr trained to generate json response with language, rotation and other info\n try:\n generated_json = json.loads(text)\n except json.decoder.JSONDecodeError:\n return \"\"\n\n return generated_json[\"natural_text\"]\n\n options = OlmocrVlmOptions(\n url=\"http://localhost:1234/v1/chat/completions\",\n params=dict(\n model=model,\n ),\n prompt=(\n \"Below is the image of one page of a document, as well as some raw textual\"\n \" content that was previously extracted for it. 
Just return the plain text\"\n \" representation of this document as if you were reading it naturally.\\n\"\n \"Do not hallucinate.\\n\"\n \"RAW_TEXT_START\\n#RAW_TEXT#\\nRAW_TEXT_END\"\n ),\n timeout=90,\n scale=1.0,\n max_size=1024, # from OlmOcr pipeline\n response_format=ResponseFormat.MARKDOWN,\n )\n return options\n\n\n#### Using Ollama\n\n\ndef ollama_vlm_options(model: str, prompt: str):\n options = ApiVlmOptions(\n url=\"http://localhost:11434/v1/chat/completions\", # the default Ollama endpoint\n params=dict(\n model=model,\n ),\n prompt=prompt,\n timeout=90,\n scale=1.0,\n response_format=ResponseFormat.MARKDOWN,\n )\n return options\n\n\n#### Using a cloud service like IBM watsonx.ai\n\n\ndef watsonx_vlm_options(model: str, prompt: str):\n load_dotenv()\n api_key = os.environ.get(\"WX_API_KEY\")\n project_id = os.environ.get(\"WX_PROJECT_ID\")\n\n def _get_iam_access_token(api_key: str) -> str:\n res = requests.post(\n url=\"https://iam.cloud.ibm.com/identity/token\",\n headers={\n \"Content-Type\": \"application/x-www-form-urlencoded\",\n },\n data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\",\n )\n res.raise_for_status()\n api_out = res.json()\n print(f\"{api_out=}\")\n return api_out[\"access_token\"]\n\n options = ApiVlmOptions(\n url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\",\n params=dict(\n model_id=model,\n project_id=project_id,\n parameters=dict(\n max_new_tokens=400,\n ),\n ),\n headers={\n \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key),\n },\n prompt=prompt,\n timeout=60,\n response_format=ResponseFormat.MARKDOWN,\n )\n return options\n\n\n### Usage and conversion\n\n\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2305.03393v1-pg9.pdf\"\n\n # Configure the VLM pipeline. Enabling remote services allows HTTP calls to\n # locally hosted APIs (LM Studio, Ollama) or cloud services.\n pipeline_options = VlmPipelineOptions(\n enable_remote_services=True # required when calling remote VLM endpoints\n )\n\n # The ApiVlmOptions() allows to interface with APIs supporting\n # the multi-modal chat interface. Here follow a few example on how to configure those.\n\n # One possibility is self-hosting the model, e.g., via LM Studio, Ollama or VLLM.\n #\n # e.g. 
with VLLM, serve granite-docling with these commands:\n # > vllm serve ibm-granite/granite-docling-258M --revision untied\n #\n # with LM Studio, serve granite-docling with these commands:\n # > lms server start\n # > lms load ibm-granite/granite-docling-258M-mlx\n\n # Example using the Granite-Docling model with LM Studio or VLLM:\n pipeline_options.vlm_options = openai_compatible_vlm_options(\n model=\"granite-docling-258m-mlx\", # For VLLM use \"ibm-granite/granite-docling-258M\"\n hostname_and_port=\"localhost:1234\", # LM studio defaults to port 1234, VLLM to 8000\n prompt=\"Convert this page to docling.\",\n format=ResponseFormat.DOCTAGS,\n api_key=\"\",\n )\n\n # Example using the OlmOcr (dynamic prompt) model with LM Studio:\n # (uncomment the following lines)\n # pipeline_options.vlm_options = lms_olmocr_vlm_options(\n # model=\"hf.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF\",\n # )\n\n # Example using the Granite Vision model with Ollama:\n # (uncomment the following lines)\n # pipeline_options.vlm_options = ollama_vlm_options(\n # model=\"granite3.2-vision:2b\",\n # prompt=\"OCR the full page to markdown.\",\n # )\n\n # Another possibility is using online services, e.g., watsonx.ai.\n # Using watsonx.ai requires setting env variables WX_API_KEY and WX_PROJECT_ID\n # (see the top-level docstring for details). You can use a .env file as well.\n # (uncomment the following lines)\n # pipeline_options.vlm_options = watsonx_vlm_options(\n # model=\"ibm/granite-vision-3-2-2b\", prompt=\"OCR the full page to markdown.\"\n # )\n\n # Create the DocumentConverter and launch the conversion.\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n pipeline_cls=VlmPipeline,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n print(result.document.export_to_markdown())\n\n\nif __name__ == \"__main__\":\n main()\n import json import logging import os from pathlib import Path from typing import Optional import requests from docling_core.types.doc.page import SegmentedPage from dotenv import load_dotenv from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline ### Example of ApiVlmOptions definitions #### Using LM Studio or VLLM (OpenAI-compatible APIs) def openai_compatible_vlm_options( model: str, prompt: str, format: ResponseFormat, hostname_and_port, temperature: float = 0.7, max_tokens: int = 4096, api_key: str = \"\", skip_special_tokens=False, ): headers = {} if api_key: headers[\"Authorization\"] = f\"Bearer {api_key}\" options = ApiVlmOptions( url=f\"http://{hostname_and_port}/v1/chat/completions\", # LM studio defaults to port 1234, VLLM to 8000 params=dict( model=model, max_tokens=max_tokens, skip_special_tokens=skip_special_tokens, # needed for VLLM ), headers=headers, prompt=prompt, timeout=90, scale=2.0, temperature=temperature, response_format=format, ) return options #### Using LM Studio with OlmOcr model def lms_olmocr_vlm_options(model: str): class OlmocrVlmOptions(ApiVlmOptions): def build_prompt(self, page: Optional[SegmentedPage]) -> str: if page is None: return self.prompt.replace(\"#RAW_TEXT#\", \"\") anchor = [ f\"Page dimensions: {int(page.dimension.width)}x{int(page.dimension.height)}\" ] for text_cell 
in page.textline_cells: if not text_cell.text.strip(): continue bbox = text_cell.rect.to_bounding_box().to_bottom_left_origin( page.dimension.height ) anchor.append(f\"[{int(bbox.l)}x{int(bbox.b)}] {text_cell.text}\") for image_cell in page.bitmap_resources: bbox = image_cell.rect.to_bounding_box().to_bottom_left_origin( page.dimension.height ) anchor.append( f\"[Image {int(bbox.l)}x{int(bbox.b)} to {int(bbox.r)}x{int(bbox.t)}]\" ) if len(anchor) == 1: anchor.append( f\"[Image 0x0 to {int(page.dimension.width)}x{int(page.dimension.height)}]\" ) # Original prompt uses cells sorting. We are skipping it for simplicity. raw_text = \"\\n\".join(anchor) return self.prompt.replace(\"#RAW_TEXT#\", raw_text) def decode_response(self, text: str) -> str: # OlmOcr trained to generate json response with language, rotation and other info try: generated_json = json.loads(text) except json.decoder.JSONDecodeError: return \"\" return generated_json[\"natural_text\"] options = OlmocrVlmOptions( url=\"http://localhost:1234/v1/chat/completions\", params=dict( model=model, ), prompt=( \"Below is the image of one page of a document, as well as some raw textual\" \" content that was previously extracted for it. Just return the plain text\" \" representation of this document as if you were reading it naturally.\\n\" \"Do not hallucinate.\\n\" \"RAW_TEXT_START\\n#RAW_TEXT#\\nRAW_TEXT_END\" ), timeout=90, scale=1.0, max_size=1024, # from OlmOcr pipeline response_format=ResponseFormat.MARKDOWN, ) return options #### Using Ollama def ollama_vlm_options(model: str, prompt: str): options = ApiVlmOptions( url=\"http://localhost:11434/v1/chat/completions\", # the default Ollama endpoint params=dict( model=model, ), prompt=prompt, timeout=90, scale=1.0, response_format=ResponseFormat.MARKDOWN, ) return options #### Using a cloud service like IBM watsonx.ai def watsonx_vlm_options(model: str, prompt: str): load_dotenv() api_key = os.environ.get(\"WX_API_KEY\") project_id = os.environ.get(\"WX_PROJECT_ID\") def _get_iam_access_token(api_key: str) -> str: res = requests.post( url=\"https://iam.cloud.ibm.com/identity/token\", headers={ \"Content-Type\": \"application/x-www-form-urlencoded\", }, data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\", ) res.raise_for_status() api_out = res.json() print(f\"{api_out=}\") return api_out[\"access_token\"] options = ApiVlmOptions( url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\", params=dict( model_id=model, project_id=project_id, parameters=dict( max_new_tokens=400, ), ), headers={ \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key), }, prompt=prompt, timeout=60, response_format=ResponseFormat.MARKDOWN, ) return options ### Usage and conversion def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2305.03393v1-pg9.pdf\" # Configure the VLM pipeline. Enabling remote services allows HTTP calls to # locally hosted APIs (LM Studio, Ollama) or cloud services. pipeline_options = VlmPipelineOptions( enable_remote_services=True # required when calling remote VLM endpoints ) # The ApiVlmOptions() allows to interface with APIs supporting # the multi-modal chat interface. Here follow a few example on how to configure those. # One possibility is self-hosting the model, e.g., via LM Studio, Ollama or VLLM. # # e.g. 
with VLLM, serve granite-docling with these commands: # > vllm serve ibm-granite/granite-docling-258M --revision untied # # with LM Studio, serve granite-docling with these commands: # > lms server start # > lms load ibm-granite/granite-docling-258M-mlx # Example using the Granite-Docling model with LM Studio or VLLM: pipeline_options.vlm_options = openai_compatible_vlm_options( model=\"granite-docling-258m-mlx\", # For VLLM use \"ibm-granite/granite-docling-258M\" hostname_and_port=\"localhost:1234\", # LM studio defaults to port 1234, VLLM to 8000 prompt=\"Convert this page to docling.\", format=ResponseFormat.DOCTAGS, api_key=\"\", ) # Example using the OlmOcr (dynamic prompt) model with LM Studio: # (uncomment the following lines) # pipeline_options.vlm_options = lms_olmocr_vlm_options( # model=\"hf.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF\", # ) # Example using the Granite Vision model with Ollama: # (uncomment the following lines) # pipeline_options.vlm_options = ollama_vlm_options( # model=\"granite3.2-vision:2b\", # prompt=\"OCR the full page to markdown.\", # ) # Another possibility is using online services, e.g., watsonx.ai. # Using watsonx.ai requires setting env variables WX_API_KEY and WX_PROJECT_ID # (see the top-level docstring for details). You can use a .env file as well. # (uncomment the following lines) # pipeline_options.vlm_options = watsonx_vlm_options( # model=\"ibm/granite-vision-3-2-2b\", prompt=\"OCR the full page to markdown.\" # ) # Create the DocumentConverter and launch the conversion. doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, pipeline_cls=VlmPipeline, ) } ) result = doc_converter.convert(input_doc_path) print(result.document.export_to_markdown()) if __name__ == \"__main__\": main() In\u00a0[\u00a0]: Copied! \n"},{"location":"examples/experimental/process_table_crops/","title":"Process table crops","text":"In\u00a0[\u00a0]: Copied!
\"\"\"Run Docling on an image using the experimental TableCrops layout model.\"\"\"\n\"\"\"Run Docling on an image using the experimental TableCrops layout model.\"\"\" In\u00a0[\u00a0]: Copied!
from __future__ import annotations\nfrom __future__ import annotations In\u00a0[\u00a0]: Copied!
from pathlib import Path\nfrom pathlib import Path In\u00a0[\u00a0]: Copied!
import docling\nfrom docling.datamodel.document import InputFormat\nfrom docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, ImageFormatOption\nfrom docling.experimental.datamodel.table_crops_layout_options import (\n TableCropsLayoutOptions,\n)\nfrom docling.experimental.models.table_crops_layout_model import TableCropsLayoutModel\nfrom docling.models.factories import get_layout_factory\nimport docling from docling.datamodel.document import InputFormat from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions from docling.document_converter import DocumentConverter, ImageFormatOption from docling.experimental.datamodel.table_crops_layout_options import ( TableCropsLayoutOptions, ) from docling.experimental.models.table_crops_layout_model import TableCropsLayoutModel from docling.models.factories import get_layout_factory In\u00a0[\u00a0]: Copied!
def main() -> None:\n sample_image = \"tests/data/2305.03393v1-table_crop.png\"\n\n pipeline_options = ThreadedPdfPipelineOptions(\n layout_options=TableCropsLayoutOptions(),\n do_table_structure=True,\n generate_page_images=True,\n )\n\n converter = DocumentConverter(\n allowed_formats=[InputFormat.IMAGE],\n format_options={\n InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)\n },\n )\n\n conv_res = converter.convert(sample_image)\n\n print(conv_res.document.tables[0].export_to_markdown())\n def main() -> None: sample_image = \"tests/data/2305.03393v1-table_crop.png\" pipeline_options = ThreadedPdfPipelineOptions( layout_options=TableCropsLayoutOptions(), do_table_structure=True, generate_page_images=True, ) converter = DocumentConverter( allowed_formats=[InputFormat.IMAGE], format_options={ InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options) }, ) conv_res = converter.convert(sample_image) print(conv_res.document.tables[0].export_to_markdown()) In\u00a0[\u00a0]: Copied! if __name__ == \"__main__\":\n main()\nif __name__ == \"__main__\": main()"},{"location":"faq/","title":"FAQ","text":"
This is a collection of FAQs gathered from user questions on https://github.com/docling-project/docling/discussions.
Is Python 3.14 supported? Is Python 3.13 supported? Install conflicts with numpy (python 3.13) Is macOS x86_64 supported? I get this error ImportError: libGL.so.1: cannot open shared object file: No such file or directory Are text styles (bold, underline, etc) supported? How do I run completely offline? Which model weights are needed to run Docling? SSL error downloading model weights Which OCR languages are supported? Some images are missing from MS Word and Powerpoint HybridChunker triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model' How to use flash attention?"},{"location":"faq/#is-python-314-supported","title":"Is Python 3.14 supported?","text":"Python 3.14 is supported from Docling 2.59.0.
"},{"location":"faq/#is-python-313-supported","title":"Is Python 3.13 supported?","text":"Python 3.13 is supported from Docling 2.18.0.
"},{"location":"faq/#install-conflicts-with-numpy-python-313","title":"Install conflicts with numpy (python 3.13)","text":"When using docling-ibm-models>=2.0.7 and deepsearch-glm>=0.26.2 these issues should not show up anymore. Docling supports numpy versions >=1.24.4,<3.0.0 which should match all usages.
For older versions
This has been observed when installing docling and langchain via Poetry.
...\nThus, docling (>=2.7.0,<3.0.0) requires numpy (>=1.26.4,<2.0.0).\nSo, because ... depends on both numpy (>=2.0.2,<3.0.0) and docling (^2.7.0), version solving failed.\n NumPy only added Python 3.13 support starting with a 2.x.y version. To prepare for 3.13, Docling depends on a numpy 2.x.y version on Python 3.13 and on a 1.x.y version otherwise. If you allow Python 3.13 in your pyproject.toml, Poetry will try to reconcile Docling's numpy requirement for 3.13 (some 2.x.y) with LangChain's requirement (some 1.x.y), leading to the error above.
Check if Python 3.13 is among the Python versions allowed by your pyproject.toml and if so, remove it and try again. E.g., if you have python = \"^3.10\", use python = \">=3.10,<3.13\" instead.
If you want to retain compatibility with Python 3.9-3.13, you can also use a version selector in pyproject.toml similar to the following:
numpy = [\n { version = \"^2.1.0\", markers = 'python_version >= \"3.13\"' },\n { version = \"^1.24.4\", markers = 'python_version < \"3.13\"' },\n]\n Source: Issue #283
"},{"location":"faq/#is-macos-x86_64-supported","title":"Is macOS x86_64 supported?","text":"Yes, Docling (still) supports running the standard pipeline on macOS x86_64.
However, users might run into a combination of incompatible dependencies on a fresh install. Docling depends on PyTorch, which dropped support for macOS x86_64 after the 2.2.2 release, and that older PyTorch version works only with NumPy 1.x; users must therefore ensure the correct NumPy version is installed.
pip install docling \"numpy<2.0.0\"\n Source: Issue #1694.
"},{"location":"faq/#i-get-this-error-importerror-libglso1-cannot-open-shared-object-file-no-such-file-or-directory","title":"I get this error ImportError: libGL.so.1: cannot open shared object file: No such file or directory","text":"This error orginates from conflicting OpenCV distribution in some Docling third-party dependencies. opencv-python and opencv-python-headless both define the same python package cv2 and, if installed together, this often creates conflicts. Moreover, the opencv-python package (which is more common) depends on the OpenGL UI framework, which is usually not included for headless environments like Docker containers or remote VMs.
When you encounter the error above, you have two possible solutions.
Solution 1: Force the headless OpenCV (preferred)
pip uninstall -y opencv-python opencv-python-headless\npip install --no-cache-dir opencv-python-headless\n Solution 2: Install the libGL system dependency.
Debian-based: apt-get install libgl1\nRHEL / Fedora: dnf install mesa-libGL\n"},{"location":"faq/#are-text-styles-bold-underline-etc-supported","title":"Are text styles (bold, underline, etc) supported?","text":"Text styles are supported in the DoclingDocument format. Currently only the declarative backends (i.e. the ones used for docx, pptx, markdown, html, etc) are able to set the correct text styles. Support for PDF is not yet available.
Docling does not use any remote services, hence it can run in completely isolated, air-gapped environments.
The only requirement is pointing the Docling runtime to the location where the model artifacts have been stored.
For example
pipeline_options = PdfPipelineOptions(artifacts_path=\"your location\")\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n Source: Issue #326
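If you also need to fetch the weights ahead of time from a machine with internet access, recent Docling versions ship a helper command for prefetching them; verify its availability and options with docling-tools --help in your installation:
# Pre-download the default model weights on a connected machine, then copy the
# resulting folder to the offline environment and point artifacts_path at it.
docling-tools models download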
"},{"location":"faq/#which-model-weights-are-needed-to-run-docling","title":"Which model weights are needed to run Docling?","text":"Model weights are needed for the AI models used in the PDF pipeline. Other document types (docx, pptx, etc) do not have any such requirement.
For processing PDF documents, Docling requires the model weights from https://huggingface.co/ds4sd/docling-models.
When OCR is enabled, some engines also require model artifacts. One example is EasyOCR, for which Docling provides dedicated pipeline options to control the runtime behavior.
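As a sketch of such options (assuming the EasyOcrOptions fields shown below, which are present in recent Docling versions; the directory is a placeholder):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Point EasyOCR at locally stored models and disable downloads at runtime.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(
    model_storage_directory="/path/to/easyocr-models",  # placeholder location
    download_enabled=False,
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)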
"},{"location":"faq/#ssl-error-downloading-model-weights","title":"SSL error downloading model weights","text":"URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)>\n Similar SSL download errors have been observed by some users. This happens when model weights are fetched from Hugging Face. The error could happen when the python environment doesn't have an up-to-date list of trusted certificates.
Possible solutions are:
pip install --upgrade certifi\nPoint SSL_CERT_FILE and REQUESTS_CA_BUNDLE to the value of python -m certifi: CERT_PATH=$(python -m certifi)\nexport SSL_CERT_FILE=${CERT_PATH}\nexport REQUESTS_CA_BUNDLE=${CERT_PATH}\nDocling supports multiple OCR engines, each one with its own list of supported languages. Here is a collection of links to the original OCR engines' documentation listing the supported languages.
Setting the OCR language in Docling is done via the OCR pipeline options:
from docling.datamodel.pipeline_options import PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.ocr_options.lang = [\"fr\", \"de\", \"es\", \"en\"] # example of languages for EasyOCR\n"},{"location":"faq/#some-images-are-missing-from-ms-word-and-powerpoint","title":"Some images are missing from MS Word and Powerpoint","text":"The image processing library used by Docling is able to handle embedded WMF images only on Windows platform. If you are on other operating systems, these images will be ignored.
"},{"location":"faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model","title":"HybridChunker triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model'","text":"TLDR: In the context of the HybridChunker, this is a known & ancitipated \"false alarm\".
Details:
Using the HybridChunker often triggers a warning like this:
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
This warning is emitted by transformers and means that actually running this sequence through the model would result in indexing errors, i.e. the problematic case arises only if one indeed passes the particular sequence through the (embedding) model.
In our case though, this occurs as a \"false alarm\", since what happens is the following: during chunking, the tokenizer is only used to count tokens on intermediate texts, which can be longer than the model limit; these long texts are never passed to the embedding model themselves.
What is important is the actual token length of the produced chunks. The snippet below can be used for getting the actual maximum chunk size (for users wanting to confirm that this does not exceed the model limit):
chunk_max_len = 0\nfor i, chunk in enumerate(chunks):\n ser_txt = chunker.serialize(chunk=chunk)\n ser_tokens = len(tokenizer.tokenize(ser_txt))\n if ser_tokens > chunk_max_len:\n chunk_max_len = ser_tokens\n print(f\"{i}\\t{ser_tokens}\\t{repr(ser_txt[:100])}...\")\nprint(f\"Longest chunk yielded: {chunk_max_len} tokens\")\nprint(f\"Model max length: {tokenizer.model_max_length}\")\n Also see docling#725.
Source: Issue docling-core#119
"},{"location":"faq/#how-to-use-flash-attention","title":"How to use flash attention?","text":"When running models in Docling on CUDA devices, you can enable the usage of the Flash Attention2 library.
Using environment variables:
DOCLING_CUDA_USE_FLASH_ATTENTION2=1\n Using code:
from docling.datamodel.accelerator_options import (\n AcceleratorOptions,\n)\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\n\npipeline_options = VlmPipelineOptions(\n accelerator_options=AcceleratorOptions(cuda_use_flash_attention2=True)\n)\n This requires having the flash-attn package installed. Below are two alternative ways to install it:
# Building from sources (requires the CUDA dev environment)\npip install flash-attn\n\n# Using pre-built wheels (not available in all possible setups)\nFLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn\n"},{"location":"getting_started/installation/","title":"Installation","text":"To use Docling, simply install docling from your Python package manager, e.g. pip:
pip install docling\n Works on macOS, Linux, and Windows, with support for both x86_64 and arm64 architectures.
Alternative PyTorch distributions The Docling models depend on the PyTorch library. Depending on your architecture, you might want to use a different distribution of torch. For example, you might want support for a different accelerator or for a cpu-only version. All the different ways of installing torch are listed on their website https://pytorch.org/.
One common situation is installing on Linux systems with cpu-only support. In this case, we suggest installing Docling with the following options:
# Example for installing the Linux cpu-only version\npip install docling --extra-index-url https://download.pytorch.org/whl/cpu\n Installation on macOS Intel (x86_64) When installing Docling on macOS with Intel processors, you might encounter errors with PyTorch compatibility. This happens because newer PyTorch versions (2.6.0+) no longer provide wheels for Intel-based Macs.
If you're using an Intel Mac, install Docling with a compatible PyTorch version. Note: PyTorch 2.2.2 requires Python 3.12 or lower. Make sure you're not using Python 3.13+.
# For uv users\nuv add torch==2.2.2 torchvision==0.17.2 docling\n\n# For pip users\npip install \"docling[mac_intel]\"\n\n# For Poetry users\npoetry add docling\n"},{"location":"getting_started/installation/#available-extras","title":"Available extras","text":"The docling package is designed to offer a working solution for the Docling default options. Some Docling functionalities require additional third-party packages and are therefore installed only if selected as extras (or installed independently).
The following table summarizes the extras available in the docling package. They can be activated with: pip install \"docling[NAME1,NAME2]\"
asr Installs dependencies for running the ASR pipeline. vlm Installs dependencies for running the VLM pipeline. easyocr Installs the EasyOCR OCR engine. tesserocr Installs the Tesseract binding for using it as OCR engine. ocrmac Installs the OcrMac OCR engine. rapidocr Installs the RapidOCR OCR engine with onnxruntime backend."},{"location":"getting_started/installation/#ocr-engines","title":"OCR engines","text":"Docling supports multiple OCR engines for processing scanned documents. The current version provides the following engines.
Engine Installation Usage EasyOCReasyocr extra or via pip install easyocr. EasyOcrOptions Tesseract System dependency. See description for Tesseract and Tesserocr below. TesseractOcrOptions Tesseract CLI System dependency. See description below. TesseractCliOcrOptions OcrMac System dependency. See description below. OcrMacOptions RapidOCR rapidocr extra can or via pip install rapidocr onnxruntime RapidOcrOptions OnnxTR Can be installed via the plugin system pip install \"docling-ocr-onnxtr[cpu]\". Please take a look at docling-OCR-OnnxTR. OnnxtrOcrOptions The Docling DocumentConverter allows to choose the OCR engine with the ocr_options settings. For example
from docling.datamodel.base_models import ConversionStatus, PipelineOptions\nfrom docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions\nfrom docling.document_converter import DocumentConverter\n\npipeline_options = PipelineOptions()\npipeline_options.do_ocr = True\npipeline_options.ocr_options = TesseractOcrOptions() # Use Tesseract\n\ndoc_converter = DocumentConverter(\n pipeline_options=pipeline_options,\n)\n Tesseract installation Tesseract is a popular OCR engine which is available on most operating systems. For using this engine with Docling, Tesseract must be installed on your system, using the packaging tool of your choice. Below we provide example commands. After installing Tesseract you are expected to provide the path to its language files using the TESSDATA_PREFIX environment variable (note that it must terminate with a slash /).
brew install tesseract leptonica pkg-config\nTESSDATA_PREFIX=/opt/homebrew/share/tessdata/\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config\nTESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n dnf install tesseract tesseract-devel tesseract-langpack-eng tesseract-osd leptonica-devel\nTESSDATA_PREFIX=/usr/share/tesseract/tessdata/\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n Linking to Tesseract The most efficient usage of the Tesseract library is via linking. Docling is using the Tesserocr package for this.
If you get into installation issues of Tesserocr, we suggest using the following installation options:
pip uninstall tesserocr\npip install --no-binary :all: tesserocr\n"},{"location":"getting_started/installation/#development-setup","title":"Development setup","text":"To develop Docling features, bugfixes etc., install as follows from your local clone's root dir:
uv sync --all-extras\n"},{"location":"getting_started/quickstart/","title":"Quickstart","text":""},{"location":"getting_started/quickstart/#basic-usage","title":"Basic usage","text":""},{"location":"getting_started/quickstart/#python","title":"Python","text":"In Docling, working with documents is as simple as:
For example, the snippet below shows conversion with export to Markdown:
from docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\" # file path or URL\nconverter = DocumentConverter()\ndoc = converter.convert(source).document\n\nprint(doc.export_to_markdown()) # output: \"### Docling Technical Report[...]\"\n Docling supports a wide array of file formats and, as outlined in the architecture guide, provides a versatile document model along with a full suite of supported operations.
"},{"location":"getting_started/quickstart/#cli","title":"CLI","text":"You can additionally use Docling directly from your terminal, for instance:
docling https://arxiv.org/pdf/2206.01062\n The CLI provides various options, such as \ud83e\udd5aGraniteDocling (incl. MLX acceleration) & other VLMs:
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062\n For all available options, run docling --help or check the CLI reference.
Check out the Usage subpages (navigation menu on the left) as well as our featured examples for additional usage workflows, including conversion customization, RAG, framework integrations, chunking, serialization, enrichments, and much more!
"},{"location":"integrations/","title":"Integrations","text":"In this space, you can explore various Docling integrations with leading frameworks and tools!
Here some of our picks to get you started:
\ud83d\udc48 ... and there is much more: explore all integrations using the navigation menu on the side
A glimpse into Docling's ecosystem"},{"location":"integrations/apify/","title":"Apify","text":"You can run Docling in the cloud without installation using the Docling Actor on Apify platform. Simply provide a document URL and get the processed result:
apify call vancura/docling -i '{\n \"options\": {\n \"to_formats\": [\"md\", \"json\", \"html\", \"text\", \"doctags\"]\n },\n \"http_sources\": [\n {\"url\": \"https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf\"},\n {\"url\": \"https://arxiv.org/pdf/2408.09869\"}\n ]\n}'\n The Actor stores results in:
OUTPUT_RESULT)DOCLING_LOG)Read more about the Docling Actor, including how to use it via the Apify API and CLI.
Docling is available as a Java integration in Arconia.
Docling is available as an extraction backend in the Bee framework.
Docling is available in Cloudera through the RAG Studio Accelerator for Machine Learning Projects (AMP).
Docling is available in CrewAI as the CrewDoclingSource knowledge source.
Docling is used by the Data Prep Kit open-source toolkit for preparing unstructured data for LLM application development ranging from laptop scale to datacenter scale.
"},{"location":"integrations/data_prep_kit/#components","title":"Components","text":""},{"location":"integrations/data_prep_kit/#pdf-ingestion-to-parquet","title":"PDF ingestion to Parquet","text":"Docling is available as a file conversion method in DocETL:
Docling is available as a converter in Haystack:
Docling is available in Hector as an MCP-based document parser for RAG systems and document stores.
Hector is a production-grade A2A-native agent platform that integrates with Docling via the MCP server for advanced document parsing capabilities.
Docling is powering document processing in InstructLab, enabling users to unlock the knowledge hidden in documents and present it to InstructLab's fine-tuning for aligning AI models to the user's specific data.
More details can be found in this blog post.
Docling is available in Kotaemon as the DoclingReader loader:
Docling is available as an official LangChain extension.
To get started, check out the step-by-step guide in LangChain.
Docling is available on the Langflow visual low-code platform.
Docling is available as an official LlamaIndex extension.
To get started, check out the step-by-step guide in LlamaIndex.
"},{"location":"integrations/llamaindex/#components","title":"Components","text":""},{"location":"integrations/llamaindex/#docling-reader","title":"Docling Reader","text":"Reads document files and uses Docling to populate LlamaIndex Document objects \u2014 either serializing Docling's data model (losslessly, e.g. as JSON) or exporting to a simplified format (lossily, e.g. as Markdown).
Reads LlamaIndex Document objects populated in Docling's format by Docling Reader and, using its knowledge of the Docling format, parses them to LlamaIndex Node objects for downstream usage in LlamaIndex applications, e.g. as chunks for embedding.
Docling is powering the NVIDIA PDF to Podcast agentic AI blueprint:
Docling is available an ingestion engine for OpenContracts, allowing you to use Docling's OCR engine(s), chunker(s), labels, etc. and load them into a platform supporting bulk data extraction, text annotating, and question-answering:
Docling is available as a plugin for Open WebUI.
Docling is available in Prodigy as a Prodigy-PDF plugin recipe.
More details can be found in this blog post.
Docling is available as a Quarkus extension! See the extension documentation for more information.
Docling is powering document processing in Red Hat Enterprise Linux AI (RHEL AI), enabling users to unlock the knowledge hidden in documents and present it to InstructLab's fine-tuning for aligning AI models to the user's specific data.
Docling is available in spaCy as the spaCy Layout plugin.
More details can be found in this blog post.
Docling is available as a text extraction backend for txtai.
Docling is available as a document parser in Vectara.
This page provides documentation for our command line tools.
"},{"location":"reference/cli/#docling","title":"docling","text":"Usage:
docling [OPTIONS] source\n Options:
Name Type Description Default--from choice (docx | pptx | html | image | pdf | asciidoc | md | csv | xlsx | xml_uspto | xml_jats | mets_gbs | json_docling | audio | vtt) Specify input formats to convert from. Defaults to all formats. None --to choice (md | json | html | html_split_page | text | doctags) Specify output formats. Defaults to Markdown. None --show-layout / --no-show-layout boolean If enabled, the page images will show the bounding-boxes of the items. False --headers text Specify http request headers used when fetching url input sources in the form of a JSON string None --image-export-mode choice (placeholder | embedded | referenced) Image export mode for the document (only in case of JSON, Markdown or HTML). With placeholder, only the position of the image is marked in the output. In embedded mode, the image is embedded as base64 encoded string. In referenced mode, the image is exported in PNG format and referenced from the main exported document. ImageRefMode.EMBEDDED --pipeline choice (legacy | standard | vlm | asr) Choose the pipeline to process PDF or image files. ProcessingPipeline.STANDARD --vlm-model choice (smoldocling | smoldocling_vllm | granite_vision | granite_vision_vllm | granite_vision_ollama | got_ocr_2 | granite_docling | granite_docling_vllm) Choose the VLM model to use with PDF or image files. VlmModelType.GRANITEDOCLING --asr-model choice (whisper_tiny | whisper_small | whisper_medium | whisper_base | whisper_large | whisper_turbo | whisper_tiny_mlx | whisper_small_mlx | whisper_medium_mlx | whisper_base_mlx | whisper_large_mlx | whisper_turbo_mlx | whisper_tiny_native | whisper_small_native | whisper_medium_native | whisper_base_native | whisper_large_native | whisper_turbo_native) Choose the ASR model to use with audio/video files. AsrModelType.WHISPER_TINY --ocr / --no-ocr boolean If enabled, the bitmap content will be processed using OCR. True --force-ocr / --no-force-ocr boolean Replace any existing text with OCR generated text over the full content. False --tables / --no-tables boolean If enabled, the table structure model will be used to extract table information. True --ocr-engine text The OCR engine to use. When --allow-external-plugins is not set, the available values are: auto, easyocr, ocrmac, rapidocr, tesserocr, tesseract. Use the option --show-external-plugins to see the options allowed with external plugins. auto --ocr-lang text Provide a comma-separated list of languages used by the OCR engine. Note that each OCR engine has different values for the language names. None --psm integer Page Segmentation Mode for the OCR engine (0-13). None --pdf-backend choice (pypdfium2 | dlparse_v1 | dlparse_v2 | dlparse_v4) The PDF backend to use. PdfBackend.DLPARSE_V4 --pdf-password text Password for protected PDF documents None --table-mode choice (fast | accurate) The mode to use in the table structure model. TableFormerMode.ACCURATE --enrich-code / --no-enrich-code boolean Enable the code enrichment model in the pipeline. False --enrich-formula / --no-enrich-formula boolean Enable the formula enrichment model in the pipeline. False --enrich-picture-classes / --no-enrich-picture-classes boolean Enable the picture classification enrichment model in the pipeline. False --enrich-picture-description / --no-enrich-picture-description boolean Enable the picture description model in the pipeline. False --artifacts-path path If provided, the location of the model artifacts. 
None --enable-remote-services / --no-enable-remote-services boolean Must be enabled when using models connecting to remote services. False --allow-external-plugins / --no-allow-external-plugins boolean Must be enabled for loading modules from third-party plugins. False --show-external-plugins / --no-show-external-plugins boolean List the third-party plugins which are available when the option --allow-external-plugins is set. False --abort-on-error / --no-abort-on-error boolean If enabled, the processing will be aborted when the first error is encountered. False --output path Output directory where results are saved. . --verbose, -v integer Set the verbosity level. -v for info logging, -vv for debug logging. 0 --debug-visualize-cells / --no-debug-visualize-cells boolean Enable debug output which visualizes the PDF cells False --debug-visualize-ocr / --no-debug-visualize-ocr boolean Enable debug output which visualizes the OCR cells False --debug-visualize-layout / --no-debug-visualize-layout boolean Enable debug output which visualizes the layour clusters False --debug-visualize-tables / --no-debug-visualize-tables boolean Enable debug output which visualizes the table cells False --version boolean Show version information. None --document-timeout float The timeout for processing each document, in seconds. None --num-threads integer Number of threads 4 --device choice (auto | cpu | cuda | mps) Accelerator device AcceleratorDevice.AUTO --logo boolean Docling logo None --page-batch-size integer Number of pages processed in one batch. Default: 4 4 --help boolean Show this message and exit. False"},{"location":"reference/docling_document/","title":"Docling Document","text":"This is an automatic generated API reference of the DoclingDocument type.
"},{"location":"reference/docling_document/#docling_core.types.doc","title":"doc","text":"Package for models defined by the Document type.
Classes:
DoclingDocument \u2013 DoclingDocument.
DocumentOrigin \u2013 FileSource.
DocItem \u2013 DocItem.
DocItemLabel \u2013 DocItemLabel.
ProvenanceItem \u2013 ProvenanceItem.
GroupItem \u2013 GroupItem.
GroupLabel \u2013 GroupLabel.
NodeItem \u2013 NodeItem.
PageItem \u2013 PageItem.
FloatingItem \u2013 FloatingItem.
TextItem \u2013 TextItem.
TableItem \u2013 TableItem.
TableCell \u2013 TableCell.
TableData \u2013 BaseTableData.
TableCellLabel \u2013 TableCellLabel.
KeyValueItem \u2013 KeyValueItem.
SectionHeaderItem \u2013 SectionItem.
PictureItem \u2013 PictureItem.
ImageRef \u2013 ImageRef.
PictureClassificationClass \u2013 PictureClassificationData.
PictureClassificationData \u2013 PictureClassificationData.
RefItem \u2013 RefItem.
BoundingBox \u2013 BoundingBox.
CoordOrigin \u2013 CoordOrigin.
ImageRefMode \u2013 ImageRefMode.
Size \u2013 Size.
Bases: BaseModel
DoclingDocument.
Methods:
add_code \u2013 add_code.
add_document \u2013 Adds the content from the body of a DoclingDocument to this document under a specific parent.
add_form \u2013 add_form.
add_formula \u2013 add_formula.
add_group \u2013 add_group.
add_heading \u2013 add_heading.
add_inline_group \u2013 add_inline_group.
add_key_values \u2013 add_key_values.
add_list_group \u2013 add_list_group.
add_list_item \u2013 add_list_item.
add_node_items \u2013 Adds multiple NodeItems and their children under a parent in this document.
add_ordered_list \u2013 add_ordered_list.
add_page \u2013 add_page.
add_picture \u2013 add_picture.
add_table \u2013 add_table.
add_table_cell \u2013 Add a table cell to the table.
add_text \u2013 add_text.
add_title \u2013 add_title.
add_unordered_list \u2013 add_unordered_list.
append_child_item \u2013 Adds an item.
check_version_is_compatible \u2013 Check if this document version is compatible with SDK schema version.
concatenate \u2013 Concatenate multiple documents into a single document.
delete_items \u2013 Deletes an item, given its instance or ref, and any children it has.
delete_items_range \u2013 Deletes all NodeItems and their children in the range from the start NodeItem to the end NodeItem.
export_to_dict \u2013 Export to dict.
export_to_doctags \u2013 Exports the document content to a DocumentToken format.
export_to_document_tokens \u2013 Export to DocTags format.
export_to_element_tree \u2013 Export_to_element_tree.
export_to_html \u2013 Serialize to HTML.
export_to_markdown \u2013 Serialize to Markdown.
export_to_text \u2013 export_to_text.
extract_items_range \u2013 Extracts NodeItems and children in the range from the start NodeItem to the end as a new DoclingDocument.
filter \u2013 Create a new document based on the provided filter parameters.
get_visualization \u2013 Get visualization of the document as images by page.
insert_code \u2013 Creates a new CodeItem item and inserts it into the document.
insert_document \u2013 Inserts the content from the body of a DoclingDocument into this document at a specific position.
insert_form \u2013 Creates a new FormItem item and inserts it into the document.
insert_formula \u2013 Creates a new FormulaItem item and inserts it into the document.
insert_group \u2013 Creates a new GroupItem item and inserts it into the document.
insert_heading \u2013 Creates a new SectionHeaderItem item and inserts it into the document.
insert_inline_group \u2013 Creates a new InlineGroup item and inserts it into the document.
insert_item_after_sibling \u2013 Inserts an item, given its node_item instance, after other as a sibling.
insert_item_before_sibling \u2013 Inserts an item, given its node_item instance, before other as a sibling.
insert_key_values \u2013 Creates a new KeyValueItem item and inserts it into the document.
insert_list_group \u2013 Creates a new ListGroup item and inserts it into the document.
insert_list_item \u2013 Creates a new ListItem item and inserts it into the document.
insert_node_items \u2013 Insert multiple NodeItems and their children at a specific position in the document.
insert_picture \u2013 Creates a new PictureItem item and inserts it into the document.
insert_table \u2013 Creates a new TableItem item and inserts it into the document.
insert_text \u2013 Creates a new TextItem item and inserts it into the document.
insert_title \u2013 Creates a new TitleItem item and inserts it into the document.
iterate_items \u2013 Iterate elements with level.
load_from_doctags \u2013 Load Docling document from lists of DocTags and Images.
load_from_json \u2013 load_from_json.
load_from_yaml \u2013 load_from_yaml.
num_pages \u2013 num_pages.
print_element_tree \u2013 Print_element_tree.
replace_item \u2013 Replace item with new item.
save_as_doctags \u2013 Save the document content to DocTags format.
save_as_document_tokens \u2013 Save the document content to a DocumentToken format.
save_as_html \u2013 Save to HTML.
save_as_json \u2013 Save as json.
save_as_markdown \u2013 Save to markdown.
save_as_yaml \u2013 Save as yaml.
transform_to_content_layer \u2013 transform_to_content_layer.
validate_document \u2013 validate_document.
validate_misplaced_list_items \u2013 validate_misplaced_list_items.
validate_tree \u2013 validate_tree.
Attributes:
body (GroupItem) \u2013 form_items (List[FormItem]) \u2013 furniture (Annotated[GroupItem, Field(deprecated=True)]) \u2013 groups (List[Union[ListGroup, InlineGroup, GroupItem]]) \u2013 key_value_items (List[KeyValueItem]) \u2013 name (str) \u2013 origin (Optional[DocumentOrigin]) \u2013 pages (Dict[int, PageItem]) \u2013 pictures (List[PictureItem]) \u2013 schema_name (Literal['DoclingDocument']) \u2013 tables (List[TableItem]) \u2013 texts (List[Union[TitleItem, SectionHeaderItem, ListItem, CodeItem, FormulaItem, TextItem]]) \u2013 version (Annotated[str, StringConstraints(pattern=VERSION_PATTERN, strict=True)]) \u2013 body: GroupItem = GroupItem(name='_root_', self_ref='#/body')\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.form_items","title":"form_items","text":"form_items: List[FormItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.furniture","title":"furniture","text":"furniture: Annotated[GroupItem, Field(deprecated=True)] = GroupItem(name='_root_', self_ref='#/furniture', content_layer=FURNITURE)\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.groups","title":"groups","text":"groups: List[Union[ListGroup, InlineGroup, GroupItem]] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.key_value_items","title":"key_value_items","text":"key_value_items: List[KeyValueItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.name","title":"name","text":"name: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.origin","title":"origin","text":"origin: Optional[DocumentOrigin] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.pages","title":"pages","text":"pages: Dict[int, PageItem] = {}\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.pictures","title":"pictures","text":"pictures: List[PictureItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.schema_name","title":"schema_name","text":"schema_name: Literal['DoclingDocument'] = 'DoclingDocument'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.tables","title":"tables","text":"tables: List[TableItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.texts","title":"texts","text":"texts: List[Union[TitleItem, SectionHeaderItem, ListItem, CodeItem, FormulaItem, TextItem]] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.version","title":"version","text":"version: Annotated[str, StringConstraints(pattern=VERSION_PATTERN, strict=True)] = CURRENT_VERSION\n"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_code","title":"add_code","text":"add_code(text: str, code_language: Optional[CodeLanguageLabel] = None, orig: Optional[str] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_code.
Parameters:
text (str) \u2013 str:
code_language (Optional[CodeLanguageLabel], default: None ) \u2013 Optional[CodeLanguageLabel]: (Default value = None)
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
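For illustration, a minimal sketch of add_code on a document built from scratch (the document name and the snippet text are arbitrary examples, not part of the API):
from docling_core.types.doc import DoclingDocument

# Create an empty document and append a code block under its body.
doc = DoclingDocument(name="snippets")
doc.add_code(text="print('hello, docling')")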
add_document(doc: DoclingDocument, parent: Optional[NodeItem] = None) -> None\n Adds the content from the body of a DoclingDocument to this document under a specific parent.
Parameters:
doc (DoclingDocument) \u2013 DoclingDocument: The document whose content will be added
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: The parent NodeItem under which new items are added (Default value = None)
Returns:
None \u2013 None
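A hedged sketch of merging one document into another with add_document (both documents and their contents are made up for illustration):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc_a = DoclingDocument(name="a")
doc_b = DoclingDocument(name="b")
doc_b.add_text(label=DocItemLabel.TEXT, text="A paragraph that originates in doc_b.")

# With parent=None, doc_b's body content is appended under doc_a.body.
doc_a.add_document(doc_b)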
add_form(graph: GraphData, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None)\n add_form.
Parameters:
graph (GraphData) \u2013 GraphData:
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_formula(text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_formula.
Parameters:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_group(label: Optional[GroupLabel] = None, name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n add_group.
Parameters:
label (Optional[GroupLabel], default: None ) \u2013 Optional[GroupLabel]: (Default value = None)
name (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_heading(text: str, orig: Optional[str] = None, level: LevelNumber = 1, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_heading.
Parameters:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
level (LevelNumber, default: 1 ) \u2013 LevelNumber: (Default value = 1)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
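A short sketch combining add_title and add_heading to build a simple outline (names and texts are illustrative only):
from docling_core.types.doc import DoclingDocument

doc = DoclingDocument(name="report")
doc.add_title(text="Quarterly report")           # document title
doc.add_heading(text="Introduction", level=1)    # top-level section header
doc.add_heading(text="Data sources", level=2)    # nested section header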
add_inline_group(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> InlineGroup\n add_inline_group.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_key_values","title":"add_key_values","text":"add_key_values(graph: GraphData, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None)\n add_key_values.
Parameters:
graph (GraphData) \u2013 GraphData:
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_list_group(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> ListGroup\n add_list_group.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_list_item","title":"add_list_item","text":"add_list_item(text: str, enumerated: bool = False, marker: Optional[str] = None, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_list_item.
Parameters:
enumerated (bool, default: False ) \u2013 bool: (Default value = False)
marker (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
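A minimal sketch of building a list: create a ListGroup with add_list_group, then attach items to it via the parent argument (the list name and item texts are arbitrary):
from docling_core.types.doc import DoclingDocument

doc = DoclingDocument(name="lists")
group = doc.add_list_group(name="groceries")   # container for the list items
doc.add_list_item(text="Milk", parent=group)
doc.add_list_item(text="Bread", parent=group)
# enumerated=True and marker="1." can be passed for numbered lists.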
add_node_items(node_items: List[NodeItem], doc: DoclingDocument, parent: Optional[NodeItem] = None) -> None\n Adds multiple NodeItems and their children under a parent in this document.
Parameters:
node_items (List[NodeItem]) \u2013 list[NodeItem]: The NodeItems to be added
doc (DoclingDocument) \u2013 DoclingDocument: The document to which the NodeItems and their children belong
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: The parent NodeItem under which new items are added (Default value = None)
Returns:
None \u2013 None
add_ordered_list(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n add_ordered_list.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_page","title":"add_page","text":"add_page(page_no: int, size: Size, image: Optional[ImageRef] = None) -> PageItem\n add_page.
Parameters:
page_no (int) \u2013 int:
size (Size) \u2013 Size:
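A hedged sketch of registering a page: Size is assumed to be importable from docling_core.types.doc, and the page dimensions below are arbitrary (roughly US Letter in points):
from docling_core.types.doc import DoclingDocument, Size  # Size assumed to be exported here

doc = DoclingDocument(name="paged")
doc.add_page(page_no=1, size=Size(width=612.0, height=792.0))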
add_picture(annotations: Optional[List[PictureDataType]] = None, image: Optional[ImageRef] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None)\n add_picture.
Parameters:
annotations (Optional[List[PictureDataType]], default: None ) \u2013 Optional[List[PictureDataType]]: (Default value = None)
image (Optional[ImageRef], default: None ) \u2013 Optional[ImageRef]: (Default value = None)
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_table(data: TableData, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, label: DocItemLabel = TABLE, content_layer: Optional[ContentLayer] = None, annotations: Optional[list[TableAnnotationType]] = None)\n add_table.
Parameters:
data (TableData) \u2013 TableData:
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
label (DocItemLabel, default: TABLE ) \u2013 DocItemLabel: (Default value = DocItemLabel.TABLE)
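A heavily hedged sketch of add_table: the TableData and TableCell field names used below (num_rows, num_cols, table_cells, the start/end row and column offsets, column_header) are assumptions not spelled out on this page, and both classes are assumed to be importable from docling_core.types.doc:
from docling_core.types.doc import DoclingDocument, TableData, TableCell  # assumed exports

doc = DoclingDocument(name="tables")
# Assumed TableCell layout: text plus start/end row and column offsets (spans default to 1).
cells = [
    TableCell(text="Metric", start_row_offset_idx=0, end_row_offset_idx=1,
              start_col_offset_idx=0, end_col_offset_idx=1, column_header=True),
    TableCell(text="Value", start_row_offset_idx=0, end_row_offset_idx=1,
              start_col_offset_idx=1, end_col_offset_idx=2, column_header=True),
    TableCell(text="Accuracy", start_row_offset_idx=1, end_row_offset_idx=2,
              start_col_offset_idx=0, end_col_offset_idx=1),
    TableCell(text="0.97", start_row_offset_idx=1, end_row_offset_idx=2,
              start_col_offset_idx=1, end_col_offset_idx=2),
]
doc.add_table(data=TableData(num_rows=2, num_cols=2, table_cells=cells))
# add_table_cell (documented next) can then grow the table incrementally, e.g. on doc.tables[-1].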
add_table_cell(table_item: TableItem, cell: TableCell) -> None\n Add a table cell to the table.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_text","title":"add_text","text":"add_text(label: DocItemLabel, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_text.
Parameters:
label (DocItemLabel) \u2013 DocItemLabel:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
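A minimal sketch of add_text, which creates plain text items with an explicit DocItemLabel (the texts are illustrative):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="text-demo")
doc.add_text(label=DocItemLabel.TEXT, text="A body paragraph.")
doc.add_text(label=DocItemLabel.FOOTNOTE, text="A footnote attached to the body.")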
add_title(text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n add_title.
Parameters:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
parent (Optional[NodeItem], default: None ) \u2013 Optional[NodeItem]: (Default value = None)
add_unordered_list(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n add_unordered_list.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.append_child_item","title":"append_child_item","text":"append_child_item(*, child: NodeItem, parent: Optional[NodeItem] = None) -> None\n Adds an item.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.check_version_is_compatible","title":"check_version_is_compatible","text":"check_version_is_compatible(v: str) -> str\n Check if this document version is compatible with SDK schema version.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.concatenate","title":"concatenate","text":"concatenate(docs: Sequence[DoclingDocument]) -> DoclingDocument\n Concatenate multiple documents into a single document.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.delete_items","title":"delete_items","text":"delete_items(*, node_items: List[NodeItem]) -> None\n Deletes an item, given its instance or ref, and any children it has.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.delete_items_range","title":"delete_items_range","text":"delete_items_range(*, start: NodeItem, end: NodeItem, start_inclusive: bool = True, end_inclusive: bool = True) -> None\n Deletes all NodeItems and their children in the range from the start NodeItem to the end NodeItem.
Parameters:
start (NodeItem) \u2013 NodeItem: The starting NodeItem of the range
end (NodeItem) \u2013 NodeItem: The ending NodeItem of the range
start_inclusive (bool, default: True ) \u2013 bool: (Default value = True): If True, the start NodeItem will also be deleted
end_inclusive (bool, default: True ) \u2013 bool: (Default value = True): If True, the end NodeItem will also be deleted
Returns:
None \u2013 None
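A hedged sketch of delete_items_range on a small synthetic document; the start and end items here are direct children of the body, and both ends are removed because the inclusive flags default to True:
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
for i in range(5):
    doc.add_text(label=DocItemLabel.TEXT, text=f"Paragraph {i}")

# Remove the second through fourth paragraphs.
doc.delete_items_range(start=doc.texts[1], end=doc.texts[3])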
export_to_dict(mode: str = 'json', by_alias: bool = True, exclude_none: bool = True, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None) -> Dict[str, Any]\n Export to dict.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(delim: str = '', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True, add_page_index: bool = True, add_table_cell_location: bool = False, add_table_cell_text: bool = True, minified: bool = False, pages: Optional[set[int]] = None) -> str\n Exports the document content to a DocumentToken format.
Operates on a slice of the document's body as defined through arguments from_element and to_element; defaulting to the whole main_text.
Parameters:
delim (str, default: '' ) \u2013 str: (Default value = \"\") Deprecated
from_element (int, default: 0 ) \u2013 int: (Default value = 0)
to_element (int, default: maxsize ) \u2013 int: (Default value = maxsize)
labels (Optional[set[DocItemLabel]], default: None ) \u2013 Optional[set[DocItemLabel]]: (Default value = None)
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
add_page_index (bool, default: True ) \u2013 bool: (Default value = True)
add_table_cell_location (bool, default: False ) \u2013 bool: (Default value = False)
add_table_cell_text (bool, default: True ) \u2013 bool: (Default value = True)
minified (bool, default: False ) \u2013 bool: (Default value = False)
pages (Optional[set[int]], default: None ) \u2013 set[int]: (Default value = None)
Returns:
str \u2013 The content of the document formatted as a DocTags string.
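A short sketch of export_to_doctags on a small in-memory document (the content is illustrative; the argument values simply exercise two of the flags above):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
doc.add_heading(text="Results", level=1)
doc.add_text(label=DocItemLabel.TEXT, text="Accuracy improved noticeably.")

doctags = doc.export_to_doctags(add_page_index=False, minified=True)
print(doctags)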
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_element_tree","title":"export_to_element_tree","text":"export_to_element_tree() -> str\n Export_to_element_tree.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_html","title":"export_to_html","text":"export_to_html(from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, enable_chart_tables: bool = True, image_mode: ImageRefMode = PLACEHOLDER, formula_to_mathml: bool = True, page_no: Optional[int] = None, html_lang: str = 'en', html_head: str = 'null', included_content_layers: Optional[set[ContentLayer]] = None, split_page_view: bool = False, include_annotations: bool = True) -> str\n Serialize to HTML.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(delim: str = '\\n\\n', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, strict_text: bool = False, escape_html: bool = True, escape_underscores: bool = True, image_placeholder: str = '<!-- image -->', enable_chart_tables: bool = True, image_mode: ImageRefMode = PLACEHOLDER, indent: int = 4, text_width: int = -1, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, page_break_placeholder: Optional[str] = None, include_annotations: bool = True, mark_annotations: bool = False, *, use_legacy_annotations: Optional[bool] = None, allowed_meta_names: Optional[set[str]] = None, blocked_meta_names: Optional[set[str]] = None, mark_meta: bool = False) -> str\n Serialize to Markdown.
Operates on a slice of the document's body as defined through arguments from_element and to_element; defaulting to the whole document.
Parameters:
delim (str, default: '\\n\\n' ) \u2013 Deprecated.
from_element (int, default: 0 ) \u2013 Body slicing start index (inclusive). (Default value = 0).
to_element (int, default: maxsize ) \u2013 Body slicing stop index (exclusive). (Default value = maxint).
labels (Optional[set[DocItemLabel]], default: None ) \u2013 The set of document labels to include in the export. None falls back to the system-defined default.
strict_text (bool, default: False ) \u2013 Deprecated.
escape_html (bool, default: True ) \u2013 bool: Whether to escape HTML reserved characters in the text content of the document. (Default value = True).
escape_underscores (bool, default: True ) \u2013 bool: Whether to escape underscores in the text content of the document. (Default value = True).
image_placeholder (str, default: '<!-- image -->' ) \u2013 The placeholder to include to position images in the markdown. (Default value = "<!-- image -->").
image_mode (ImageRefMode, default: PLACEHOLDER ) \u2013 The mode to use for including images in the markdown. (Default value = ImageRefMode.PLACEHOLDER).
indent (int, default: 4 ) \u2013 The indent in spaces of the nested lists. (Default value = 4).
included_content_layers (Optional[set[ContentLayer]], default: None ) \u2013 The set of content layers to include in the export. None falls back to the system-defined default.
page_break_placeholder (Optional[str], default: None ) \u2013 The placeholder to include for marking page breaks. None means no page break placeholder will be used.
include_annotations (bool, default: True ) \u2013 bool: Whether to include annotations in the export; only considered if item does not have meta. (Default value = True).
mark_annotations (bool, default: False ) \u2013 bool: Whether to mark annotations in the export; only considered if item does not have meta. (Default value = False).
use_legacy_annotations (Optional[bool], default: None ) \u2013 bool: Deprecated; legacy annotations considered only when meta not present.
mark_meta (bool, default: False ) \u2013 bool: Whether to mark meta in the export
allowed_meta_names (Optional[set[str]], default: None ) \u2013 Optional[set[str]]: Meta names to allow; None means all meta names are allowed.
blocked_meta_names (Optional[set[str]], default: None ) \u2013 Optional[set[str]]: Meta names to block; takes precedence over allowed_meta_names.
Returns:
str \u2013 The exported Markdown representation.
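A sketch of export_to_markdown using a few of the options above; the placeholder strings are arbitrary examples:
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
doc.add_title(text="Demo")
doc.add_text(label=DocItemLabel.TEXT, text="Some text with under_scores.")

md = doc.export_to_markdown(
    escape_underscores=False,
    image_placeholder="",                          # drop image placeholders entirely
    page_break_placeholder="<!-- page break -->",
)
print(md)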
export_to_text(delim: str = '\\n\\n', from_element: int = 0, to_element: int = 1000000, labels: Optional[set[DocItemLabel]] = None) -> str\n export_to_text.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.extract_items_range","title":"extract_items_range","text":"extract_items_range(*, start: NodeItem, end: NodeItem, start_inclusive: bool = True, end_inclusive: bool = True, delete: bool = False) -> DoclingDocument\n Extracts NodeItems and children in the range from the start NodeItem to the end as a new DoclingDocument.
Parameters:
start (NodeItem) \u2013 NodeItem: The starting NodeItem of the range (must be a direct child of the document body)
end (NodeItem) \u2013 NodeItem: The ending NodeItem of the range (must be a direct child of the document body)
start_inclusive (bool, default: True ) \u2013 bool: (Default value = True): If True, the start NodeItem will also be extracted
end_inclusive (bool, default: True ) \u2013 bool: (Default value = True): If True, the end NodeItem will also be extracted
delete (bool, default: False ) \u2013 bool: (Default value = False): If True, extracted items are deleted in the original document
Returns:
DoclingDocument \u2013 DoclingDocument: A new document containing the extracted NodeItems and their children
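A hedged sketch of extract_items_range; as stated above, start and end must be direct children of the document body, which holds for the paragraphs created here:
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
for i in range(5):
    doc.add_text(label=DocItemLabel.TEXT, text=f"Paragraph {i}")

# Copy the second through fourth paragraphs into a new document; delete=False keeps the original intact.
excerpt = doc.extract_items_range(start=doc.texts[1], end=doc.texts[3], delete=False)
print(excerpt.export_to_markdown())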
filter(page_nrs: Optional[set[int]] = None) -> DoclingDocument\n Create a new document based on the provided filter parameters.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.get_visualization","title":"get_visualization","text":"get_visualization(show_label: bool = True, show_branch_numbering: bool = False, viz_mode: Literal['reading_order', 'key_value'] = 'reading_order', show_cell_id: bool = False) -> dict[Optional[int], Image]\n Get visualization of the document as images by page.
Parameters:
show_label (bool, default: True ) \u2013 Show labels on elements (applies to all visualizers).
show_branch_numbering (bool, default: False ) \u2013 Show branch numbering (reading order visualizer only).
viz_mode (Literal['reading_order', 'key_value'], default: 'reading_order' ) \u2013 Which visualizer to use. One of 'reading_order' (default), 'key_value'.
show_cell_id (bool, default: False ) \u2013 Show cell IDs (key value visualizer only).
Returns:
dict[Optional[int], PILImage.Image] \u2013 Dictionary mapping page numbers to PIL images.
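A hedged sketch of get_visualization; it assumes doc is an existing DoclingDocument whose pages carry images (for example, one produced by a PDF conversion with page images enabled), and the output filenames are hypothetical:
# `doc` is assumed to be a DoclingDocument with page images available.
images = doc.get_visualization(show_label=True, viz_mode="reading_order")
for page_no, image in images.items():
    image.save(f"viz_page_{page_no}.png")  # hypothetical output path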
insert_code(sibling: NodeItem, text: str, code_language: Optional[CodeLanguageLabel] = None, orig: Optional[str] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> CodeItem\n Creates a new CodeItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
code_language (Optional[CodeLanguageLabel], default: None ) \u2013 Optional[CodeLanguageLabel]: (Default value = None)
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
CodeItem \u2013 CodeItem: The newly created CodeItem item.
insert_document(doc: DoclingDocument, sibling: NodeItem, after: bool = True) -> None\n Inserts the content from the body of a DoclingDocument into this document at a specific position.
Parameters:
doc (DoclingDocument) \u2013 DoclingDocument: The document whose content will be inserted
sibling (NodeItem) \u2013 NodeItem: The NodeItem after/before which the new items will be inserted
after (bool, default: True ) \u2013 bool: If True, insert after the sibling; if False, insert before (Default value = True)
Returns:
None \u2013 None
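A sketch contrasting insert_document with add_document: instead of appending at the end of the body, the content is spliced in next to an existing sibling (documents and texts are illustrative):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc_a = DoclingDocument(name="a")
doc_a.add_text(label=DocItemLabel.TEXT, text="First paragraph of A.")
doc_a.add_text(label=DocItemLabel.TEXT, text="Second paragraph of A.")

doc_b = DoclingDocument(name="b")
doc_b.add_text(label=DocItemLabel.TEXT, text="Interlude taken from B.")

# Insert B's body content right after the first paragraph of A.
doc_a.insert_document(doc_b, sibling=doc_a.texts[0], after=True)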
insert_form(sibling: NodeItem, graph: GraphData, prov: Optional[ProvenanceItem] = None, after: bool = True) -> FormItem\n Creates a new FormItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
graph (GraphData) \u2013 GraphData:
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
FormItem \u2013 FormItem: The newly created FormItem item.
insert_formula(sibling: NodeItem, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> FormulaItem\n Creates a new FormulaItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
FormulaItem \u2013 FormulaItem: The newly created FormulaItem item.
insert_group(sibling: NodeItem, label: Optional[GroupLabel] = None, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> GroupItem\n Creates a new GroupItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
label (Optional[GroupLabel], default: None ) \u2013 Optional[GroupLabel]: (Default value = None)
name (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
GroupItem \u2013 GroupItem: The newly created GroupItem.
insert_heading(sibling: NodeItem, text: str, orig: Optional[str] = None, level: LevelNumber = 1, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> SectionHeaderItem\n Creates a new SectionHeaderItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
level (LevelNumber, default: 1 ) \u2013 LevelNumber: (Default value = 1)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
SectionHeaderItem \u2013 SectionHeaderItem: The newly created SectionHeaderItem item.
insert_inline_group(sibling: NodeItem, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> InlineGroup\n Creates a new InlineGroup item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
name (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
InlineGroup \u2013 InlineGroup: The newly created InlineGroup item.
insert_item_after_sibling(*, new_item: NodeItem, sibling: NodeItem) -> None\n Inserts an item, given its node_item instance, after other as a sibling.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_item_before_sibling","title":"insert_item_before_sibling","text":"insert_item_before_sibling(*, new_item: NodeItem, sibling: NodeItem) -> None\n Inserts an item, given its node_item instance, before other as a sibling.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_key_values","title":"insert_key_values","text":"insert_key_values(sibling: NodeItem, graph: GraphData, prov: Optional[ProvenanceItem] = None, after: bool = True) -> KeyValueItem\n Creates a new KeyValueItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
graph (GraphData) \u2013 GraphData:
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
KeyValueItem \u2013 KeyValueItem: The newly created KeyValueItem item.
insert_list_group(sibling: NodeItem, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> ListGroup\n Creates a new ListGroup item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
name (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
ListGroup \u2013 ListGroup: The newly created ListGroup item.
insert_list_item(sibling: NodeItem, text: str, enumerated: bool = False, marker: Optional[str] = None, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> ListItem\n Creates a new ListItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
enumerated (bool, default: False ) \u2013 bool: (Default value = False)
marker (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
ListItem \u2013 ListItem: The newly created ListItem item.
insert_node_items(sibling: NodeItem, node_items: List[NodeItem], doc: DoclingDocument, after: bool = True) -> None\n Insert multiple NodeItems and their children at a specific position in the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem: The NodeItem after/before which the new items will be inserted
node_items (List[NodeItem]) \u2013 list[NodeItem]: The NodeItems to be inserted
doc (DoclingDocument) \u2013 DoclingDocument: The document to which the NodeItems and their children belong
after (bool, default: True ) \u2013 bool: If True, insert after the sibling; if False, insert before (Default value = True)
Returns:
None \u2013 None
insert_picture(sibling: NodeItem, annotations: Optional[List[PictureDataType]] = None, image: Optional[ImageRef] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> PictureItem\n Creates a new PictureItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
annotations (Optional[List[PictureDataType]], default: None ) \u2013 Optional[List[PictureDataType]]: (Default value = None)
image (Optional[ImageRef], default: None ) \u2013 Optional[ImageRef]: (Default value = None)
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
PictureItem \u2013 PictureItem: The newly created PictureItem item.
insert_table(sibling: NodeItem, data: TableData, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, label: DocItemLabel = TABLE, content_layer: Optional[ContentLayer] = None, annotations: Optional[list[TableAnnotationType]] = None, after: bool = True) -> TableItem\n Creates a new TableItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
data (TableData) \u2013 TableData:
caption (Optional[Union[TextItem, RefItem]], default: None ) \u2013 Optional[Union[TextItem, RefItem]]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
label (DocItemLabel, default: TABLE ) \u2013 DocItemLabel: (Default value = DocItemLabel.TABLE)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
annotations (Optional[list[TableAnnotationType]], default: None ) \u2013 Optional[List[TableAnnotationType]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
TableItem \u2013 TableItem: The newly created TableItem item.
insert_text(sibling: NodeItem, label: DocItemLabel, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> TextItem\n Creates a new TextItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
label (DocItemLabel) \u2013 DocItemLabel:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
TextItem \u2013 TextItem: The newly created TextItem item.
insert_title(sibling: NodeItem, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> TitleItem\n Creates a new TitleItem item and inserts it into the document.
Parameters:
sibling (NodeItem) \u2013 NodeItem:
text (str) \u2013 str:
orig (Optional[str], default: None ) \u2013 Optional[str]: (Default value = None)
prov (Optional[ProvenanceItem], default: None ) \u2013 Optional[ProvenanceItem]: (Default value = None)
content_layer (Optional[ContentLayer], default: None ) \u2013 Optional[ContentLayer]: (Default value = None)
formatting (Optional[Formatting], default: None ) \u2013 Optional[Formatting]: (Default value = None)
hyperlink (Optional[Union[AnyUrl, Path]], default: None ) \u2013 Optional[Union[AnyUrl, Path]]: (Default value = None)
after (bool, default: True ) \u2013 bool: (Default value = True)
Returns:
TitleItem \u2013 TitleItem: The newly created TitleItem item.
iterate_items(root: Optional[NodeItem] = None, with_groups: bool = False, traverse_pictures: bool = False, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, _level: int = 0) -> Iterable[Tuple[NodeItem, int]]\n Iterate elements with level.
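A minimal sketch of iterate_items, walking the document tree in reading order and printing each item with its nesting level (the small document is built inline for illustration):
from docling_core.types.doc import DoclingDocument, DocItemLabel

doc = DoclingDocument(name="demo")
doc.add_heading(text="Section", level=1)
doc.add_text(label=DocItemLabel.TEXT, text="A paragraph in the section.")

for item, level in doc.iterate_items(with_groups=True):
    print("  " * level, type(item).__name__)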
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_doctags","title":"load_from_doctags","text":"load_from_doctags(doctag_document: DocTagsDocument, document_name: str = 'Document') -> DoclingDocument\n Load Docling document from lists of DocTags and Images.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_json","title":"load_from_json","text":"load_from_json(filename: Union[str, Path]) -> DoclingDocument\n load_from_json.
Parameters:
filename (Union[str, Path]) \u2013 The filename to load a saved DoclingDocument from a .json.
Returns:
DoclingDocument \u2013 The loaded DoclingDocument.
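A round-trip sketch pairing save_as_json (documented further below) with load_from_json; the file name is a hypothetical example, and load_from_json is called here as a classmethod:
from docling_core.types.doc import DoclingDocument

doc = DoclingDocument(name="demo")
doc.add_title(text="Round trip")

doc.save_as_json("roundtrip.json")                           # hypothetical output path
restored = DoclingDocument.load_from_json("roundtrip.json")
assert restored.name == "demo"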
load_from_yaml(filename: Union[str, Path]) -> DoclingDocument\n load_from_yaml.
Parameters:
filename (Union[str, Path]) \u2013 The filename to load a YAML-serialized DoclingDocument from.
Returns:
DoclingDocument \u2013 The loaded DoclingDocument.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.num_pages","title":"num_pages","text":"num_pages()\n num_pages.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.print_element_tree","title":"print_element_tree","text":"print_element_tree()\n Print_element_tree.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.replace_item","title":"replace_item","text":"replace_item(*, new_item: NodeItem, old_item: NodeItem) -> None\n Replace item with new item.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_doctags","title":"save_as_doctags","text":"save_as_doctags(filename: Union[str, Path], delim: str = '', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True, add_page_index: bool = True, add_table_cell_location: bool = False, add_table_cell_text: bool = True, minified: bool = False)\n Save the document content to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_document_tokens","title":"save_as_document_tokens","text":"save_as_document_tokens(*args, **kwargs)\n Save the document content to a DocumentToken format.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_html","title":"save_as_html","text":"save_as_html(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, image_mode: ImageRefMode = PLACEHOLDER, formula_to_mathml: bool = True, page_no: Optional[int] = None, html_lang: str = 'en', html_head: str = 'null', included_content_layers: Optional[set[ContentLayer]] = None, split_page_view: bool = False, include_annotations: bool = True)\n Save to HTML.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_json","title":"save_as_json","text":"save_as_json(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, image_mode: ImageRefMode = EMBEDDED, indent: int = 2, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None)\n Save as json.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_markdown","title":"save_as_markdown","text":"save_as_markdown(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, delim: str = '\\n\\n', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, strict_text: bool = False, escape_html: bool = True, escaping_underscores: bool = True, image_placeholder: str = '<!-- image -->', image_mode: ImageRefMode = PLACEHOLDER, indent: int = 4, text_width: int = -1, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, page_break_placeholder: Optional[str] = None, include_annotations: bool = True, *, mark_meta: bool = False, use_legacy_annotations: Optional[bool] = None)\n Save to markdown.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_yaml","title":"save_as_yaml","text":"save_as_yaml(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, image_mode: ImageRefMode = EMBEDDED, default_flow_style: bool = False, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None)\n Save as yaml.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.transform_to_content_layer","title":"transform_to_content_layer","text":"transform_to_content_layer(data: Any) -> Any\n transform_to_content_layer.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_document","title":"validate_document","text":"validate_document() -> Self\n validate_document.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_misplaced_list_items","title":"validate_misplaced_list_items","text":"validate_misplaced_list_items()\n validate_misplaced_list_items.
"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_tree","title":"validate_tree","text":"validate_tree(root: NodeItem) -> bool\n validate_tree.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin","title":"DocumentOrigin","text":" Bases: BaseModel
FileSource.
Methods:
parse_hex_string \u2013 parse_hex_string.
validate_mimetype \u2013 validate_mimetype.
Attributes:
binary_hash (Uint64) \u2013 filename (str) \u2013 mimetype (str) \u2013 uri (Optional[AnyUrl]) \u2013 binary_hash: Uint64\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.filename","title":"filename","text":"filename: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.mimetype","title":"mimetype","text":"mimetype: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.uri","title":"uri","text":"uri: Optional[AnyUrl] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.parse_hex_string","title":"parse_hex_string","text":"parse_hex_string(value)\n parse_hex_string.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.validate_mimetype","title":"validate_mimetype","text":"validate_mimetype(v)\n validate_mimetype.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem","title":"DocItem","text":" Bases: NodeItem
DocItem.
Methods:
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image of this DocItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 label (DocItemLabel) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 self_ref (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.label","title":"label","text":"label: DocItemLabel\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel","title":"DocItemLabel","text":" Bases: str, Enum
DocItemLabel.
Methods:
get_color \u2013 Return the RGB color associated with a given label.
Attributes:
CAPTION \u2013 CHART \u2013 CHECKBOX_SELECTED \u2013 CHECKBOX_UNSELECTED \u2013 CODE \u2013 DOCUMENT_INDEX \u2013 EMPTY_VALUE \u2013 FOOTNOTE \u2013 FORM \u2013 FORMULA \u2013 GRADING_SCALE \u2013 HANDWRITTEN_TEXT \u2013 KEY_VALUE_REGION \u2013 LIST_ITEM \u2013 PAGE_FOOTER \u2013 PAGE_HEADER \u2013 PARAGRAPH \u2013 PICTURE \u2013 REFERENCE \u2013 SECTION_HEADER \u2013 TABLE \u2013 TEXT \u2013 TITLE \u2013 CAPTION = 'caption'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHART","title":"CHART","text":"CHART = 'chart'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHECKBOX_SELECTED","title":"CHECKBOX_SELECTED","text":"CHECKBOX_SELECTED = 'checkbox_selected'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHECKBOX_UNSELECTED","title":"CHECKBOX_UNSELECTED","text":"CHECKBOX_UNSELECTED = 'checkbox_unselected'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CODE","title":"CODE","text":"CODE = 'code'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.DOCUMENT_INDEX","title":"DOCUMENT_INDEX","text":"DOCUMENT_INDEX = 'document_index'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.EMPTY_VALUE","title":"EMPTY_VALUE","text":"EMPTY_VALUE = 'empty_value'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FOOTNOTE","title":"FOOTNOTE","text":"FOOTNOTE = 'footnote'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FORM","title":"FORM","text":"FORM = 'form'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FORMULA","title":"FORMULA","text":"FORMULA = 'formula'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.GRADING_SCALE","title":"GRADING_SCALE","text":"GRADING_SCALE = 'grading_scale'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.HANDWRITTEN_TEXT","title":"HANDWRITTEN_TEXT","text":"HANDWRITTEN_TEXT = 'handwritten_text'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.KEY_VALUE_REGION","title":"KEY_VALUE_REGION","text":"KEY_VALUE_REGION = 'key_value_region'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.LIST_ITEM","title":"LIST_ITEM","text":"LIST_ITEM = 'list_item'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PAGE_FOOTER","title":"PAGE_FOOTER","text":"PAGE_FOOTER = 'page_footer'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PAGE_HEADER","title":"PAGE_HEADER","text":"PAGE_HEADER = 'page_header'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PARAGRAPH","title":"PARAGRAPH","text":"PARAGRAPH = 'paragraph'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PICTURE","title":"PICTURE","text":"PICTURE = 'picture'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.REFERENCE","title":"REFERENCE","text":"REFERENCE = 'reference'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.SECTION_HEADER","title":"SECTION_HEADER","text":"SECTION_HEADER = 'section_header'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TABLE","title":"TABLE","text":"TABLE = 'table'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TEXT","title":"TEXT","text":"TEXT = 
'text'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TITLE","title":"TITLE","text":"TITLE = 'title'\n"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.get_color","title":"get_color","text":"get_color(label: DocItemLabel) -> Tuple[int, int, int]\n Return the RGB color associated with a given label.
"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem","title":"ProvenanceItem","text":" Bases: BaseModel
ProvenanceItem.
Attributes:
bbox (BoundingBox) \u2013 charspan (Tuple[int, int]) \u2013 page_no (int) \u2013 bbox: BoundingBox\n"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.charspan","title":"charspan","text":"charspan: Tuple[int, int]\n"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.page_no","title":"page_no","text":"page_no: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem","title":"GroupItem","text":" Bases: NodeItem
GroupItem.
Methods:
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 label (GroupLabel) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 name (str) \u2013 parent (Optional[RefItem]) \u2013 self_ref (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.label","title":"label","text":"label: GroupLabel = UNSPECIFIED\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.name","title":"name","text":"name: str = 'group'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel","title":"GroupLabel","text":" Bases: str, Enum
GroupLabel.
Attributes:
CHAPTER \u2013 COMMENT_SECTION \u2013 FORM_AREA \u2013 INLINE \u2013 KEY_VALUE_AREA \u2013 LIST \u2013 ORDERED_LIST \u2013 PICTURE_AREA \u2013 SECTION \u2013 SHEET \u2013 SLIDE \u2013 UNSPECIFIED \u2013 CHAPTER = 'chapter'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.COMMENT_SECTION","title":"COMMENT_SECTION","text":"COMMENT_SECTION = 'comment_section'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.FORM_AREA","title":"FORM_AREA","text":"FORM_AREA = 'form_area'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.INLINE","title":"INLINE","text":"INLINE = 'inline'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.KEY_VALUE_AREA","title":"KEY_VALUE_AREA","text":"KEY_VALUE_AREA = 'key_value_area'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.LIST","title":"LIST","text":"LIST = 'list'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.ORDERED_LIST","title":"ORDERED_LIST","text":"ORDERED_LIST = 'ordered_list'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.PICTURE_AREA","title":"PICTURE_AREA","text":"PICTURE_AREA = 'picture_area'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SECTION","title":"SECTION","text":"SECTION = 'section'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SHEET","title":"SHEET","text":"SHEET = 'sheet'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SLIDE","title":"SLIDE","text":"SLIDE = 'slide'\n"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.UNSPECIFIED","title":"UNSPECIFIED","text":"UNSPECIFIED = 'unspecified'\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem","title":"NodeItem","text":" Bases: BaseModel
NodeItem.
Methods:
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 self_ref (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem","title":"PageItem","text":" Bases: BaseModel
PageItem.
Attributes:
image (Optional[ImageRef]) \u2013 page_no (int) \u2013 size (Size) \u2013 image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.page_no","title":"page_no","text":"page_no: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.size","title":"size","text":"size: Size\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem","title":"FloatingItem","text":" Bases: DocItem
FloatingItem.
Methods:
caption_text \u2013 Computes the caption as a single text.
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
captions (List[RefItem]) \u2013 children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 footnotes (List[RefItem]) \u2013 image (Optional[ImageRef]) \u2013 label (DocItemLabel) \u2013 meta (Optional[FloatingMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 references (List[RefItem]) \u2013 self_ref (str) \u2013 captions: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.children","title":"children","text":"children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.image","title":"image","text":"image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.label","title":"label","text":"label: DocItemLabel\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.meta","title":"meta","text":"meta: Optional[FloatingMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.references","title":"references","text":"references: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem","title":"TextItem","text":" Bases: DocItem
TextItem.
Methods:
export_to_doctags \u2013 Export text element to document tokens format.
export_to_document_tokens \u2013 Export to DocTags format.
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image of this DocItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 formatting (Optional[Formatting]) \u2013 hyperlink (Optional[Union[AnyUrl, Path]]) \u2013 label (Literal[CAPTION, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FOOTNOTE, PAGE_FOOTER, PAGE_HEADER, PARAGRAPH, REFERENCE, TEXT, EMPTY_VALUE]) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 orig (str) \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 self_ref (str) \u2013 text (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.formatting","title":"formatting","text":"formatting: Optional[Formatting] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.hyperlink","title":"hyperlink","text":"hyperlink: Optional[Union[AnyUrl, Path]] = Field(union_mode='left_to_right', default=None)\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.label","title":"label","text":"label: Literal[CAPTION, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FOOTNOTE, PAGE_FOOTER, PAGE_HEADER, PARAGRAPH, REFERENCE, TEXT, EMPTY_VALUE]\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.orig","title":"orig","text":"orig: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.text","title":"text","text":"text: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n Export text element to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
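A hedged sketch of this call, where doc is a DoclingDocument and item is one of its text items; the return value is assumed here to be the DocTags string for the element:

doctags = item.export_to_doctags(
    doc,
    xsize=500,          # width of the location token grid (default)
    ysize=500,          # height of the location token grid (default)
    add_location=True,  # include location tokens
    add_content=True,   # include the text content
)
print(doctags)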
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem","title":"TableItem","text":" Bases: FloatingItem
TableItem.
Methods:
add_annotation \u2013 Add an annotation to the table.
caption_text \u2013 Computes the caption as a single text.
export_to_dataframe \u2013 Export the table as a Pandas DataFrame.
export_to_doctags \u2013 Export table to document tokens format.
export_to_document_tokens \u2013 Export to DocTags format.
export_to_html \u2013 Export the table as html.
export_to_markdown \u2013 Export the table as markdown.
export_to_otsl \u2013 Export the table as OTSL.
get_annotations \u2013 Get the annotations of this TableItem.
get_image \u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
annotations (Annotated[List[TableAnnotationType], deprecated('Field `annotations` is deprecated; use `meta` instead.')]) \u2013 captions (List[RefItem]) \u2013 children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 data (TableData) \u2013 footnotes (List[RefItem]) \u2013 image (Optional[ImageRef]) \u2013 label (Literal[DOCUMENT_INDEX, TABLE]) \u2013 meta (Optional[FloatingMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 references (List[RefItem]) \u2013 self_ref (str) \u2013 annotations: Annotated[List[TableAnnotationType], deprecated('Field `annotations` is deprecated; use `meta` instead.')] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.captions","title":"captions","text":"captions: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.children","title":"children","text":"children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.data","title":"data","text":"data: TableData\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.image","title":"image","text":"image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.label","title":"label","text":"label: Literal[DOCUMENT_INDEX, TABLE] = TABLE\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.meta","title":"meta","text":"meta: Optional[FloatingMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.references","title":"references","text":"references: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.add_annotation","title":"add_annotation","text":"add_annotation(annotation: TableAnnotationType) -> None\n Add an annotation to the table.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_dataframe","title":"export_to_dataframe","text":"export_to_dataframe(doc: Optional[DoclingDocument] = None) -> DataFrame\n Export the table as a Pandas DataFrame.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_cell_location: bool = True, add_cell_text: bool = True, add_caption: bool = True)\n Export table to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_cell_location (bool, default: True ) \u2013 bool: (Default value = True)
add_cell_text (bool, default: True ) \u2013 bool: (Default value = True)
add_caption (bool, default: True ) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_html","title":"export_to_html","text":"export_to_html(doc: Optional[DoclingDocument] = None, add_caption: bool = True) -> str\n Export the table as html.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(doc: Optional[DoclingDocument] = None) -> str\n Export the table as markdown.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_otsl","title":"export_to_otsl","text":"export_to_otsl(doc: DoclingDocument, add_cell_location: bool = True, add_cell_text: bool = True, xsize: int = 500, ysize: int = 500, **kwargs: Any) -> str\n Export the table as OTSL.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this TableItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell","title":"TableCell","text":" Bases: BaseModel
TableCell.
Methods:
from_dict_format \u2013 from_dict_format.
Attributes:
bbox (Optional[BoundingBox]) \u2013 col_span (int) \u2013 column_header (bool) \u2013 end_col_offset_idx (int) \u2013 end_row_offset_idx (int) \u2013 fillable (bool) \u2013 row_header (bool) \u2013 row_section (bool) \u2013 row_span (int) \u2013 start_col_offset_idx (int) \u2013 start_row_offset_idx (int) \u2013 text (str) \u2013 bbox: Optional[BoundingBox] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.col_span","title":"col_span","text":"col_span: int = 1\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.column_header","title":"column_header","text":"column_header: bool = False\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.end_col_offset_idx","title":"end_col_offset_idx","text":"end_col_offset_idx: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.end_row_offset_idx","title":"end_row_offset_idx","text":"end_row_offset_idx: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.fillable","title":"fillable","text":"fillable: bool = False\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_header","title":"row_header","text":"row_header: bool = False\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_section","title":"row_section","text":"row_section: bool = False\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_span","title":"row_span","text":"row_span: int = 1\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.start_col_offset_idx","title":"start_col_offset_idx","text":"start_col_offset_idx: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.start_row_offset_idx","title":"start_row_offset_idx","text":"start_row_offset_idx: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.text","title":"text","text":"text: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.from_dict_format","title":"from_dict_format","text":"from_dict_format(data: Any) -> Any\n from_dict_format.
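A small construction sketch, assuming TableCell is imported from docling_core.types.doc; the values are illustrative only:

from docling_core.types.doc import TableCell

cell = TableCell(
    text="Revenue",
    row_span=1,
    col_span=2,                 # header cell spanning two columns
    start_row_offset_idx=0,
    end_row_offset_idx=1,
    start_col_offset_idx=0,
    end_col_offset_idx=2,
    column_header=True,
)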
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData","title":"TableData","text":" Bases: BaseModel
TableData.
Methods:
add_row \u2013 Add a new row to the table from a list of strings.
add_rows \u2013 Add multiple new rows to the table from a list of lists of strings.
get_column_bounding_boxes \u2013 Get the minimal bounding box for each column in the table.
get_row_bounding_boxes \u2013 Get the minimal bounding box for each row in the table.
insert_row \u2013 Insert a new row from a list of strings before/after a specific index in the table.
insert_rows \u2013 Insert multiple new rows from a list of lists of strings before/after a specific index in the table.
pop_row \u2013 Remove and return the last row from the table.
remove_row \u2013 Remove a row from the table by its index.
remove_rows \u2013 Remove rows from the table by their indices.
Attributes:
grid (List[List[TableCell]]) \u2013 grid.
num_cols (int) \u2013 num_rows (int) \u2013 table_cells (List[AnyTableCell]) \u2013 grid: List[List[TableCell]]\n grid.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.num_cols","title":"num_cols","text":"num_cols: int = 0\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.num_rows","title":"num_rows","text":"num_rows: int = 0\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.table_cells","title":"table_cells","text":"table_cells: List[AnyTableCell] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.add_row","title":"add_row","text":"add_row(row: List[str]) -> None\n Add a new row to the table from a list of strings.
Parameters:
row (List[str]) \u2013 List[str]: A list of strings representing the content of the new row.
Returns:
None \u2013 None
add_rows(rows: List[List[str]]) -> None\n Add multiple new rows to the table from a list of lists of strings.
Parameters:
rows (List[List[str]]) \u2013 List[List[str]]: A list of lists, where each inner list represents the content of a new row.
Returns:
None \u2013 None
get_column_bounding_boxes() -> dict[int, BoundingBox]\n Get the minimal bounding box for each column in the table.
Returns: dict[int, BoundingBox]: A mapping from column index to the minimal bounding box that encompasses all cells in that column that carry bounding boxes.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.get_row_bounding_boxes","title":"get_row_bounding_boxes","text":"get_row_bounding_boxes() -> dict[int, BoundingBox]\n Get the minimal bounding box for each row in the table.
Returns: dict[int, BoundingBox]: A mapping from row index to the minimal bounding box that encompasses all cells in that row that carry bounding boxes.
"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.insert_row","title":"insert_row","text":"insert_row(row_index: int, row: List[str], after: bool = False) -> None\n Insert a new row from a list of strings before/after a specific index in the table.
Parameters:
row_index (int) \u2013 int: The index at which to insert the new row. (Starting from 0)
row (List[str]) \u2013 List[str]: A list of strings representing the content of the new row.
after (bool, default: False ) \u2013 bool: If True, insert the row after the specified index, otherwise before it. (Default is False)
Returns:
None \u2013 None
insert_rows(row_index: int, rows: List[List[str]], after: bool = False) -> None\n Insert multiple new rows from a list of lists of strings before/after a specific index in the table.
Parameters:
row_index (int) \u2013 int: The index at which to insert the new rows. (Starting from 0)
rows (List[List[str]]) \u2013 List[List[str]]: A list of lists, where each inner list represents the content of a new row.
after (bool, default: False ) \u2013 bool: If True, insert the rows after the specified index, otherwise before it. (Default is False)
Returns:
None \u2013 None
pop_row(doc: Optional[DoclingDocument] = None) -> List[TableCell]\n Remove and return the last row from the table.
Returns:
List[TableCell] \u2013 List[TableCell]: A list of TableCell objects representing the popped row.
remove_row(row_index: int, doc: Optional[DoclingDocument] = None) -> List[TableCell]\n Remove a row from the table by its index.
Parameters:
row_index (int) \u2013 int: The index of the row to remove. (Starting from 0)
Returns:
List[TableCell] \u2013 List[TableCell]: A list of TableCell objects representing the removed row.
remove_rows(indices: List[int], doc: Optional[DoclingDocument] = None) -> List[List[TableCell]]\n Remove rows from the table by their indices.
Parameters:
indices (List[int]) \u2013 List[int]: A list of indices of the rows to remove. (Starting from 0)
Returns:
List[List[TableCell]] \u2013 List[List[TableCell]]: A list representation of the removed rows as lists of TableCell objects.
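A hedged sketch of the row helpers above, assuming an empty TableData can be created and then populated through them:

from docling_core.types.doc import TableData

data = TableData(num_rows=0, num_cols=2)
data.add_row(["Name", "Score"])                       # single row from strings
data.add_rows([["alpha", "0.9"], ["beta", "0.7"]])    # several rows at once
data.insert_row(1, ["gamma", "0.8"], after=False)     # insert before index 1
removed = data.remove_row(2)                          # returns the removed cells
last = data.pop_row()                                 # removes and returns the last row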
"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel","title":"TableCellLabel","text":" Bases: str, Enum
TableCellLabel.
Methods:
get_color \u2013 Return the RGB color associated with a given label.
Attributes:
BODY \u2013 COLUMN_HEADER \u2013 ROW_HEADER \u2013 ROW_SECTION \u2013 BODY = 'body'\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.COLUMN_HEADER","title":"COLUMN_HEADER","text":"COLUMN_HEADER = 'col_header'\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.ROW_HEADER","title":"ROW_HEADER","text":"ROW_HEADER = 'row_header'\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.ROW_SECTION","title":"ROW_SECTION","text":"ROW_SECTION = 'row_section'\n"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.get_color","title":"get_color","text":"get_color(label: TableCellLabel) -> Tuple[int, int, int]\n Return the RGB color associated with a given label.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem","title":"KeyValueItem","text":" Bases: FloatingItem
KeyValueItem.
Methods:
caption_text \u2013 Computes the caption as a single text.
export_to_document_tokens \u2013 Export key value item to document tokens format.
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
captions (List[RefItem]) \u2013 children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 footnotes (List[RefItem]) \u2013 graph (GraphData) \u2013 image (Optional[ImageRef]) \u2013 label (Literal[KEY_VALUE_REGION]) \u2013 meta (Optional[FloatingMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 references (List[RefItem]) \u2013 self_ref (str) \u2013 captions: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.children","title":"children","text":"children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.graph","title":"graph","text":"graph: GraphData\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.image","title":"image","text":"image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.label","title":"label","text":"label: Literal[KEY_VALUE_REGION] = KEY_VALUE_REGION\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.meta","title":"meta","text":"meta: Optional[FloatingMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.references","title":"references","text":"references: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.export_to_document_tokens","title":"export_to_document_tokens","text":"export_to_document_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n Export key value item to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem","title":"SectionHeaderItem","text":" Bases: TextItem
SectionHeaderItem.
Methods:
export_to_doctags \u2013 Export text element to document tokens format.
export_to_document_tokens \u2013 Export to DocTags format.
get_annotations \u2013 Get the annotations of this DocItem.
get_image \u2013 Returns the image of this DocItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 formatting (Optional[Formatting]) \u2013 hyperlink (Optional[Union[AnyUrl, Path]]) \u2013 label (Literal[SECTION_HEADER]) \u2013 level (LevelNumber) \u2013 meta (Optional[BaseMeta]) \u2013 model_config \u2013 orig (str) \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 self_ref (str) \u2013 text (str) \u2013 children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.formatting","title":"formatting","text":"formatting: Optional[Formatting] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.hyperlink","title":"hyperlink","text":"hyperlink: Optional[Union[AnyUrl, Path]] = Field(union_mode='left_to_right', default=None)\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.label","title":"label","text":"label: Literal[SECTION_HEADER] = SECTION_HEADER\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.level","title":"level","text":"level: LevelNumber = 1\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.meta","title":"meta","text":"meta: Optional[BaseMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.orig","title":"orig","text":"orig: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.text","title":"text","text":"text: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n Export text element to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this DocItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image of this DocItem.
The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem","title":"PictureItem","text":" Bases: FloatingItem
PictureItem.
Methods:
caption_text \u2013 Computes the caption as a single text.
export_to_doctags \u2013 Export picture to document tokens format.
export_to_document_tokens \u2013 Export to DocTags format.
export_to_html \u2013 Export picture to HTML format.
export_to_markdown \u2013 Export picture to Markdown format.
get_annotations \u2013 Get the annotations of this PictureItem.
get_image \u2013 Returns the image corresponding to this FloatingItem.
get_location_tokens \u2013 Get the location string for the BaseCell.
get_ref \u2013 get_ref.
Attributes:
annotations (Annotated[List[PictureDataType], deprecated('Field `annotations` is deprecated; use `meta` instead.')]) \u2013 captions (List[RefItem]) \u2013 children (List[RefItem]) \u2013 content_layer (ContentLayer) \u2013 footnotes (List[RefItem]) \u2013 image (Optional[ImageRef]) \u2013 label (Literal[PICTURE, CHART]) \u2013 meta (Optional[PictureMeta]) \u2013 model_config \u2013 parent (Optional[RefItem]) \u2013 prov (List[ProvenanceItem]) \u2013 references (List[RefItem]) \u2013 self_ref (str) \u2013 annotations: Annotated[List[PictureDataType], deprecated('Field `annotations` is deprecated; use `meta` instead.')] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.captions","title":"captions","text":"captions: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.children","title":"children","text":"children: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.content_layer","title":"content_layer","text":"content_layer: ContentLayer = BODY\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.footnotes","title":"footnotes","text":"footnotes: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.image","title":"image","text":"image: Optional[ImageRef] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.label","title":"label","text":"label: Literal[PICTURE, CHART] = PICTURE\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.meta","title":"meta","text":"meta: Optional[PictureMeta] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.model_config","title":"model_config","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.parent","title":"parent","text":"parent: Optional[RefItem] = None\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.prov","title":"prov","text":"prov: List[ProvenanceItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.references","title":"references","text":"references: List[RefItem] = []\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.self_ref","title":"self_ref","text":"self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.caption_text","title":"caption_text","text":"caption_text(doc: DoclingDocument) -> str\n Computes the caption as a single text.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_doctags","title":"export_to_doctags","text":"export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_caption: bool = True, add_content: bool = True)\n Export picture to document tokens format.
Parameters:
doc (DoclingDocument) \u2013 \"DoclingDocument\":
new_line (str, default: '' ) \u2013 str (Default value = \"\") Deprecated
xsize (int, default: 500 ) \u2013 int: (Default value = 500)
ysize (int, default: 500 ) \u2013 int: (Default value = 500)
add_location (bool, default: True ) \u2013 bool: (Default value = True)
add_caption (bool, default: True ) \u2013 bool: (Default value = True)
add_content (bool, default: True ) \u2013 bool: (Default value = True)
export_to_document_tokens(*args, **kwargs)\n Export to DocTags format.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_html","title":"export_to_html","text":"export_to_html(doc: DoclingDocument, add_caption: bool = True, image_mode: ImageRefMode = PLACEHOLDER) -> str\n Export picture to HTML format.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_markdown","title":"export_to_markdown","text":"export_to_markdown(doc: DoclingDocument, add_caption: bool = True, image_mode: ImageRefMode = EMBEDDED, image_placeholder: str = '<!-- image -->') -> str\n Export picture to Markdown format.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_annotations","title":"get_annotations","text":"get_annotations() -> Sequence[BaseAnnotation]\n Get the annotations of this PictureItem.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_image","title":"get_image","text":"get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n Returns the image corresponding to this FloatingItem.
This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.
In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_location_tokens","title":"get_location_tokens","text":"get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n Get the location string for the BaseCell.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_ref","title":"get_ref","text":"get_ref() -> RefItem\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef","title":"ImageRef","text":" Bases: BaseModel
ImageRef.
Methods:
from_pil \u2013 Construct ImageRef from a PIL Image.
validate_mimetype \u2013 validate_mimetype.
Attributes:
dpi (int) \u2013 mimetype (str) \u2013 pil_image (Optional[Image]) \u2013 Return the PIL Image.
size (Size) \u2013 uri (Union[AnyUrl, Path]) \u2013 dpi: int\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.mimetype","title":"mimetype","text":"mimetype: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.pil_image","title":"pil_image","text":"pil_image: Optional[Image]\n Return the PIL Image.
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.size","title":"size","text":"size: Size\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.uri","title":"uri","text":"uri: Union[AnyUrl, Path] = Field(union_mode='left_to_right')\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.from_pil","title":"from_pil","text":"from_pil(image: Image, dpi: int) -> Self\n Construct ImageRef from a PIL Image.
"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.validate_mimetype","title":"validate_mimetype","text":"validate_mimetype(v)\n validate_mimetype.
"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass","title":"PictureClassificationClass","text":" Bases: BaseModel
PictureClassificationClass.
Attributes:
class_name (str) \u2013 confidence (float) \u2013 class_name: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass.confidence","title":"confidence","text":"confidence: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData","title":"PictureClassificationData","text":" Bases: BaseAnnotation
PictureClassificationData.
Attributes:
kind (Literal['classification']) \u2013 predicted_classes (List[PictureClassificationClass]) \u2013 provenance (str) \u2013 kind: Literal['classification'] = 'classification'\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.predicted_classes","title":"predicted_classes","text":"predicted_classes: List[PictureClassificationClass]\n"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.provenance","title":"provenance","text":"provenance: str\n"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem","title":"RefItem","text":" Bases: BaseModel
RefItem.
Methods:
get_ref \u2013 get_ref.
resolve \u2013 Resolve the path in the document.
Attributes:
cref (str) \u2013 model_config \u2013 cref: str = Field(alias='$ref', pattern=_JSON_POINTER_REGEX)\n"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.model_config","title":"model_config","text":"model_config = ConfigDict(populate_by_name=True)\n"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.get_ref","title":"get_ref","text":"get_ref()\n get_ref.
"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.resolve","title":"resolve","text":"resolve(doc: DoclingDocument)\n Resolve the path in the document.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox","title":"BoundingBox","text":" Bases: BaseModel
BoundingBox.
Methods:
area \u2013 area.
as_tuple \u2013 as_tuple.
enclosing_bbox \u2013 Create a bounding box that covers all of the given boxes.
expand_by_scale \u2013 Expand the bounding box by the given x and y scale factors.
from_tuple \u2013 from_tuple.
intersection_area_with \u2013 Calculate the intersection area with another bounding box.
intersection_over_self \u2013 intersection_over_self.
intersection_over_union \u2013 intersection_over_union.
is_above \u2013 is_above.
is_horizontally_connected \u2013 is_horizontally_connected.
is_left_of \u2013 is_left_of.
is_strictly_above \u2013 is_strictly_above.
is_strictly_left_of \u2013 is_strictly_left_of.
normalized \u2013 normalized.
overlaps \u2013 overlaps.
overlaps_horizontally \u2013 Check if two bounding boxes overlap horizontally.
overlaps_vertically \u2013 Check if two bounding boxes overlap vertically.
overlaps_vertically_with_iou \u2013 Check if two bounding boxes overlap vertically with an IoU above the given threshold.
resize_by_scale \u2013 resize_by_scale.
scale_to_size \u2013 scale_to_size.
scaled \u2013 scaled.
to_bottom_left_origin \u2013 to_bottom_left_origin.
to_top_left_origin \u2013 to_top_left_origin.
union_area_with \u2013 Calculates the union area with another bounding box.
x_overlap_with \u2013 Calculates the horizontal overlap with another bounding box.
x_union_with \u2013 Calculates the horizontal union dimension with another bounding box.
y_overlap_with \u2013 Calculates the vertical overlap with another bounding box, respecting coordinate origin.
y_union_with \u2013 Calculates the vertical union dimension with another bounding box, respecting coordinate origin.
Attributes:
b (float) \u2013 coord_origin (CoordOrigin) \u2013 height \u2013 height.
l (float) \u2013 r (float) \u2013 t (float) \u2013 width \u2013 width.
b: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.coord_origin","title":"coord_origin","text":"coord_origin: CoordOrigin = TOPLEFT\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.height","title":"height","text":"height\n height.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.l","title":"l","text":"l: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.r","title":"r","text":"r: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.t","title":"t","text":"t: float\n"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.width","title":"width","text":"width\n width.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.area","title":"area","text":"area() -> float\n area.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.as_tuple","title":"as_tuple","text":"as_tuple() -> Tuple[float, float, float, float]\n as_tuple.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.enclosing_bbox","title":"enclosing_bbox","text":"enclosing_bbox(boxes: List[BoundingBox]) -> BoundingBox\n Create a bounding box that covers all of the given boxes.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.expand_by_scale","title":"expand_by_scale","text":"expand_by_scale(x_scale: float, y_scale: float) -> BoundingBox\n expand_to_size.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.from_tuple","title":"from_tuple","text":"from_tuple(coord: Tuple[float, ...], origin: CoordOrigin)\n from_tuple.
Parameters:
coord (Tuple[float, ...]) \u2013 Tuple[float:
...] \u2013 origin (CoordOrigin) \u2013 CoordOrigin:
intersection_area_with(other: BoundingBox) -> float\n Calculate the intersection area with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_over_self","title":"intersection_over_self","text":"intersection_over_self(other: BoundingBox, eps: float = 1e-06) -> float\n intersection_over_self.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_over_union","title":"intersection_over_union","text":"intersection_over_union(other: BoundingBox, eps: float = 1e-06) -> float\n intersection_over_union.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_above","title":"is_above","text":"is_above(other: BoundingBox) -> bool\n is_above.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_horizontally_connected","title":"is_horizontally_connected","text":"is_horizontally_connected(elem_i: BoundingBox, elem_j: BoundingBox) -> bool\n is_horizontally_connected.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_left_of","title":"is_left_of","text":"is_left_of(other: BoundingBox) -> bool\n is_left_of.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_strictly_above","title":"is_strictly_above","text":"is_strictly_above(other: BoundingBox, eps: float = 0.001) -> bool\n is_strictly_above.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_strictly_left_of","title":"is_strictly_left_of","text":"is_strictly_left_of(other: BoundingBox, eps: float = 0.001) -> bool\n is_strictly_left_of.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.normalized","title":"normalized","text":"normalized(page_size: Size)\n normalized.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps","title":"overlaps","text":"overlaps(other: BoundingBox) -> bool\n overlaps.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_horizontally","title":"overlaps_horizontally","text":"overlaps_horizontally(other: BoundingBox) -> bool\n Check if two bounding boxes overlap horizontally.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_vertically","title":"overlaps_vertically","text":"overlaps_vertically(other: BoundingBox) -> bool\n Check if two bounding boxes overlap vertically.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_vertically_with_iou","title":"overlaps_vertically_with_iou","text":"overlaps_vertically_with_iou(other: BoundingBox, iou: float) -> bool\n overlaps_y_with_iou.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.resize_by_scale","title":"resize_by_scale","text":"resize_by_scale(x_scale: float, y_scale: float)\n resize_by_scale.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.scale_to_size","title":"scale_to_size","text":"scale_to_size(old_size: Size, new_size: Size)\n scale_to_size.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.scaled","title":"scaled","text":"scaled(scale: float)\n scaled.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.to_bottom_left_origin","title":"to_bottom_left_origin","text":"to_bottom_left_origin(page_height: float) -> BoundingBox\n to_bottom_left_origin.
Parameters:
page_height (float) \u2013 to_top_left_origin(page_height: float) -> BoundingBox\n to_top_left_origin.
Parameters:
page_height (float) \u2013 union_area_with(other: BoundingBox) -> float\n Calculates the union area with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.x_overlap_with","title":"x_overlap_with","text":"x_overlap_with(other: BoundingBox) -> float\n Calculates the horizontal overlap with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.x_union_with","title":"x_union_with","text":"x_union_with(other: BoundingBox) -> float\n Calculates the horizontal union dimension with another bounding box.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.y_overlap_with","title":"y_overlap_with","text":"y_overlap_with(other: BoundingBox) -> float\n Calculates the vertical overlap with another bounding box, respecting coordinate origin.
"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.y_union_with","title":"y_union_with","text":"y_union_with(other: BoundingBox) -> float\n Calculates the vertical union dimension with another bounding box, respecting coordinate origin.
"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin","title":"CoordOrigin","text":" Bases: str, Enum
CoordOrigin.
Attributes:
BOTTOMLEFT \u2013 TOPLEFT \u2013 BOTTOMLEFT = 'BOTTOMLEFT'\n"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin.TOPLEFT","title":"TOPLEFT","text":"TOPLEFT = 'TOPLEFT'\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode","title":"ImageRefMode","text":" Bases: str, Enum
ImageRefMode.
Attributes:
EMBEDDED \u2013 PLACEHOLDER \u2013 REFERENCED \u2013 EMBEDDED = 'embedded'\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.PLACEHOLDER","title":"PLACEHOLDER","text":"PLACEHOLDER = 'placeholder'\n"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.REFERENCED","title":"REFERENCED","text":"REFERENCED = 'referenced'\n"},{"location":"reference/docling_document/#docling_core.types.doc.Size","title":"Size","text":" Bases: BaseModel
Size.
Methods:
as_tuple \u2013 as_tuple.
Attributes:
height (float) \u2013 width (float) \u2013 height: float = 0.0\n"},{"location":"reference/docling_document/#docling_core.types.doc.Size.width","title":"width","text":"width: float = 0.0\n"},{"location":"reference/docling_document/#docling_core.types.doc.Size.as_tuple","title":"as_tuple","text":"as_tuple()\n as_tuple.
"},{"location":"reference/document_converter/","title":"Document converter","text":"This is an automatic generated API reference of the main components of Docling.
"},{"location":"reference/document_converter/#docling.document_converter","title":"document_converter","text":"Classes:
DocumentConverter \u2013 ConversionResult \u2013 ConversionStatus \u2013 FormatOption \u2013 InputFormat \u2013 A document format supported by document backend parsers.
PdfFormatOption \u2013 ImageFormatOption \u2013 StandardPdfPipeline \u2013 High-performance PDF pipeline with multi-threaded stages.
WordFormatOption \u2013 PowerpointFormatOption \u2013 MarkdownFormatOption \u2013 AsciiDocFormatOption \u2013 HTMLFormatOption \u2013 SimplePipeline \u2013 SimpleModelPipeline.
DocumentConverter(allowed_formats: Optional[list[InputFormat]] = None, format_options: Optional[dict[InputFormat, FormatOption]] = None)\n Methods:
convert \u2013 convert_all \u2013 convert_string \u2013 initialize_pipeline \u2013 Initialize the conversion pipeline for the selected format.
Attributes:
allowed_formats \u2013 format_to_options (dict[InputFormat, FormatOption]) \u2013 initialized_pipelines (dict[tuple[Type[BasePipeline], str], BasePipeline]) \u2013 instance-attribute","text":"allowed_formats = allowed_formats if allowed_formats is not None else list(InputFormat)\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.format_to_options","title":"format_to_options instance-attribute","text":"format_to_options: dict[InputFormat, FormatOption] = {format: (_get_default_option(format=format) if (custom_option := (get(format))) is None else custom_option) for format in (allowed_formats)}\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.initialized_pipelines","title":"initialized_pipelines instance-attribute","text":"initialized_pipelines: dict[tuple[Type[BasePipeline], str], BasePipeline] = {}\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert","title":"convert","text":"convert(source: Union[Path, str, DocumentStream], headers: Optional[dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> ConversionResult\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert_all","title":"convert_all","text":"convert_all(source: Iterable[Union[Path, str, DocumentStream]], headers: Optional[dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> Iterator[ConversionResult]\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert_string","title":"convert_string","text":"convert_string(content: str, format: InputFormat, name: Optional[str] = None) -> ConversionResult\n"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.initialize_pipeline","title":"initialize_pipeline","text":"initialize_pipeline(format: InputFormat)\n Initialize the conversion pipeline for the selected format.
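A hedged sketch of the conversion entry points; the file names and HTML string are placeholders:

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter()

# single document, restricted to the first ten pages
result = converter.convert("report.pdf", page_range=(1, 10))
print(result.status)
print(result.document.export_to_markdown())

# lazy batch conversion; per-document errors are reported instead of raised
for res in converter.convert_all(["report.pdf", "notes.docx"], raises_on_error=False):
    print(res.status, len(res.errors))

# convert in-memory content of a known format
res = converter.convert_string("<h1>Hello</h1>", format=InputFormat.HTML, name="hello.html")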
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult","title":"ConversionResult","text":" Bases: ConversionAssets
Methods:
load \u2013 Load a ConversionAssets.
save \u2013 Serialize the full ConversionAssets to JSON.
Attributes:
assembled (AssembledUnit) \u2013 confidence (ConfidenceReport) \u2013 document (DoclingDocument) \u2013 errors (list[ErrorItem]) \u2013 input (InputDocument) \u2013 legacy_document \u2013 pages (list[Page]) \u2013 status (ConversionStatus) \u2013 timestamp (Optional[str]) \u2013 timings (dict[str, ProfilingItem]) \u2013 version (DoclingVersion) \u2013 class-attribute instance-attribute","text":"assembled: AssembledUnit = AssembledUnit()\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.confidence","title":"confidence class-attribute instance-attribute","text":"confidence: ConfidenceReport = Field(default_factory=ConfidenceReport)\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.document","title":"document class-attribute instance-attribute","text":"document: DoclingDocument = _EMPTY_DOCLING_DOC\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.errors","title":"errors class-attribute instance-attribute","text":"errors: list[ErrorItem] = []\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.input","title":"input instance-attribute","text":"input: InputDocument\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.legacy_document","title":"legacy_document property","text":"legacy_document\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.pages","title":"pages class-attribute instance-attribute","text":"pages: list[Page] = []\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.status","title":"status class-attribute instance-attribute","text":"status: ConversionStatus = PENDING\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.timestamp","title":"timestamp class-attribute instance-attribute","text":"timestamp: Optional[str] = None\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.timings","title":"timings class-attribute instance-attribute","text":"timings: dict[str, ProfilingItem] = {}\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.version","title":"version class-attribute instance-attribute","text":"version: DoclingVersion = DoclingVersion()\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.load","title":"load classmethod","text":"load(filename: Union[str, Path]) -> ConversionAssets\n Load a ConversionAssets.
"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.save","title":"save","text":"save(*, filename: Union[str, Path], indent: Optional[int] = 2)\n Serialize the full ConversionAssets to JSON.
"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus","title":"ConversionStatus","text":" Bases: str, Enum
Attributes:
FAILURE \u2013 PARTIAL_SUCCESS \u2013 PENDING \u2013 SKIPPED \u2013 STARTED \u2013 SUCCESS \u2013 class-attribute instance-attribute","text":"FAILURE = 'failure'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.PARTIAL_SUCCESS","title":"PARTIAL_SUCCESS class-attribute instance-attribute","text":"PARTIAL_SUCCESS = 'partial_success'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.PENDING","title":"PENDING class-attribute instance-attribute","text":"PENDING = 'pending'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.SKIPPED","title":"SKIPPED class-attribute instance-attribute","text":"SKIPPED = 'skipped'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.STARTED","title":"STARTED class-attribute instance-attribute","text":"STARTED = 'started'\n"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.SUCCESS","title":"SUCCESS class-attribute instance-attribute","text":"SUCCESS = 'success'\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption","title":"FormatOption","text":" Bases: BaseFormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type[BasePipeline]) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 instance-attribute","text":"backend: Type[AbstractDocumentBackend]\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.pipeline_cls","title":"pipeline_cls instance-attribute","text":"pipeline_cls: Type[BasePipeline]\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat","title":"InputFormat","text":" Bases: str, Enum
A document format supported by document backend parsers.
Attributes:
ASCIIDOC \u2013 AUDIO \u2013 CSV \u2013 DOCX \u2013 HTML \u2013 IMAGE \u2013 JSON_DOCLING \u2013 MD \u2013 METS_GBS \u2013 PDF \u2013 PPTX \u2013 VTT \u2013 XLSX \u2013 XML_JATS \u2013 XML_USPTO \u2013 class-attribute instance-attribute","text":"ASCIIDOC = 'asciidoc'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.AUDIO","title":"AUDIO class-attribute instance-attribute","text":"AUDIO = 'audio'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.CSV","title":"CSV class-attribute instance-attribute","text":"CSV = 'csv'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.DOCX","title":"DOCX class-attribute instance-attribute","text":"DOCX = 'docx'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.HTML","title":"HTML class-attribute instance-attribute","text":"HTML = 'html'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.IMAGE","title":"IMAGE class-attribute instance-attribute","text":"IMAGE = 'image'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.JSON_DOCLING","title":"JSON_DOCLING class-attribute instance-attribute","text":"JSON_DOCLING = 'json_docling'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.MD","title":"MD class-attribute instance-attribute","text":"MD = 'md'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.METS_GBS","title":"METS_GBS class-attribute instance-attribute","text":"METS_GBS = 'mets_gbs'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.PDF","title":"PDF class-attribute instance-attribute","text":"PDF = 'pdf'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.PPTX","title":"PPTX class-attribute instance-attribute","text":"PPTX = 'pptx'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.VTT","title":"VTT class-attribute instance-attribute","text":"VTT = 'vtt'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XLSX","title":"XLSX class-attribute instance-attribute","text":"XLSX = 'xlsx'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XML_JATS","title":"XML_JATS class-attribute instance-attribute","text":"XML_JATS = 'xml_jats'\n"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XML_USPTO","title":"XML_USPTO class-attribute instance-attribute","text":"XML_USPTO = 'xml_uspto'\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption","title":"PdfFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[PdfBackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[PdfBackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = StandardPdfPipeline\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption","title":"ImageFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = ImageDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = StandardPdfPipeline\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline","title":"StandardPdfPipeline","text":"StandardPdfPipeline(pipeline_options: ThreadedPdfPipelineOptions)\n Bases: ConvertPipeline
High-performance PDF pipeline with multi-threaded stages.
Methods:
execute \u2013 get_default_options \u2013 is_backend_supported \u2013 Attributes:
artifacts_path (Optional[Path]) \u2013 build_pipe (List[Callable]) \u2013 enrichment_pipe \u2013 keep_images \u2013 pipeline_options (ThreadedPdfPipelineOptions) \u2013 instance-attribute","text":"artifacts_path: Optional[Path] = None\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.build_pipe","title":"build_pipe instance-attribute","text":"build_pipe: List[Callable] = []\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.enrichment_pipe","title":"enrichment_pipe instance-attribute","text":"enrichment_pipe = [DocumentPictureClassifier(enabled=do_picture_classification, artifacts_path=artifacts_path, options=DocumentPictureClassifierOptions(), accelerator_options=accelerator_options), picture_description_model]\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.keep_images","title":"keep_images instance-attribute","text":"keep_images = False\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.pipeline_options","title":"pipeline_options instance-attribute","text":"pipeline_options: ThreadedPdfPipelineOptions = pipeline_options\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.execute","title":"execute","text":"execute(in_doc: InputDocument, raises_on_error: bool) -> ConversionResult\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.get_default_options","title":"get_default_options classmethod","text":"get_default_options() -> ThreadedPdfPipelineOptions\n"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.is_backend_supported","title":"is_backend_supported classmethod","text":"is_backend_supported(backend: AbstractDocumentBackend) -> bool\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption","title":"WordFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = MsWordDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption","title":"PowerpointFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = MsPowerpointDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption","title":"MarkdownFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[MarkdownBackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = MarkdownDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[MarkdownBackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption","title":"AsciiDocFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[BackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = AsciiDocBackend\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[BackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption","title":"HTMLFormatOption","text":" Bases: FormatOption
Methods:
set_optional_field_default \u2013 Attributes:
backend (Type[AbstractDocumentBackend]) \u2013 backend_options (Optional[HTMLBackendOptions]) \u2013 model_config \u2013 pipeline_cls (Type) \u2013 pipeline_options (Optional[PipelineOptions]) \u2013 class-attribute instance-attribute","text":"backend: Type[AbstractDocumentBackend] = HTMLDocumentBackend\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.backend_options","title":"backend_options class-attribute instance-attribute","text":"backend_options: Optional[HTMLBackendOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(arbitrary_types_allowed=True)\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.pipeline_cls","title":"pipeline_cls class-attribute instance-attribute","text":"pipeline_cls: Type = SimplePipeline\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.pipeline_options","title":"pipeline_options class-attribute instance-attribute","text":"pipeline_options: Optional[PipelineOptions] = None\n"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"set_optional_field_default() -> Self\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline","title":"SimplePipeline","text":"SimplePipeline(pipeline_options: ConvertPipelineOptions)\n Bases: ConvertPipeline
SimpleModelPipeline.
This class is currently used for formats and backends that produce DoclingDocument output directly.
Methods:
execute \u2013 get_default_options \u2013 is_backend_supported \u2013 Attributes:
artifacts_path (Optional[Path]) \u2013 build_pipe (List[Callable]) \u2013 enrichment_pipe \u2013 keep_images \u2013 pipeline_options (ConvertPipelineOptions) \u2013 instance-attribute","text":"artifacts_path: Optional[Path] = None\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.build_pipe","title":"build_pipe instance-attribute","text":"build_pipe: List[Callable] = []\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.enrichment_pipe","title":"enrichment_pipe instance-attribute","text":"enrichment_pipe = [DocumentPictureClassifier(enabled=do_picture_classification, artifacts_path=artifacts_path, options=DocumentPictureClassifierOptions(), accelerator_options=accelerator_options), picture_description_model]\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.keep_images","title":"keep_images instance-attribute","text":"keep_images = False\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.pipeline_options","title":"pipeline_options instance-attribute","text":"pipeline_options: ConvertPipelineOptions\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.execute","title":"execute","text":"execute(in_doc: InputDocument, raises_on_error: bool) -> ConversionResult\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.get_default_options","title":"get_default_options classmethod","text":"get_default_options() -> ConvertPipelineOptions\n"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.is_backend_supported","title":"is_backend_supported classmethod","text":"is_backend_supported(backend: AbstractDocumentBackend)\n"},{"location":"reference/pipeline_options/","title":"Pipeline options","text":"Pipeline options allow to customize the execution of the models during the conversion pipeline. This includes options for the OCR engines, the table model as well as enrichment options which can be enabled with do_xyz = True.
This is an automatically generated API reference of all the pipeline options available in Docling.
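For example, the do_xyz enrichment switches mentioned above are plain boolean fields on the options object. A short sketch, with an illustrative input file, wiring the options into the converter:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

opts = PdfPipelineOptions()
opts.do_code_enrichment = True         # enable code-block understanding
opts.do_formula_enrichment = True      # enable formula understanding
opts.do_picture_classification = True  # classify pictures found in the document

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("report.pdf")  # illustrative input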
"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options","title":"pipeline_options","text":"Classes:
AsrPipelineOptions \u2013 BaseLayoutOptions \u2013 Base options for layout models.
BaseOptions \u2013 Base class for options.
BaseTableStructureOptions \u2013 Base options for table structure models.
ConvertPipelineOptions \u2013 Base convert pipeline options.
EasyOcrOptions \u2013 Options for the EasyOCR engine.
LayoutOptions \u2013 Options for layout processing.
OcrAutoOptions \u2013 Options for picking the OCR engine automatically.
OcrEngine \u2013 Enum of valid OCR engines.
OcrMacOptions \u2013 Options for the Mac OCR engine.
OcrOptions \u2013 OCR options.
PaginatedPipelineOptions \u2013 PdfBackend \u2013 Enum of valid PDF backends.
PdfPipelineOptions \u2013 Options for the PDF pipeline.
PictureDescriptionApiOptions \u2013 PictureDescriptionBaseOptions \u2013 PictureDescriptionVlmOptions \u2013 PipelineOptions \u2013 Base pipeline options.
ProcessingPipeline \u2013 RapidOcrOptions \u2013 Options for the RapidOCR engine.
TableFormerMode \u2013 Modes for the TableFormer model.
TableStructureOptions \u2013 Options for the table structure.
TesseractCliOcrOptions \u2013 Options for the TesseractCli engine.
TesseractOcrOptions \u2013 Options for the Tesseract engine.
ThreadedPdfPipelineOptions \u2013 Pipeline options for the threaded PDF pipeline with batching and backpressure control.
VlmExtractionPipelineOptions \u2013 Options for the extraction pipeline.
VlmPipelineOptions \u2013 Attributes:
granite_picture_description \u2013 smolvlm_picture_description \u2013 module-attribute","text":"granite_picture_description = PictureDescriptionVlmOptions(repo_id='ibm-granite/granite-vision-3.3-2b', prompt='What is shown in this image?')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.smolvlm_picture_description","title":"smolvlm_picture_description module-attribute","text":"smolvlm_picture_description = PictureDescriptionVlmOptions(repo_id='HuggingFaceTB/SmolVLM-256M-Instruct')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions","title":"AsrPipelineOptions","text":" Bases: PipelineOptions
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 asr_options (Union[InlineAsrOptions]) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 kind (str) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.asr_options","title":"asr_options class-attribute instance-attribute","text":"asr_options: Union[InlineAsrOptions] = WHISPER_TINY\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseLayoutOptions","title":"BaseLayoutOptions","text":" Bases: BaseOptions
Base options for layout models.
Attributes:
keep_empty_clusters (bool) \u2013 kind (str) \u2013 skip_cell_assignment (bool) \u2013 class-attribute instance-attribute","text":"keep_empty_clusters: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseLayoutOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseLayoutOptions.skip_cell_assignment","title":"skip_cell_assignment class-attribute instance-attribute","text":"skip_cell_assignment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseOptions","title":"BaseOptions","text":" Bases: BaseModel
Base class for options.
Attributes:
kind (str) \u2013 class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseTableStructureOptions","title":"BaseTableStructureOptions","text":" Bases: BaseOptions
Base options for table structure models.
Attributes:
kind (str) \u2013 class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions","title":"ConvertPipelineOptions","text":" Bases: PipelineOptions
Base convert pipeline options.
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 kind (str) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ConvertPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions","title":"EasyOcrOptions","text":" Bases: OcrOptions
Options for the EasyOCR engine.
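A short sketch of selecting EasyOCR for the PDF pipeline, using only the fields listed below; the language codes are EasyOCR's own:
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions

ocr = EasyOcrOptions(
    lang=["en", "de"],          # EasyOCR language codes
    force_full_page_ocr=False,  # keep the documented default (no full-page OCR)
    confidence_threshold=0.5,
)
opts = PdfPipelineOptions(do_ocr=True, ocr_options=ocr)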
Attributes:
bitmap_area_threshold (float) \u2013 confidence_threshold (float) \u2013 download_enabled (bool) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['easyocr']) \u2013 lang (List[str]) \u2013 model_config \u2013 model_storage_directory (Optional[str]) \u2013 recog_network (Optional[str]) \u2013 suppress_mps_warnings (bool) \u2013 use_gpu (Optional[bool]) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.confidence_threshold","title":"confidence_threshold class-attribute instance-attribute","text":"confidence_threshold: float = 0.5\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.download_enabled","title":"download_enabled class-attribute instance-attribute","text":"download_enabled: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.kind","title":"kind class-attribute","text":"kind: Literal['easyocr'] = 'easyocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['fr', 'de', 'es', 'en']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid', protected_namespaces=())\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.model_storage_directory","title":"model_storage_directory class-attribute instance-attribute","text":"model_storage_directory: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.recog_network","title":"recog_network class-attribute instance-attribute","text":"recog_network: Optional[str] = 'standard'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.suppress_mps_warnings","title":"suppress_mps_warnings class-attribute instance-attribute","text":"suppress_mps_warnings: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.use_gpu","title":"use_gpu class-attribute instance-attribute","text":"use_gpu: Optional[bool] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions","title":"LayoutOptions","text":" Bases: BaseLayoutOptions
Options for layout processing.
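The layout stage can be tuned with the fields below; a minimal sketch that flips one flag from its documented default:
from docling.datamodel.pipeline_options import LayoutOptions, PdfPipelineOptions

layout = LayoutOptions(keep_empty_clusters=True)  # documented default is False
opts = PdfPipelineOptions(layout_options=layout)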
Attributes:
create_orphan_clusters (bool) \u2013 keep_empty_clusters (bool) \u2013 kind (str) \u2013 model_spec (LayoutModelConfig) \u2013 skip_cell_assignment (bool) \u2013 class-attribute instance-attribute","text":"create_orphan_clusters: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.keep_empty_clusters","title":"keep_empty_clusters class-attribute instance-attribute","text":"keep_empty_clusters: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.kind","title":"kind class-attribute","text":"kind: str = 'docling_layout_default'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.model_spec","title":"model_spec class-attribute instance-attribute","text":"model_spec: LayoutModelConfig = DOCLING_LAYOUT_HERON\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.skip_cell_assignment","title":"skip_cell_assignment class-attribute instance-attribute","text":"skip_cell_assignment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrAutoOptions","title":"OcrAutoOptions","text":" Bases: OcrOptions
Options for picking the OCR engine automatically.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['auto']) \u2013 lang (List[str]) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrAutoOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrAutoOptions.kind","title":"kind class-attribute","text":"kind: Literal['auto'] = 'auto'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrAutoOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = []\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine","title":"OcrEngine","text":" Bases: str, Enum
Enum of valid OCR engines.
Attributes:
AUTO \u2013 EASYOCR \u2013 OCRMAC \u2013 RAPIDOCR \u2013 TESSERACT \u2013 TESSERACT_CLI \u2013 class-attribute instance-attribute","text":"AUTO = 'auto'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.EASYOCR","title":"EASYOCR class-attribute instance-attribute","text":"EASYOCR = 'easyocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.OCRMAC","title":"OCRMAC class-attribute instance-attribute","text":"OCRMAC = 'ocrmac'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.RAPIDOCR","title":"RAPIDOCR class-attribute instance-attribute","text":"RAPIDOCR = 'rapidocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.TESSERACT","title":"TESSERACT class-attribute instance-attribute","text":"TESSERACT = 'tesseract'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.TESSERACT_CLI","title":"TESSERACT_CLI class-attribute instance-attribute","text":"TESSERACT_CLI = 'tesseract_cli'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions","title":"OcrMacOptions","text":" Bases: OcrOptions
Options for the Mac OCR engine.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 framework (str) \u2013 kind (Literal['ocrmac']) \u2013 lang (List[str]) \u2013 model_config \u2013 recognition (str) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.framework","title":"framework class-attribute instance-attribute","text":"framework: str = 'vision'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.kind","title":"kind class-attribute","text":"kind: Literal['ocrmac'] = 'ocrmac'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['fr-FR', 'de-DE', 'es-ES', 'en-US']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.recognition","title":"recognition class-attribute instance-attribute","text":"recognition: str = 'accurate'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions","title":"OcrOptions","text":" Bases: BaseOptions
OCR options.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 kind (str) \u2013 lang (List[str]) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.lang","title":"lang instance-attribute","text":"lang: List[str]\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions","title":"PaginatedPipelineOptions","text":" Bases: ConvertPipelineOptions
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 generate_page_images (bool) \u2013 generate_picture_images (bool) \u2013 images_scale (float) \u2013 kind (str) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.generate_page_images","title":"generate_page_images class-attribute instance-attribute","text":"generate_page_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute instance-attribute","text":"generate_picture_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.images_scale","title":"images_scale class-attribute instance-attribute","text":"images_scale: float = 1.0\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend","title":"PdfBackend","text":" Bases: str, Enum
Enum of valid PDF backends.
Attributes:
DLPARSE_V1 \u2013 DLPARSE_V2 \u2013 DLPARSE_V4 \u2013 PYPDFIUM2 \u2013 class-attribute instance-attribute","text":"DLPARSE_V1 = 'dlparse_v1'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V2","title":"DLPARSE_V2 class-attribute instance-attribute","text":"DLPARSE_V2 = 'dlparse_v2'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V4","title":"DLPARSE_V4 class-attribute instance-attribute","text":"DLPARSE_V4 = 'dlparse_v4'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.PYPDFIUM2","title":"PYPDFIUM2 class-attribute instance-attribute","text":"PYPDFIUM2 = 'pypdfium2'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions","title":"PdfPipelineOptions","text":" Bases: PaginatedPipelineOptions
Options for the PDF pipeline.
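A sketch of a typical configuration touching a few of the fields listed below; values are illustrative:
from docling.datamodel.pipeline_options import PdfPipelineOptions

opts = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    generate_page_images=True,  # keep rendered page images on the result
    images_scale=2.0,           # render images at 2x scale
    document_timeout=300.0,     # per-document timeout (assumed to be seconds)
)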
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 batch_polling_interval_seconds (float) \u2013 do_code_enrichment (bool) \u2013 do_formula_enrichment (bool) \u2013 do_ocr (bool) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 do_table_structure (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 force_backend_text (bool) \u2013 generate_page_images (bool) \u2013 generate_parsed_pages (bool) \u2013 generate_picture_images (bool) \u2013 generate_table_images (bool) \u2013 images_scale (float) \u2013 kind (str) \u2013 layout_batch_size (int) \u2013 layout_options (BaseLayoutOptions) \u2013 ocr_batch_size (int) \u2013 ocr_options (OcrOptions) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 queue_max_size (int) \u2013 table_batch_size (int) \u2013 table_structure_options (BaseTableStructureOptions) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.batch_polling_interval_seconds","title":"batch_polling_interval_seconds class-attribute instance-attribute","text":"batch_polling_interval_seconds: float = 0.5\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_code_enrichment","title":"do_code_enrichment class-attribute instance-attribute","text":"do_code_enrichment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_formula_enrichment","title":"do_formula_enrichment class-attribute instance-attribute","text":"do_formula_enrichment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_ocr","title":"do_ocr class-attribute instance-attribute","text":"do_ocr: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_table_structure","title":"do_table_structure class-attribute instance-attribute","text":"do_table_structure: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = 
None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.force_backend_text","title":"force_backend_text class-attribute instance-attribute","text":"force_backend_text: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_page_images","title":"generate_page_images class-attribute instance-attribute","text":"generate_page_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_parsed_pages","title":"generate_parsed_pages class-attribute instance-attribute","text":"generate_parsed_pages: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute instance-attribute","text":"generate_picture_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_table_images","title":"generate_table_images class-attribute instance-attribute","text":"generate_table_images: bool = Field(default=False, deprecated='Field `generate_table_images` is deprecated. To obtain table images, set `PdfPipelineOptions.generate_page_images = True` before conversion and then use the `TableItem.get_image` function.')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.images_scale","title":"images_scale class-attribute instance-attribute","text":"images_scale: float = 1.0\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.layout_batch_size","title":"layout_batch_size class-attribute instance-attribute","text":"layout_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.layout_options","title":"layout_options class-attribute instance-attribute","text":"layout_options: BaseLayoutOptions = LayoutOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.ocr_batch_size","title":"ocr_batch_size class-attribute instance-attribute","text":"ocr_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.ocr_options","title":"ocr_options class-attribute instance-attribute","text":"ocr_options: OcrOptions = OcrAutoOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.queue_max_size","title":"queue_max_size class-attribute instance-attribute","text":"queue_max_size: int = 100\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.table_batch_size","title":"table_batch_size class-attribute 
instance-attribute","text":"table_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.table_structure_options","title":"table_structure_options class-attribute instance-attribute","text":"table_structure_options: BaseTableStructureOptions = TableStructureOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions","title":"PictureDescriptionApiOptions","text":" Bases: PictureDescriptionBaseOptions
Attributes:
batch_size (int) \u2013 concurrency (int) \u2013 headers (Dict[str, str]) \u2013 kind (Literal['api']) \u2013 params (Dict[str, Any]) \u2013 picture_area_threshold (float) \u2013 prompt (str) \u2013 provenance (str) \u2013 scale (float) \u2013 timeout (float) \u2013 url (AnyUrl) \u2013 class-attribute instance-attribute","text":"batch_size: int = 8\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.concurrency","title":"concurrency class-attribute instance-attribute","text":"concurrency: int = 1\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.headers","title":"headers class-attribute instance-attribute","text":"headers: Dict[str, str] = {}\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.kind","title":"kind class-attribute","text":"kind: Literal['api'] = 'api'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.params","title":"params class-attribute instance-attribute","text":"params: Dict[str, Any] = {}\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.picture_area_threshold","title":"picture_area_threshold class-attribute instance-attribute","text":"picture_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.prompt","title":"prompt class-attribute instance-attribute","text":"prompt: str = 'Describe this image in a few sentences.'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.provenance","title":"provenance class-attribute instance-attribute","text":"provenance: str = ''\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.scale","title":"scale class-attribute instance-attribute","text":"scale: float = 2\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.timeout","title":"timeout class-attribute instance-attribute","text":"timeout: float = 20\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.url","title":"url class-attribute instance-attribute","text":"url: AnyUrl = AnyUrl('http://localhost:8000/v1/chat/completions')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions","title":"PictureDescriptionBaseOptions","text":" Bases: BaseOptions
Attributes:
batch_size (int) \u2013 kind (str) \u2013 picture_area_threshold (float) \u2013 scale (float) \u2013 class-attribute instance-attribute","text":"batch_size: int = 8\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.picture_area_threshold","title":"picture_area_threshold class-attribute instance-attribute","text":"picture_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.scale","title":"scale class-attribute instance-attribute","text":"scale: float = 2\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions","title":"PictureDescriptionVlmOptions","text":" Bases: PictureDescriptionBaseOptions
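A sketch enabling picture descriptions with a local VLM; the repo_id is the one used by the smolvlm_picture_description default above, and the prompt matches the documented default:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmOptions,
)

opts = PdfPipelineOptions(
    do_picture_description=True,
    picture_description_options=PictureDescriptionVlmOptions(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
        prompt="Describe this image in a few sentences.",
    ),
)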
Attributes:
batch_size (int) \u2013 generation_config (Dict[str, Any]) \u2013 kind (Literal['vlm']) \u2013 picture_area_threshold (float) \u2013 prompt (str) \u2013 repo_cache_folder (str) \u2013 repo_id (str) \u2013 scale (float) \u2013 class-attribute instance-attribute","text":"batch_size: int = 8\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.generation_config","title":"generation_config class-attribute instance-attribute","text":"generation_config: Dict[str, Any] = dict(max_new_tokens=200, do_sample=False)\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.kind","title":"kind class-attribute","text":"kind: Literal['vlm'] = 'vlm'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.picture_area_threshold","title":"picture_area_threshold class-attribute instance-attribute","text":"picture_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.prompt","title":"prompt class-attribute instance-attribute","text":"prompt: str = 'Describe this image in a few sentences.'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.repo_cache_folder","title":"repo_cache_folder property","text":"repo_cache_folder: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.repo_id","title":"repo_id instance-attribute","text":"repo_id: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.scale","title":"scale class-attribute instance-attribute","text":"scale: float = 2\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions","title":"PipelineOptions","text":" Bases: BaseOptions
Base pipeline options.
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 kind (str) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline","title":"ProcessingPipeline","text":" Bases: str, Enum
Attributes:
ASR \u2013 LEGACY \u2013 STANDARD \u2013 VLM \u2013 class-attribute instance-attribute","text":"ASR = 'asr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.LEGACY","title":"LEGACY class-attribute instance-attribute","text":"LEGACY = 'legacy'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.STANDARD","title":"STANDARD class-attribute instance-attribute","text":"STANDARD = 'standard'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.VLM","title":"VLM class-attribute instance-attribute","text":"VLM = 'vlm'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions","title":"RapidOcrOptions","text":" Bases: OcrOptions
Options for the RapidOCR engine.
Attributes:
backend (Literal['onnxruntime', 'openvino', 'paddle', 'torch']) \u2013 bitmap_area_threshold (float) \u2013 cls_model_path (Optional[str]) \u2013 det_model_path (Optional[str]) \u2013 font_path (Optional[str]) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['rapidocr']) \u2013 lang (List[str]) \u2013 model_config \u2013 print_verbose (bool) \u2013 rapidocr_params (Dict[str, Any]) \u2013 rec_font_path (Optional[str]) \u2013 rec_keys_path (Optional[str]) \u2013 rec_model_path (Optional[str]) \u2013 text_score (float) \u2013 use_cls (Optional[bool]) \u2013 use_det (Optional[bool]) \u2013 use_rec (Optional[bool]) \u2013 class-attribute instance-attribute","text":"backend: Literal['onnxruntime', 'openvino', 'paddle', 'torch'] = 'onnxruntime'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.bitmap_area_threshold","title":"bitmap_area_threshold class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.cls_model_path","title":"cls_model_path class-attribute instance-attribute","text":"cls_model_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.det_model_path","title":"det_model_path class-attribute instance-attribute","text":"det_model_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.font_path","title":"font_path class-attribute instance-attribute","text":"font_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.kind","title":"kind class-attribute","text":"kind: Literal['rapidocr'] = 'rapidocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['english', 'chinese']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.print_verbose","title":"print_verbose class-attribute instance-attribute","text":"print_verbose: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rapidocr_params","title":"rapidocr_params class-attribute instance-attribute","text":"rapidocr_params: Dict[str, Any] = Field(default_factory=dict)\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_font_path","title":"rec_font_path class-attribute instance-attribute","text":"rec_font_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_keys_path","title":"rec_keys_path class-attribute instance-attribute","text":"rec_keys_path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_model_path","title":"rec_model_path class-attribute instance-attribute","text":"rec_model_path: Optional[str] = 
None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.text_score","title":"text_score class-attribute instance-attribute","text":"text_score: float = 0.5\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_cls","title":"use_cls class-attribute instance-attribute","text":"use_cls: Optional[bool] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_det","title":"use_det class-attribute instance-attribute","text":"use_det: Optional[bool] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_rec","title":"use_rec class-attribute instance-attribute","text":"use_rec: Optional[bool] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode","title":"TableFormerMode","text":" Bases: str, Enum
Modes for the TableFormer model.
Attributes:
ACCURATE \u2013 FAST \u2013 class-attribute instance-attribute","text":"ACCURATE = 'accurate'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode.FAST","title":"FAST class-attribute instance-attribute","text":"FAST = 'fast'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions","title":"TableStructureOptions","text":" Bases: BaseTableStructureOptions
Options for the table structure.
Attributes:
do_cell_matching (bool) \u2013 kind (str) \u2013 mode (TableFormerMode) \u2013 class-attribute instance-attribute","text":"do_cell_matching: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions.kind","title":"kind class-attribute","text":"kind: str = 'docling_tableformer'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions.mode","title":"mode class-attribute instance-attribute","text":"mode: TableFormerMode = ACCURATE\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions","title":"TesseractCliOcrOptions","text":" Bases: OcrOptions
Options for the TesseractCli engine.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['tesseract']) \u2013 lang (List[str]) \u2013 model_config \u2013 path (Optional[str]) \u2013 psm (Optional[int]) \u2013 tesseract_cmd (str) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.kind","title":"kind class-attribute","text":"kind: Literal['tesseract'] = 'tesseract'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['fra', 'deu', 'spa', 'eng']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.path","title":"path class-attribute instance-attribute","text":"path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.psm","title":"psm class-attribute instance-attribute","text":"psm: Optional[int] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.tesseract_cmd","title":"tesseract_cmd class-attribute instance-attribute","text":"tesseract_cmd: str = 'tesseract'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions","title":"TesseractOcrOptions","text":" Bases: OcrOptions
Options for the Tesseract engine.
Attributes:
bitmap_area_threshold (float) \u2013 force_full_page_ocr (bool) \u2013 kind (Literal['tesserocr']) \u2013 lang (List[str]) \u2013 model_config \u2013 path (Optional[str]) \u2013 psm (Optional[int]) \u2013 class-attribute instance-attribute","text":"bitmap_area_threshold: float = 0.05\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.force_full_page_ocr","title":"force_full_page_ocr class-attribute instance-attribute","text":"force_full_page_ocr: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.kind","title":"kind class-attribute","text":"kind: Literal['tesserocr'] = 'tesserocr'\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.lang","title":"lang class-attribute instance-attribute","text":"lang: List[str] = ['fra', 'deu', 'spa', 'eng']\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.model_config","title":"model_config class-attribute instance-attribute","text":"model_config = ConfigDict(extra='forbid')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.path","title":"path class-attribute instance-attribute","text":"path: Optional[str] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.psm","title":"psm class-attribute instance-attribute","text":"psm: Optional[int] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions","title":"ThreadedPdfPipelineOptions","text":" Bases: PdfPipelineOptions
Pipeline options for the threaded PDF pipeline with batching and backpressure control.
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 batch_polling_interval_seconds (float) \u2013 do_code_enrichment (bool) \u2013 do_formula_enrichment (bool) \u2013 do_ocr (bool) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 do_table_structure (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 force_backend_text (bool) \u2013 generate_page_images (bool) \u2013 generate_parsed_pages (bool) \u2013 generate_picture_images (bool) \u2013 generate_table_images (bool) \u2013 images_scale (float) \u2013 kind (str) \u2013 layout_batch_size (int) \u2013 layout_options (BaseLayoutOptions) \u2013 ocr_batch_size (int) \u2013 ocr_options (OcrOptions) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 queue_max_size (int) \u2013 table_batch_size (int) \u2013 table_structure_options (BaseTableStructureOptions) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.batch_polling_interval_seconds","title":"batch_polling_interval_seconds class-attribute instance-attribute","text":"batch_polling_interval_seconds: float = 0.5\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_code_enrichment","title":"do_code_enrichment class-attribute instance-attribute","text":"do_code_enrichment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_formula_enrichment","title":"do_formula_enrichment class-attribute instance-attribute","text":"do_formula_enrichment: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_ocr","title":"do_ocr class-attribute instance-attribute","text":"do_ocr: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.do_table_structure","title":"do_table_structure class-attribute instance-attribute","text":"do_table_structure: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = 
None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.force_backend_text","title":"force_backend_text class-attribute instance-attribute","text":"force_backend_text: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.generate_page_images","title":"generate_page_images class-attribute instance-attribute","text":"generate_page_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.generate_parsed_pages","title":"generate_parsed_pages class-attribute instance-attribute","text":"generate_parsed_pages: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute instance-attribute","text":"generate_picture_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.generate_table_images","title":"generate_table_images class-attribute instance-attribute","text":"generate_table_images: bool = Field(default=False, deprecated='Field `generate_table_images` is deprecated. To obtain table images, set `PdfPipelineOptions.generate_page_images = True` before conversion and then use the `TableItem.get_image` function.')\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.images_scale","title":"images_scale class-attribute instance-attribute","text":"images_scale: float = 1.0\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.layout_batch_size","title":"layout_batch_size class-attribute instance-attribute","text":"layout_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.layout_options","title":"layout_options class-attribute instance-attribute","text":"layout_options: BaseLayoutOptions = LayoutOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.ocr_batch_size","title":"ocr_batch_size class-attribute instance-attribute","text":"ocr_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.ocr_options","title":"ocr_options class-attribute instance-attribute","text":"ocr_options: OcrOptions = OcrAutoOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.queue_max_size","title":"queue_max_size class-attribute instance-attribute","text":"queue_max_size: int = 
100\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.table_batch_size","title":"table_batch_size class-attribute instance-attribute","text":"table_batch_size: int = 4\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ThreadedPdfPipelineOptions.table_structure_options","title":"table_structure_options class-attribute instance-attribute","text":"table_structure_options: BaseTableStructureOptions = TableStructureOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions","title":"VlmExtractionPipelineOptions","text":" Bases: PipelineOptions
Options for the extraction pipeline.
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 kind (str) \u2013 vlm_options (Union[InlineVlmOptions]) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmExtractionPipelineOptions.vlm_options","title":"vlm_options class-attribute instance-attribute","text":"vlm_options: Union[InlineVlmOptions] = NU_EXTRACT_2B_TRANSFORMERS\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions","title":"VlmPipelineOptions","text":" Bases: PaginatedPipelineOptions
Attributes:
accelerator_options (AcceleratorOptions) \u2013 allow_external_plugins (bool) \u2013 artifacts_path (Optional[Union[Path, str]]) \u2013 do_picture_classification (bool) \u2013 do_picture_description (bool) \u2013 document_timeout (Optional[float]) \u2013 enable_remote_services (bool) \u2013 force_backend_text (bool) \u2013 generate_page_images (bool) \u2013 generate_picture_images (bool) \u2013 images_scale (float) \u2013 kind (str) \u2013 picture_description_options (PictureDescriptionBaseOptions) \u2013 vlm_options (Union[InlineVlmOptions, ApiVlmOptions]) \u2013 class-attribute instance-attribute","text":"accelerator_options: AcceleratorOptions = AcceleratorOptions()\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.allow_external_plugins","title":"allow_external_plugins class-attribute instance-attribute","text":"allow_external_plugins: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.artifacts_path","title":"artifacts_path class-attribute instance-attribute","text":"artifacts_path: Optional[Union[Path, str]] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.do_picture_classification","title":"do_picture_classification class-attribute instance-attribute","text":"do_picture_classification: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.do_picture_description","title":"do_picture_description class-attribute instance-attribute","text":"do_picture_description: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.document_timeout","title":"document_timeout class-attribute instance-attribute","text":"document_timeout: Optional[float] = None\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.enable_remote_services","title":"enable_remote_services class-attribute instance-attribute","text":"enable_remote_services: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.force_backend_text","title":"force_backend_text class-attribute instance-attribute","text":"force_backend_text: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.generate_page_images","title":"generate_page_images class-attribute instance-attribute","text":"generate_page_images: bool = True\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.generate_picture_images","title":"generate_picture_images class-attribute instance-attribute","text":"generate_picture_images: bool = False\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.images_scale","title":"images_scale class-attribute instance-attribute","text":"images_scale: float = 1.0\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.kind","title":"kind class-attribute","text":"kind: str\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.picture_description_options","title":"picture_description_options class-attribute instance-attribute","text":"picture_description_options: PictureDescriptionBaseOptions = 
smolvlm_picture_description\n"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.vlm_options","title":"vlm_options class-attribute instance-attribute","text":"vlm_options: Union[InlineVlmOptions, ApiVlmOptions] = GRANITEDOCLING_TRANSFORMERS\n"},{"location":"usage/","title":"Index","text":""},{"location":"usage/#basic-usage","title":"Basic usage","text":""},{"location":"usage/#python","title":"Python","text":"In Docling, working with documents is as simple as:
For example, the snippet below shows conversion with export to Markdown:
from docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\" # file path or URL\nconverter = DocumentConverter()\ndoc = converter.convert(source).document\n\nprint(doc.export_to_markdown()) # output: \"### Docling Technical Report[...]\"\n Docling supports a wide array of file formats and, as outlined in the architecture guide, provides a versatile document model along with a full suite of supported operations.
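Beyond Markdown, other exports can be produced from the same doc object; a minimal sketch, assuming the DoclingDocument helpers export_to_html and export_to_dict provided by recent Docling versions:
html = doc.export_to_html() # HTML serialization (assumed helper)\ndata = doc.export_to_dict() # lossless dictionary representation, e.g. for JSON serialization (assumed helper)\n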
"},{"location":"usage/#cli","title":"CLI","text":"You can additionally use Docling directly from your terminal, for instance:
docling https://arxiv.org/pdf/2206.01062\n The CLI provides various options, such as \ud83e\udd5aGraniteDocling (incl. MLX acceleration) & other VLMs:
docling --pipeline vlm --vlm-model granite_docling https://arxiv.org/pdf/2206.01062\n For all available options, run docling --help or check the CLI reference.
Check out the Usage subpages (navigation menu on the left) as well as our featured examples for additional usage workflows, including conversion customization, RAG, framework integrations, chunking, serialization, enrichments, and much more!
"},{"location":"usage/advanced_options/","title":"Advanced options","text":""},{"location":"usage/advanced_options/#model-prefetching-and-offline-usage","title":"Model prefetching and offline usage","text":"By default, models are downloaded automatically upon first usage. If you would prefer to explicitly prefetch them for offline use (e.g. in air-gapped environments) you can do that as follows:
Step 1: Prefetch the models
Use the docling-tools models download utility:
$ docling-tools models download\nDownloading layout model...\nDownloading tableformer model...\nDownloading picture classifier model...\nDownloading code formula model...\nDownloading easyocr models...\nModels downloaded into $HOME/.cache/docling/models.\n Alternatively, models can be programmatically downloaded using docling.utils.model_downloader.download_models().
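For instance, a minimal sketch of the programmatic route (assuming default arguments, which store the models in the Docling cache directory shown above):
from docling.utils.model_downloader import download_models\n\n# Prefetch the default Docling models for offline use\ndownload_models()\n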
You can also use the download-hf-repo command to download arbitrary models from Hugging Face by specifying the repo ID:
$ docling-tools models download-hf-repo ds4sd/SmolDocling-256M-preview\nDownloading ds4sd/SmolDocling-256M-preview model from HuggingFace...\n Step 2: Use the prefetched models
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\nartifacts_path = \"/local/path/to/models\"\n\npipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n Or using the CLI:
docling --artifacts-path=\"/local/path/to/models\" FILE\n Or using the DOCLING_ARTIFACTS_PATH environment variable:
export DOCLING_ARTIFACTS_PATH=\"/local/path/to/models\"\npython my_docling_script.py\n"},{"location":"usage/advanced_options/#using-remote-services","title":"Using remote services","text":"The main purpose of Docling is to run local models that do not share any user data with remote services. However, there are valid use cases for processing parts of the pipeline using remote services, for example invoking OCR engines from cloud vendors or using hosted LLMs.
In Docling we decided to allow such models, but we require the user to explicitly opt in to communicating with external services.
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions(enable_remote_services=True)\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n When the value enable_remote_services=True is not set, the system will raise an exception OperationNotAllowed().
Note: This option is only related to the system sending user data to remote services. Control of pulling data (e.g. model weights) follows the logic described in Model prefetching and offline usage.
"},{"location":"usage/advanced_options/#list-of-remote-model-services","title":"List of remote model services","text":"The options in this list require the explicit enable_remote_services=True when processing the documents.
PictureDescriptionApiOptions: Using vision models via API calls. The example file custom_convert.py contains multiple ways one can adjust the conversion pipeline and features.
"},{"location":"usage/advanced_options/#control-pdf-table-extraction-options","title":"Control PDF table extraction options","text":"You can control if table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself. This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions(do_table_structure=True)\npipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between TableFormerMode.FAST (faster but less accurate) and TableFormerMode.ACCURATE (default) to receive better quality with difficult table structures.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode\n\npipeline_options = PdfPipelineOptions(do_table_structure=True)\npipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n"},{"location":"usage/advanced_options/#impose-limits-on-the-document-size","title":"Impose limits on the document size","text":"You can limit the file size and the number of pages that are allowed to be processed per document:
from pathlib import Path\nfrom docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\"\nconverter = DocumentConverter()\nresult = converter.convert(source, max_num_pages=100, max_file_size=20971520)\n"},{"location":"usage/advanced_options/#convert-from-binary-pdf-streams","title":"Convert from binary PDF streams","text":"You can convert PDFs from a binary stream instead of from the filesystem as follows:
from io import BytesIO\nfrom docling.datamodel.base_models import DocumentStream\nfrom docling.document_converter import DocumentConverter\n\nbuf = BytesIO(your_binary_stream)\nsource = DocumentStream(name=\"my_doc.pdf\", stream=buf)\nconverter = DocumentConverter()\nresult = converter.convert(source)\n"},{"location":"usage/advanced_options/#limit-resource-usage","title":"Limit resource usage","text":"You can limit the CPU threads used by Docling by setting the environment variable OMP_NUM_THREADS accordingly. The default setting uses 4 CPU threads.
Docling allows you to enrich the conversion pipeline with additional steps that process specific document components, e.g. code blocks, pictures, etc. The extra steps usually require extra model executions, which may increase the processing time considerably. For this reason most enrichment models are disabled by default.
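For instance, a minimal sketch (assuming the standard PDF pipeline) that switches on several enrichments at once; the individual options are described in the table and sections below:
from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\n# Enable several enrichment models in a single pipeline configuration\npipeline_options = PdfPipelineOptions()\npipeline_options.do_code_enrichment = True\npipeline_options.do_formula_enrichment = True\npipeline_options.do_picture_classification = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n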
The following table provides an overview of the default enrichment models available in Docling.
Feature Parameter Processed item Description Code understanding do_code_enrichment CodeItem See docs below. Formula understanding do_formula_enrichment TextItem with label FORMULA See docs below. Picture classification do_picture_classification PictureItem See docs below. Picture description do_picture_description PictureItem See docs below."},{"location":"usage/enrichments/#enrichments-details","title":"Enrichments details","text":""},{"location":"usage/enrichments/#code-understanding","title":"Code understanding","text":"The code understanding step enables advanced parsing for code blocks found in the document. This enrichment model also sets the code_language property of the CodeItem.
Model specs: see the CodeFormula model card.
Example command line:
docling --enrich-code FILE\n Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_code_enrichment = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n"},{"location":"usage/enrichments/#formula-understanding","title":"Formula understanding","text":"The formula understanding step analyzes the equations in documents and extracts their LaTeX representation. The HTML export functions in the DoclingDocument leverage the formulas and visualize the result using MathML HTML syntax.
Model specs: see the CodeFormula model card.
Example command line:
docling --enrich-formula FILE\n Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_formula_enrichment = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n"},{"location":"usage/enrichments/#picture-classification","title":"Picture classification","text":"The picture classification step classifies the PictureItem elements in the document with the DocumentFigureClassifier model. This model is specialized to understand the classes of pictures found in documents, e.g. different chart types, flow diagrams, logos, signatures, etc.
Model specs: see the DocumentFigureClassifier model card.
Example command line:
docling --enrich-picture-classes FILE\n Example code:
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.generate_picture_images = True\npipeline_options.images_scale = 2\npipeline_options.do_picture_classification = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n"},{"location":"usage/enrichments/#picture-description","title":"Picture description","text":"The picture description step allows you to annotate a picture with a vision model. This is also known as a \"captioning\" task. The Docling pipeline can load and run models completely locally, as well as connect to remote APIs that support the chat template. Below are a few examples of how to use some common vision models and remote services.
from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n"},{"location":"usage/enrichments/#granite-vision-model","title":"Granite Vision model","text":"Model specs: see the ibm-granite/granite-vision-3.1-2b-preview model card.
Usage in Docling:
from docling.datamodel.pipeline_options import granite_picture_description\n\npipeline_options.picture_description_options = granite_picture_description\n"},{"location":"usage/enrichments/#smolvlm-model","title":"SmolVLM model","text":"Model specs: see the HuggingFaceTB/SmolVLM-256M-Instruct model card.
Usage in Docling:
from docling.datamodel.pipeline_options import smolvlm_picture_description\n\npipeline_options.picture_description_options = smolvlm_picture_description\n"},{"location":"usage/enrichments/#other-vision-models","title":"Other vision models","text":"The option class PictureDescriptionVlmOptions allows you to use any other model from the Hugging Face Hub.
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM\n prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n)\n"},{"location":"usage/enrichments/#remote-vision-model","title":"Remote vision model","text":"The option class PictureDescriptionApiOptions allows you to use models hosted on remote platforms, e.g. on local endpoints served by vLLM, Ollama and others, or cloud providers like IBM watsonx.ai, etc.
Note: in most cases this option will send your data to the remote service provider.
Usage in Docling:
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions\n\n# Enable connections to remote services\npipeline_options.enable_remote_services=True # <-- this is required!\n\n# Example using a model running locally, e.g. via VLLM\n# $ vllm serve MODEL_NAME\npipeline_options.picture_description_options = PictureDescriptionApiOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=\"MODEL NAME\",\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n timeout=90,\n)\n End-to-end code snippets for cloud providers are available in the examples section:
Besides the implementations of the models listed above, the Docling documentation also has a few examples dedicated to implementing enrichment models.
This guide describes how to maximize GPU performance for Docling pipelines. It covers device selection and pipeline differences, and provides example snippets for configuring batch size and concurrency in the VLM pipeline on both Linux and Windows.
Note
Improvement and optimization strategies for maximizing GPU performance are an active topic. Check these guidelines regularly for updates.
"},{"location":"usage/gpu/#standard-pipeline","title":"Standard Pipeline","text":"Enable GPU acceleration by configuring the accelerator device and concurrency options using Docling's API:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\n\n# Configure accelerator options for GPU\naccelerator_options = AcceleratorOptions(\n device=AcceleratorDevice.CUDA, # or AcceleratorDevice.AUTO\n)\n Batch size and concurrency for document processing are controlled for each stage of the pipeline as:
from docling.datamodel.pipeline_options import (\n ThreadedPdfPipelineOptions,\n)\n\npipeline_options = ThreadedPdfPipelineOptions(\n ocr_batch_size=64, # default 4\n layout_batch_size=64, # default 4\n table_batch_size=4, # currently not using GPU batching\n)\n Setting a higher page_batch_size will run the Docling models (in particular the layout detection stage) in GPU batch inference mode.
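For instance, a minimal sketch of raising the page batch size through the global settings object (the same object used in the VLM section below; 32 is an illustrative value):
from docling.datamodel.settings import settings\n\n# Feed more pages per batch to the GPU-backed stages (default is 4)\nsettings.perf.page_batch_size = 32\n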
For a complete example see gpu_standard_pipeline.py.
"},{"location":"usage/gpu/#ocr-engines","title":"OCR engines","text":"The current Docling OCR engines rely on third-party libraries, hence GPU support depends on the availability in the respective engines.
The only setup currently known to work is RapidOCR with the torch backend, which can be enabled via:
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.ocr_options = RapidOcrOptions(\n backend=\"torch\",\n)\n More details are available in the GitHub discussion #2451.
"},{"location":"usage/gpu/#vlm-pipeline","title":"VLM Pipeline","text":"For best GPU utilization, use a local inference server. Docling supports inference servers which exposes the OpenAI-compatible chat completion endpoints. For example:
http://localhost:8000/v1/chat/completions (available only on Linux) http://localhost:1234/v1/chat/completions (available both on Linux and Windows) http://localhost:11434/v1/chat/completions (available both on Linux and Windows) Here is an example of how to start the vLLM inference server with optimal parameters for Granite Docling.
vllm serve ibm-granite/granite-docling-258M \\\n --host 127.0.0.1 --port 8000 \\\n --max-num-seqs 512 \\\n --max-num-batched-tokens 8192 \\\n --enable-chunked-prefill \\\n --gpu-memory-utilization 0.9\n"},{"location":"usage/gpu/#configure-docling","title":"Configure Docling","text":"Configure the VLM pipeline using Docling's VLM options:
from docling.datamodel.pipeline_options import VlmPipelineOptions\n\nvlm_options = VlmPipelineOptions(\n enable_remote_services=True,\n vlm_options={\n \"url\": \"http://localhost:8000/v1/chat/completions\", # or any other compatible endpoint\n \"params\": {\n \"model\": \"ibm-granite/granite-docling-258M\",\n \"max_tokens\": 4096,\n },\n \"concurrency\": 64, # default is 1\n \"prompt\": \"Convert this page to docling.\",\n \"timeout\": 90,\n }\n)\n Additionally to the concurrency, we also have to set the page_batch_size Docling parameter. Make sure to set settings.perf.page_batch_size >= vlm_options.concurrency.
from docling.datamodel.settings import settings\n\nsettings.perf.page_batch_size = 64 # default is 4\n"},{"location":"usage/gpu/#complete-example_1","title":"Complete example","text":"For a complete example see gpu_vlm_pipeline.py.
"},{"location":"usage/gpu/#available-models","title":"Available models","text":"Both LM Studio and Ollama rely on llama.cpp as runtime engine. For using this engine, models have to be converted to the gguf format.
Here is a list of known models that are available in GGUF format and how to use them.
TBA.
"},{"location":"usage/gpu/#performance-results","title":"Performance results","text":""},{"location":"usage/gpu/#test-data","title":"Test data","text":"PDF doc ViDoRe V3 HR Num docs 1 14 Num pages 192 1110 Num tables 95 258 Format type PDF Parquet of images"},{"location":"usage/gpu/#test-infrastructure","title":"Test infrastructure","text":"g6e.2xlarge RTX 5090 RTX 5070 Description AWS instanceg6e.2xlarge Linux bare metal machine Windows 11 bare metal machine CPU 8 vCPUs, AMD EPYC 7R13 16 vCPU, AMD Ryzen 7 9800 16 vCPU, AMD Ryzen 7 9800 RAM 64GB 128GB 64GB GPU NVIDIA L40S 48GB NVIDIA GeForce RTX 5090 NVIDIA GeForce RTX 5070 CUDA Version 13.0, driver 580.95.05 13.0, driver 580.105.08 13.0, driver 581.57"},{"location":"usage/gpu/#results","title":"Results","text":"Pipelineg6e.2xlargeRTX 5090RTX 5070 PDF docViDoRe V3 HRPDF docViDoRe V3 HRPDF docViDoRe V3 HR Standard - Inline (no OCR)3.1 pages/second-7.9 pages/second[cpu-only]* 1.5 pages/second-4.2 pages/second[cpu-only]* 1.2 pages/second- VLM - Inference server (GraniteDocling)2.4 pages/second-3.8 pages/second3.6-4.5 pages/second-- * cpu-only timing computed with 16 pytorch threads.
"},{"location":"usage/jobkit/","title":"Jobkit","text":"Docling's document conversion can be executed as distributed jobs using Docling Jobkit.
This library provides:
You can run Jobkit locally via the CLI:
uv run docling-jobkit-local [configuration-file-path]\n The configuration file defines:
Example configuration file:
options: # Example Docling's conversion options\n do_ocr: false \nsources: # Source location (here Google Drive)\n - kind: google_drive\n path_id: 1X6B3j7GWlHfIPSF9VUkasN-z49yo1sGFA9xv55L2hSE\n token_path: \"./dev/google_drive/google_drive_token.json\" \n credentials_path: \"./dev/google_drive/google_drive_credentials.json\" \ntarget: # Target location (here S3)\n kind: s3\n endpoint: localhost:9000\n verify_ssl: false\n bucket: docling-target\n access_key: minioadmin\n secret_key: minioadmin\n"},{"location":"usage/jobkit/#connectors","title":"Connectors","text":"Connectors are used to import documents for processing with Docling and to export results after conversion.
The currently supported connectors are:
To use Google Drive as a source or target, you need to enable the API and set up credentials.
Step 1: Enable the Google Drive API.
Step 2: Create OAuth credentials.
google_drive_credentials.json. Step 3: Add test users.
Step 4: Edit the configuration file.
credentials_path with your path to google_drive_credentials.json. path_id with your source or target location. It can be obtained from the URL as follows: https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5 > folder id is 1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5. https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit > document id is 1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw. Step 5: Authenticate via CLI.
The token is saved to token_path and reused for subsequent runs. New AI trends focus on Agentic AI, an artificial intelligence system that can accomplish a specific goal with limited supervision. Agents can act autonomously to understand, plan, and execute a specific task.
To address the integration problem, the Model Context Protocol (MCP) has emerged as a popular standard for connecting AI applications to external tools.
"},{"location":"usage/mcp/#docling-mcp","title":"Docling MCP","text":"Docling supports the development of AI agents by providing an MCP Server. It allows you to experiment with document processing in different MCP Clients. Adding Docling MCP in your favorite client is usually as simple as adding the following entry in the configuration file:
{\n \"mcpServers\": {\n \"docling\": {\n \"command\": \"uvx\",\n \"args\": [\n \"--from=docling-mcp\",\n \"docling-mcp-server\"\n ]\n }\n }\n}\n When using Claude on your desktop, just edit the config file claude_desktop_config.json with the snippet above or the example provided here.
In LM Studio, edit the mcp.json file with the appropriate section or simply click on the button below for a direct install.
Docling MCP also provides tools specific to certain applications and frameworks. See the Docling MCP Server repository for more details. You will find examples of building agents powered by Docling capabilities and leveraging frameworks like LlamaIndex, Llama Stack, Pydantic AI, or smolagents.
"},{"location":"usage/supported_formats/","title":"Supported formats","text":"Docling can parse various documents formats into a unified representation (Docling Document), which it can export to different formats too \u2014 check out Architecture for more details.
Below you can find a listing of all supported input and output formats.
"},{"location":"usage/supported_formats/#supported-input-formats","title":"Supported input formats","text":"Format Description PDF DOCX, XLSX, PPTX Default formats in MS Office 2007+, based on Office Open XML Markdown AsciiDoc Human-readable, plain-text markup language for structured technical content HTML, XHTML CSV PNG, JPEG, TIFF, BMP, WEBP Image formats WebVTT Web Video Text Tracks format for displaying timed textSchema-specific support:
Format Description USPTO XML XML format followed by USPTO patents JATS XML XML format followed by JATS articles Docling JSON JSON-serialized Docling Document"},{"location":"usage/supported_formats/#supported-output-formats","title":"Supported output formats","text":"Format Description HTML Both image embedding and referencing are supported Markdown JSON Lossless serialization of Docling Document Text Plain text, i.e. without Markdown markers Doctags Markup format for efficiently representing the full content and layout characteristics of a document"},{"location":"usage/vision_models/","title":"Vision models","text":"The VlmPipeline in Docling allows you to convert documents end-to-end using a vision-language model.
Docling supports vision-language models which output:
For running Docling using local models with the VlmPipeline:
docling --pipeline vlm FILE\n See also the example minimal_vlm_pipeline.py.
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n ),\n }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n"},{"location":"usage/vision_models/#available-local-models","title":"Available local models","text":"By default, the vision-language models run locally. Docling allows you to choose between the Hugging Face Transformers framework and MLX (for Apple devices with MPS acceleration).
The following table reports the models currently available out-of-the-box.
Model instance Model Framework Device Num pages Inference time (sec) vlm_model_specs.GRANITEDOCLING_TRANSFORMERS ibm-granite/granite-docling-258M Transformers/AutoModelForVision2Seq MPS 1 - vlm_model_specs.GRANITEDOCLING_MLX ibm-granite/granite-docling-258M-mlx-bf16 MLX MPS 1 - vlm_model_specs.SMOLDOCLING_TRANSFORMERS ds4sd/SmolDocling-256M-preview Transformers/AutoModelForVision2Seq MPS 1 102.212 vlm_model_specs.SMOLDOCLING_MLX ds4sd/SmolDocling-256M-preview-mlx-bf16 MLX MPS 1 6.15453 vlm_model_specs.QWEN25_VL_3B_MLX mlx-community/Qwen2.5-VL-3B-Instruct-bf16 MLX MPS 1 23.4951 vlm_model_specs.PIXTRAL_12B_MLX mlx-community/pixtral-12b-bf16 MLX MPS 1 308.856 vlm_model_specs.GEMMA3_12B_MLX mlx-community/gemma-3-12b-it-bf16 MLX MPS 1 378.486 vlm_model_specs.GRANITE_VISION_TRANSFORMERS ibm-granite/granite-vision-3.2-2b Transformers/AutoModelForVision2Seq MPS 1 104.75 vlm_model_specs.PHI4_TRANSFORMERS microsoft/Phi-4-multimodal-instruct Transformers/AutoModelForCausalLM CPU 1 1175.67 vlm_model_specs.PIXTRAL_12B_TRANSFORMERS mistral-community/pixtral-12b Transformers/AutoModelForVision2Seq CPU 1 1828.21 Inference time is computed on a MacBook M3 Max using the example page tests/data/pdf/2305.03393v1-pg9.pdf. The comparison is done with the example compare_vlm_models.py.
To choose the model, the code snippet above can be extended as follows:
from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel import vlm_model_specs\n\npipeline_options = VlmPipelineOptions(\n vlm_options=vlm_model_specs.SMOLDOCLING_MLX, # <-- change the model here\n)\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n"},{"location":"usage/vision_models/#other-models","title":"Other models","text":"Other models can be configured by directly providing the Hugging Face repo_id, the prompt and a few more options.
For example:
from docling.datamodel.accelerator_options import AcceleratorDevice\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\nfrom docling.datamodel.pipeline_options_vlm_model import InlineVlmOptions, InferenceFramework, ResponseFormat, TransformersModelType\n\npipeline_options = VlmPipelineOptions(\n vlm_options=InlineVlmOptions(\n repo_id=\"ibm-granite/granite-vision-3.2-2b\",\n prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,\n supported_devices=[\n AcceleratorDevice.CPU,\n AcceleratorDevice.CUDA,\n AcceleratorDevice.MPS,\n ],\n scale=2.0,\n temperature=0.0,\n )\n)\n"},{"location":"usage/vision_models/#remote-models","title":"Remote models","text":"In addition to local models, the VlmPipeline allows offloading the inference to a remote service hosting the models. Many remote inference services can be used; the key requirement is that they offer an OpenAI-compatible API. This includes vLLM, Ollama, etc.
More details on how to connect to remote inference services can be found in the following examples: