{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Docling","text":"<p>Docling simplifies document processing, parsing diverse formats \u2014 including advanced PDF understanding \u2014 and providing seamless integrations with the gen AI ecosystem.</p>"},{"location":"#features","title":"Features","text":"<ul> <li>\ud83d\uddc2\ufe0f Parsing of multiple document formats incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more</li> <li>\ud83d\udcd1 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more</li> <li>\ud83e\uddec Unified, expressive DoclingDocument representation format</li> <li>\u21aa\ufe0f Various export formats and options, including Markdown, HTML, DocTags and lossless JSON</li> <li>\ud83d\udd12 Local execution capabilities for sensitive data and air-gapped environments</li> <li>\ud83e\udd16 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI</li> <li>\ud83d\udd0d Extensive OCR support for scanned PDFs and images</li> <li>\ud83d\udc53 Support of several Visual Language Models (SmolDocling)</li> <li>\ud83c\udf99\ufe0f Support for Audio with Automatic Speech Recognition (ASR) models</li> <li>\ud83d\udcbb Simple and convenient CLI</li> </ul>"},{"location":"#coming-soon","title":"Coming soon","text":"<ul> <li>\ud83d\udcdd Metadata extraction, including title, authors, references & language</li> <li>\ud83d\udcdd Chart understanding (Barchart, Piechart, LinePlot, etc)</li> <li>\ud83d\udcdd Complex chemistry understanding (Molecular structures)</li> </ul>"},{"location":"#get-started","title":"Get started","text":"ConceptsLearn Docling fundamentals ExamplesTry out recipes for various use cases, including conversion, RAG, and more IntegrationsCheck out integrations with popular frameworks and tools ReferenceSee more API details"},{"location":"#lf-ai-data","title":"LF AI & Data","text":"<p>Docling is hosted as a project in the LF AI & Data Foundation.</p>"},{"location":"#ibm-open-source-ai","title":"IBM \u2764\ufe0f Open Source AI","text":"<p>The project was started by the AI for knowledge team at IBM Research Zurich.</p>"},{"location":"v2/","title":"V2","text":""},{"location":"v2/#whats-new","title":"What's new","text":"<p>Docling v2 introduces several new features:</p> <ul> <li>Understands and converts PDF, MS Word, MS Powerpoint, HTML and several image formats</li> <li>Produces a new, universal document representation which can encapsulate document hierarchy</li> <li>Comes with a fresh new API and CLI</li> </ul>"},{"location":"v2/#changes-in-docling-v2","title":"Changes in Docling v2","text":""},{"location":"v2/#cli","title":"CLI","text":"<p>We updated the command line syntax of Docling v2 to support many formats. Examples are seen below. 
<pre><code># Convert a single file to Markdown (default)\ndocling myfile.pdf\n\n# Convert a single file to Markdown and JSON, without OCR\ndocling myfile.pdf --to json --to md --no-ocr\n\n# Convert PDF files in input directory to Markdown (default)\ndocling ./input/dir --from pdf\n\n# Convert PDF and Word files in input directory to Markdown and JSON\ndocling ./input/dir --from pdf --from docx --to md --to json --output ./scratch\n\n# Convert all supported files in input directory to Markdown, but abort on first error\ndocling ./input/dir --output ./scratch --abort-on-error\n</code></pre></p> <p>Notable changes from Docling v1:</p> <ul> <li>The standalone switches for different export formats are removed, and replaced with <code>--from</code> and <code>--to</code> arguments, to define input and output formats respectively.</li> <li>The new <code>--abort-on-error</code> switch will abort any batch conversion as soon as an error is encountered.</li> <li>The <code>--backend</code> option for PDFs was removed.</li> </ul>"},{"location":"v2/#setting-up-a-documentconverter","title":"Setting up a <code>DocumentConverter</code>","text":"<p>To accommodate many input formats, we changed the way you need to set up your <code>DocumentConverter</code> object. You can now define a list of allowed formats on the <code>DocumentConverter</code> initialization, and specify custom options per format if desired. By default, all supported formats are allowed. If you don't provide <code>format_options</code>, defaults will be used for all <code>allowed_formats</code>.</p> <p>Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend. They are provided as format-specific types, such as <code>PdfFormatOption</code> or <code>WordFormatOption</code>, as seen below.</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import (\n DocumentConverter,\n PdfFormatOption,\n WordFormatOption,\n)\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\n\n## Default initialization still works as before:\n# doc_converter = DocumentConverter()\n\n\n# previous `PipelineOptions` is now `PdfPipelineOptions`\npipeline_options = PdfPipelineOptions()\npipeline_options.do_ocr = False\npipeline_options.do_table_structure = True\n#...\n\n## Custom options are now defined per format.\ndoc_converter = (\n DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[\n InputFormat.PDF,\n InputFormat.IMAGE,\n InputFormat.DOCX,\n InputFormat.HTML,\n InputFormat.PPTX,\n ], # whitelist formats, non-matching files are ignored.\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options, # pipeline options go here.\n backend=PyPdfiumDocumentBackend # optional: pick an alternative backend\n ),\n InputFormat.DOCX: WordFormatOption(\n pipeline_cls=SimplePipeline # default for office formats and HTML\n ),\n },\n )\n)\n</code></pre> <p>Note: If you work only with defaults, all remains the same as in Docling v1.</p> <p>More options are shown in the following example scripts:</p> <ul> <li>run_with_formats.py</li> <li>custom_convert.py</li> </ul>"},{"location":"v2/#converting-documents","title":"Converting documents","text":"<p>We 
have simplified the way you can feed input to the <code>DocumentConverter</code> and renamed the conversion methods for better semantics. You can now call the conversion directly with a single file, or a list of input files, or <code>DocumentStream</code> objects, without constructing a <code>DocumentConversionInput</code> object first.</p> <ul> <li><code>DocumentConverter.convert</code> now converts a single file input (previously <code>DocumentConverter.convert_single</code>).</li> <li><code>DocumentConverter.convert_all</code> now converts many files at once (previously <code>DocumentConverter.convert</code>).</li> </ul> <p><pre><code>...\nfrom docling.datamodel.document import ConversionResult\n## Convert a single file (from URL or local path)\nconv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Convert several files at once:\n\ninput_files = [\n \"tests/data/html/wiki_duck.html\",\n \"tests/data/docx/word_sample.docx\",\n \"tests/data/docx/lorem_ipsum.docx\",\n \"tests/data/pptx/powerpoint_sample.pptx\",\n \"tests/data/2305.03393v1-pg9-img.png\",\n \"tests/data/pdf/2206.01062.pdf\",\n]\n\n# Directly pass list of files or streams to `convert_all`\nconv_results_iter = doc_converter.convert_all(input_files) # previously `convert`\n</code></pre> Through the <code>raises_on_error</code> argument, you can also control if the conversion should raise exceptions when first encountering a problem, or resiliently convert all files first and reflect errors in each file's conversion status. By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed).</p> <pre><code>...\nconv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False) # previously `convert`\n</code></pre>"},{"location":"v2/#access-document-structures","title":"Access document structures","text":"<p>We have simplified how you can access and export the converted document data, too. Our universal document representation is now available in conversion results as a <code>DoclingDocument</code> object. 
<code>DoclingDocument</code> provides a neat set of APIs to construct, iterate, and export content in the document, as shown below.</p> <pre><code>conv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Inspect the converted document:\nconv_result.document.print_element_tree()\n\n## Iterate the elements in reading order, including hierarchy level:\nfor item, level in conv_result.document.iterate_items():\n if isinstance(item, TextItem):\n print(item.text)\n elif isinstance(item, TableItem):\n table_df: pd.DataFrame = item.export_to_dataframe()\n print(table_df.to_markdown())\n elif ...:\n #...\n</code></pre> <p>Note: While it is deprecated, you can still work with the Docling v1 document representation; it is available as: <pre><code>conv_result.legacy_document # provides the representation in the previous ExportedCCSDocument type\n</code></pre></p>"},{"location":"v2/#export-into-json-markdown-doctags","title":"Export into JSON, Markdown, DocTags","text":"<p>Note: All <code>render_...</code> methods in <code>ConversionResult</code> have been removed in Docling v2, and are now available on <code>DoclingDocument</code> as:</p> <ul> <li><code>DoclingDocument.export_to_dict</code></li> <li><code>DoclingDocument.export_to_markdown</code></li> <li><code>DoclingDocument.export_to_document_tokens</code></li> </ul> <pre><code>conv_result: ConversionResult = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\") # previously `convert_single`\n\n## Export to desired format:\nprint(json.dumps(conv_result.document.export_to_dict()))\nprint(conv_result.document.export_to_markdown())\nprint(conv_result.document.export_to_document_tokens())\n</code></pre> <p>Note: While it is deprecated, you can still export the Docling v1 JSON format. 
This is available through the same methods as on the <code>DoclingDocument</code> type: <pre><code>## Export legacy document representation to desired format, for v1 compatibility:\nprint(json.dumps(conv_result.legacy_document.export_to_dict()))\nprint(conv_result.legacy_document.export_to_markdown())\nprint(conv_result.legacy_document.export_to_document_tokens())\n</code></pre></p>"},{"location":"v2/#reload-a-doclingdocument-stored-as-json","title":"Reload a <code>DoclingDocument</code> stored as JSON","text":"<p>You can save and reload a <code>DoclingDocument</code> to disk in JSON format using the following code:</p> <pre><code># Save to disk:\ndoc: DoclingDocument = conv_result.document # produced from conversion result...\n\nwith Path(\"./doc.json\").open(\"w\") as fp:\n fp.write(json.dumps(doc.export_to_dict())) # use `export_to_dict` to ensure consistency\n\n# Load from disk:\nwith Path(\"./doc.json\").open(\"r\") as fp:\n doc_dict = json.loads(fp.read())\n doc = DoclingDocument.model_validate(doc_dict) # use standard pydantic API to populate doc\n</code></pre>"},{"location":"v2/#chunking","title":"Chunking","text":"<p>Docling v2 defines new base classes for chunking:</p> <ul> <li><code>BaseMeta</code> for chunk metadata</li> <li><code>BaseChunk</code> containing the chunk text and metadata, and</li> <li><code>BaseChunker</code> for chunkers, producing chunks out of a <code>DoclingDocument</code>.</li> </ul> <p>Additionally, it provides an updated <code>HierarchicalChunker</code> implementation, which leverages the new <code>DoclingDocument</code> and provides a new, richer chunk output format, including:</p> <ul> <li>the respective doc items for grounding</li> <li>any applicable headings for context</li> <li>any applicable captions for context</li> </ul> <p>For an example, check out Chunking usage.</p>
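 <p>As a minimal sketch of these base classes in action (assuming the <code>conv_result</code> from the conversion above; the metadata field names follow the description above and may differ across versions): <pre><code>from docling.chunking import HierarchicalChunker\n\nchunker = HierarchicalChunker()\nfor chunk in chunker.chunk(dl_doc=conv_result.document):\n print(chunk.text) # the chunk text (BaseChunk)\n print(chunk.meta.doc_items) # doc items grounding this chunk\n print(chunk.meta.headings) # applicable headings, if any\n</code></pre></p>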
"},{"location":"concepts/","title":"Concepts","text":"<p>Use the navigation on the left to browse through some core Docling concepts.</p>"},{"location":"concepts/architecture/","title":"Architecture","text":"<p>In a nutshell, Docling's architecture is outlined in the diagram above.</p> <p>For each document format, the document converter knows which format-specific backend to employ for parsing the document and which pipeline to use for orchestrating the execution, along with any relevant options.</p> <p>Tip</p> <p>While the document converter holds a default mapping, this configuration is parametrizable, so e.g. for the PDF format, different backends and different pipeline options can be used \u2014 see Usage.</p> <p>The conversion result contains the Docling document, Docling's fundamental document representation.</p> <p>Some typical scenarios for using a Docling document include directly calling its export methods, such as for Markdown, dictionary, etc., or having it serialized by a serializer or chunked by a chunker.</p> <p>For more details on Docling's architecture, check out the Docling Technical Report.</p> <p>Note</p> <p>The components illustrated with dashed outline indicate base classes that can be subclassed for specialized implementations.</p>"},{"location":"concepts/chunking/","title":"Chunking","text":""},{"location":"concepts/chunking/#introduction","title":"Introduction","text":"<p>Chunking approaches</p> <p>Starting from a <code>DoclingDocument</code>, there are in principle two possible chunking approaches:</p> <ol> <li>exporting the <code>DoclingDocument</code> to Markdown (or similar format) and then performing user-defined chunking as a post-processing step, or</li> <li>using native Docling chunkers, i.e. operating directly on the <code>DoclingDocument</code></li> </ol> <p>This page is about the latter, i.e. using native Docling chunkers. For an example of using approach (1), check out e.g. this recipe looking at the Markdown export mode.</p> <p>A chunker is a Docling abstraction that, given a <code>DoclingDocument</code>, returns a stream of chunks, each of which captures some part of the document as a string accompanied by respective metadata.</p> <p>To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a chunker class hierarchy, providing a base type, <code>BaseChunker</code>, as well as specific subclasses.</p> <p>Docling integration with gen AI frameworks like LlamaIndex is done using the <code>BaseChunker</code> interface, so users can easily plug in any built-in, self-defined, or third-party <code>BaseChunker</code> implementation.</p>"},{"location":"concepts/chunking/#base-chunker","title":"Base Chunker","text":"<p>The <code>BaseChunker</code> base class API defines that any chunker should provide the following:</p> <ul> <li><code>def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]</code>: Returning the chunks for the provided document.</li> <li><code>def contextualize(self, chunk: BaseChunk) -> str</code>: Returning the potentially metadata-enriched serialization of the chunk, typically used to feed an embedding model (or generation model).</li> </ul>"},{"location":"concepts/chunking/#hybrid-chunker","title":"Hybrid Chunker","text":"<p>To access <code>HybridChunker</code>:</p> <ul> <li>If you are using the <code>docling</code> package, you can import as follows: <pre><code>from docling.chunking import HybridChunker\n</code></pre></li> <li>If you are only using the <code>docling-core</code> package, make sure to install the <code>chunking</code> extra if you want to use HuggingFace tokenizers, e.g. <pre><code>pip install 'docling-core[chunking]'\n</code></pre> or the <code>chunking-openai</code> extra if you prefer Open AI tokenizers (tiktoken), e.g. 
<pre><code>pip install 'docling-core[chunking-openai]'\n</code></pre> and then you can import as follows: <pre><code>from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\n</code></pre></li> </ul> <p>The <code>HybridChunker</code> implementation uses a hybrid approach, applying tokenization-aware refinements on top of document-based hierarchical chunking.</p> <p>More precisely, it starts from the result of the hierarchical chunker and, based on the user-provided tokenizer (typically to be aligned to the embedding model tokenizer), it:</p> <ul> <li>does one pass where it splits chunks only when needed (i.e. oversized w.r.t. tokens), &</li> <li>does another pass where it merges chunks only when possible (i.e. undersized successive chunks with same headings & captions) \u2014 users can opt out of this step via param <code>merge_peers</code> (by default <code>True</code>)</li> </ul> <p>\ud83d\udc49 Usage examples:</p> <ul> <li>Hybrid chunking</li> <li>Advanced chunking & serialization</li> </ul>
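 <p>A minimal sketch of the above (assuming a <code>tokenizer</code> built as shown above and a converted <code>doc</code>; <code>merge_peers</code> is the parameter mentioned above): <pre><code>from docling.chunking import HybridChunker\n\nchunker = HybridChunker(\n tokenizer=tokenizer, # typically aligned to your embedding model\n merge_peers=True, # set to False to opt out of the merge pass\n)\nchunks = list(chunker.chunk(dl_doc=doc))\nprint(chunker.contextualize(chunk=chunks[0])) # metadata-enriched text, e.g. for embedding\n</code></pre></p>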
"},{"location":"concepts/chunking/#hierarchical-chunker","title":"Hierarchical Chunker","text":"<p>The <code>HierarchicalChunker</code> implementation uses the document structure information from the <code>DoclingDocument</code> to create one chunk for each individual detected document element, by default only merging together list items (this can be disabled via param <code>merge_list_items</code>). It also takes care of attaching all relevant document metadata, including headers and captions.</p>"},{"location":"concepts/confidence_scores/","title":"Confidence Scores","text":""},{"location":"concepts/confidence_scores/#introduction","title":"Introduction","text":"<p>Confidence grades were introduced in v2.34.0 to help users understand how well a conversion performed and guide decisions about post-processing workflows. They are available in the <code>confidence</code> field of the <code>ConversionResult</code> object returned by the document converter.</p>"},{"location":"concepts/confidence_scores/#purpose","title":"Purpose","text":"<p>Complex layouts, poor scan quality, or challenging formatting can lead to suboptimal document conversion results that may require additional attention or alternative conversion pipelines.</p> <p>Confidence scores provide a quantitative assessment of document conversion quality. Each confidence report includes a numerical score (0.0 to 1.0) measuring conversion accuracy, and a quality grade (poor, fair, good, excellent) for quick interpretation.</p> <p>Focus on quality grades!</p> <p>Users can and should safely focus on the document-level grade fields \u2014 <code>mean_grade</code> and <code>low_grade</code> \u2014 to assess overall conversion quality. Numerical scores are used internally and are for informational purposes only; their computation and weighting may change in the future.</p> <p>Use cases for confidence grades include:</p> <ul> <li>Identify documents requiring manual review after conversion</li> <li>Adjust conversion pipelines to the most appropriate one for each document type</li> <li>Set confidence thresholds for unattended batch conversions</li> <li>Catch potential conversion issues early in your workflow</li> </ul>"},{"location":"concepts/confidence_scores/#concepts","title":"Concepts","text":""},{"location":"concepts/confidence_scores/#scores-and-grades","title":"Scores and grades","text":"<p>A confidence report contains scores and grades:</p> <ul> <li>Scores: Numerical values between 0.0 and 1.0, where higher values indicate better conversion quality, for internal use only</li> <li>Grades: Categorical quality assessments based on score thresholds, used to assess the overall conversion confidence:<ul> <li><code>POOR</code></li> <li><code>FAIR</code></li> <li><code>GOOD</code></li> <li><code>EXCELLENT</code></li> </ul></li> </ul>"},{"location":"concepts/confidence_scores/#types-of-confidence-calculated","title":"Types of confidence calculated","text":"<p>Each confidence report includes four component scores and grades:</p> <ul> <li><code>layout_score</code>: Overall quality of document element recognition</li> <li><code>ocr_score</code>: Quality of OCR-extracted content</li> <li><code>parse_score</code>: 10th percentile score of digital text cells (emphasizes problem areas)</li> <li><code>table_score</code>: Table extraction quality (not yet implemented)</li> </ul>"},{"location":"concepts/confidence_scores/#summary-grades","title":"Summary grades","text":"<p>Two aggregate grades provide overall document quality assessment:</p> <ul> <li><code>mean_grade</code>: Average of the four component scores</li> <li><code>low_grade</code>: 5th percentile score (highlights worst-performing areas)</li> </ul>"},{"location":"concepts/confidence_scores/#page-level-vs-document-level","title":"Page-level vs document-level","text":"<p>Confidence grades are calculated at two levels:</p> <ul> <li>Page-level: Individual scores and grades for each page, stored in the <code>pages</code> field</li> <li>Document-level: Overall scores and grades for the entire document, calculated as averages of the page-level grades and stored in identically named fields in the root <code>ConfidenceReport</code></li> </ul>"},{"location":"concepts/confidence_scores/#example","title":"Example","text":"
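<p>A minimal sketch of reading confidence grades after a conversion (assuming the fields described above, and treating <code>pages</code> as a mapping from page number to a per-page report): <pre><code>from docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\nconv_result = converter.convert(\"https://arxiv.org/pdf/2408.09869\")\n\nconfidence = conv_result.confidence # document-level report\nprint(confidence.mean_grade, confidence.low_grade) # the grades to act on\n\n# Page-level reports:\nfor page_no, page_report in confidence.pages.items():\n print(page_no, page_report.mean_grade)\n</code></pre></p>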
"},{"location":"concepts/docling_document/","title":"Docling Document","text":"<p>With Docling v2, we introduce a unified document representation format called <code>DoclingDocument</code>. It is defined as a pydantic datatype, which can express several features common to documents, such as:</p> <ul> <li>Text, Tables, Pictures, and more</li> <li>Document hierarchy with sections and groups</li> <li>Disambiguation between main body and headers, footers (furniture)</li> <li>Layout information (i.e. bounding boxes) for all items, if available</li> <li>Provenance information</li> </ul> <p>The definition of the Pydantic types is implemented in the module <code>docling_core.types.doc</code>; more details can be found in the source code definitions.</p> <p>It also brings a set of document construction APIs to build up a <code>DoclingDocument</code> from scratch.</p>"},{"location":"concepts/docling_document/#example-document-structures","title":"Example document structures","text":"<p>To illustrate the features of the <code>DoclingDocument</code> format, in the subsections below we consider the <code>DoclingDocument</code> converted from <code>tests/data/word_sample.docx</code> and we present some side-by-side comparisons, where the left side shows snippets from the converted document serialized as YAML and the right one shows the corresponding parts of the original MS Word document.</p>"},{"location":"concepts/docling_document/#basic-structure","title":"Basic structure","text":"<p>A <code>DoclingDocument</code> exposes top-level fields for the document content, organized in two categories. The first category is the content items, which are stored in these fields:</p> <ul> <li><code>texts</code>: All items that have a text representation (paragraph, section heading, equation, ...). Base class is <code>TextItem</code>.</li> <li><code>tables</code>: All tables, type <code>TableItem</code>. Can carry structure annotations.</li> <li><code>pictures</code>: All pictures, type <code>PictureItem</code>. Can carry structure annotations.</li> <li><code>key_value_items</code>: All key-value items.</li> </ul> <p>All of the above fields are lists and store items inheriting from the <code>DocItem</code> type. They can express different data structures depending on their type, and reference parents and children through JSON pointers.</p> <p>The second category is content structure, which is encapsulated in:</p> <ul> <li><code>body</code>: The root node of a tree-structure for the main document body</li> <li><code>furniture</code>: The root node of a tree-structure for all items that don't belong in the body (headers, footers, ...)</li> <li><code>groups</code>: A set of items that don't represent content, but act as containers for other content items (e.g. a list, a chapter)</li> </ul> <p>All of the above fields store only <code>NodeItem</code> instances, which reference children and parents through JSON pointers.</p> <p>The reading order of the document is encapsulated through the <code>body</code> tree and the order of children in each item in the tree.</p> <p>The example below shows how all items on the first page are nested below the <code>title</code> item (<code>#/texts/1</code>).</p> <p></p>"},{"location":"concepts/docling_document/#grouping","title":"Grouping","text":"<p>The example below shows how all items under the heading \"Let's swim\" (<code>#/texts/5</code>) are nested as children. The children of \"Let's swim\" are both text items and groups, which contain the list elements. The group items are stored in the top-level <code>groups</code> field.</p> <p></p>
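 <p>A minimal sketch of how the flat item lists and the <code>body</code> tree fit together (assuming a <code>doc</code> loaded or converted as shown earlier): <pre><code>from docling_core.types.doc.document import DoclingDocument\n\ndoc = DoclingDocument.load_from_json(\"./doc.json\") # e.g. saved earlier via export_to_dict\n\n# Content items live in flat top-level lists:\nprint(len(doc.texts), len(doc.tables), len(doc.pictures))\n\n# Reading order comes from the body tree; iterate_items walks it,\n# yielding each item together with its hierarchy level:\nfor item, level in doc.iterate_items():\n print(\" \" * level, item.self_ref) # self_ref is a JSON pointer, e.g. '#/texts/1'\n</code></pre></p>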
"},{"location":"concepts/plugins/","title":"Plugins","text":"<p>Docling can be extended with third-party plugins, which extend the choice of options provided in several steps of the pipeline.</p> <p>Plugins are loaded via the pluggy system, which allows third-party developers to register new capabilities using a setuptools entrypoint.</p> <p>The actual entrypoint definition might vary, depending on the packaging system you are using. Here are a few examples:</p> <p>pyproject.toml:</p> <pre><code>[project.entry-points.\"docling\"]\nyour_plugin_name = \"your_package.module\"\n</code></pre> <p>poetry v1 pyproject.toml:</p> <pre><code>[tool.poetry.plugins.\"docling\"]\nyour_plugin_name = \"your_package.module\"\n</code></pre> <p>setup.cfg:</p> <pre><code>[options.entry_points]\ndocling =\n your_plugin_name = your_package.module\n</code></pre> <p>setup.py:</p> <pre><code>from setuptools import setup\n\nsetup(\n # ...,\n entry_points = {\n 'docling': [\n 'your_plugin_name = \"your_package.module\"'\n ]\n }\n)\n</code></pre> <ul> <li><code>your_plugin_name</code> is the name you choose for your plugin. This must be unique across the broader Docling ecosystem.</li> <li><code>your_package.module</code> is the reference to the module in your package which is responsible for the plugin registration.</li> </ul>"},{"location":"concepts/plugins/#plugin-factories","title":"Plugin factories","text":""},{"location":"concepts/plugins/#ocr-factory","title":"OCR factory","text":"<p>The OCR factory allows providing additional OCR engines to Docling users.</p> <p>The content of <code>your_package.module</code> registers the OCR engines with code similar to:</p> <pre><code># Factory registration\ndef ocr_engines():\n return {\n \"ocr_engines\": [\n YourOcrModel,\n ]\n }\n</code></pre> <p>where <code>YourOcrModel</code> must implement <code>BaseOcrModel</code> and provide an options class derived from <code>OcrOptions</code>.</p> <p>If you are looking for an example, the default Docling plugins are a good starting point.</p>"},{"location":"concepts/plugins/#third-party-plugins","title":"Third-party plugins","text":"<p>When the plugin is not provided by the main <code>docling</code> package but by a third-party package, it has to be enabled explicitly via the <code>allow_external_plugins</code> option.</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions()\npipeline_options.allow_external_plugins = True # <-- enable external plugins\npipeline_options.ocr_options = YourOptions # <-- your options here\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options\n )\n }\n)\n</code></pre>"},{"location":"concepts/plugins/#using-the-docling-cli","title":"Using the <code>docling</code> CLI","text":"<p>Similarly, when using the <code>docling</code> CLI, users have to enable external plugins before selecting the new one.</p> <pre><code># Show the external plugins\ndocling --show-external-plugins\n\n# Run docling with the new plugin\ndocling --allow-external-plugins --ocr-engine=NAME\n</code></pre>"},{"location":"concepts/serialization/","title":"Serialization","text":""},{"location":"concepts/serialization/#introduction","title":"Introduction","text":"<p>A document serializer (AKA simply serializer) is a Docling abstraction that is initialized with a given <code>DoclingDocument</code> and returns a textual representation for that document.</p> <p>Besides the document serializer, Docling defines similar abstractions for several document subcomponents, for example: text serializer, table serializer, picture serializer, list serializer, inline serializer, and more.</p> <p>Last but not least, a serializer provider is a wrapper that abstracts the document serialization strategy from the document instance.</p>
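 <p>A minimal sketch of the serializer pattern (using the predefined Markdown serializer introduced in the sections below; <code>doc</code> is assumed to be a converted <code>DoclingDocument</code>): <pre><code>from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\n\nserializer = MarkdownDocSerializer(doc=doc) # initialized with a DoclingDocument\nser_result = serializer.serialize() # textual representation plus metadata\nprint(ser_result.text)\n</code></pre></p>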
instance.</p>"},{"location":"concepts/serialization/#base-classes","title":"Base classes","text":"<p>To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a serialization class hierarchy, providing:</p> <ul> <li>base types for the above abstractions: <code>BaseDocSerializer</code>, as well as <code>BaseTextSerializer</code>, <code>BaseTableSerializer</code> etc, and <code>BaseSerializerProvider</code>, and</li> <li>specific subclasses for the above-mentioned base types, e.g. <code>MarkdownDocSerializer</code>.</li> </ul> <p>You can review all methods required to define the above base classes here.</p> <p>From a client perspective, the most relevant is <code>BaseDocSerializer.serialize()</code>, which returns the textual representation,\u00a0as well as relevant metadata on which document components contributed to that serialization.</p>"},{"location":"concepts/serialization/#use-in-doclingdocument-export-methods","title":"Use in <code>DoclingDocument</code> export methods","text":"<p>Docling provides predefined serializers for Markdown, HTML, and DocTags.</p> <p>The respective <code>DoclingDocument</code> export methods (e.g. <code>export_to_markdown()</code>) are provided as user shorthands \u2014 internally directly instantiating and delegating to respective serializers.</p>"},{"location":"concepts/serialization/#examples","title":"Examples","text":"<p>For an example showcasing how to use serializers, see here.</p>"},{"location":"examples/","title":"Examples","text":"<p>Use the navigation on the left to browse through examples covering a range of possible workflows and use cases.</p>"},{"location":"examples/advanced_chunking_and_serialization/","title":"Advanced chunking & serialization","text":"<p>In this notebook we show how to customize the serialization strategies that come into play during chunking.</p> <p>We will work with a document that contains some picture annotations:</p> In\u00a0[1]: Copied! <pre>from docling_core.types.doc.document import DoclingDocument\n\nSOURCE = \"./data/2408.09869v3_enriched.json\"\n\ndoc = DoclingDocument.load_from_json(SOURCE)\n</pre> from docling_core.types.doc.document import DoclingDocument SOURCE = \"./data/2408.09869v3_enriched.json\" doc = DoclingDocument.load_from_json(SOURCE) <p>Below we define the chunker (for more details check out Hybrid Chunking):</p> In\u00a0[2]: Copied! <pre>from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\nfrom docling_core.transforms.chunker.tokenizer.base import BaseTokenizer\nfrom docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom transformers import AutoTokenizer\n\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n\ntokenizer: BaseTokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n)\nchunker = HybridChunker(tokenizer=tokenizer)\n</pre> from docling_core.transforms.chunker.hybrid_chunker import HybridChunker from docling_core.transforms.chunker.tokenizer.base import BaseTokenizer from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" tokenizer: BaseTokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), ) chunker = HybridChunker(tokenizer=tokenizer) In\u00a0[3]: Copied! 
<pre>print(f\"{tokenizer.get_max_tokens()=}\")\n</pre> print(f\"{tokenizer.get_max_tokens()=}\") <pre>tokenizer.get_max_tokens()=512\n</pre> <p>Defining some helper methods:</p> In\u00a0[4]: Copied! <pre>from typing import Iterable, Optional\n\nfrom docling_core.transforms.chunker.base import BaseChunk\nfrom docling_core.transforms.chunker.hierarchical_chunker import DocChunk\nfrom docling_core.types.doc.labels import DocItemLabel\nfrom rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console(\n width=200, # for getting Markdown tables rendered nicely\n)\n\n\ndef find_n_th_chunk_with_label(\n iter: Iterable[BaseChunk], n: int, label: DocItemLabel\n) -> Optional[DocChunk]:\n num_found = -1\n for i, chunk in enumerate(iter):\n doc_chunk = DocChunk.model_validate(chunk)\n for it in doc_chunk.meta.doc_items:\n if it.label == label:\n num_found += 1\n if num_found == n:\n return i, chunk\n return None, None\n\n\ndef print_chunk(chunks, chunk_pos):\n chunk = chunks[chunk_pos]\n ctx_text = chunker.contextualize(chunk=chunk)\n num_tokens = tokenizer.count_tokens(text=ctx_text)\n doc_items_refs = [it.self_ref for it in chunk.meta.doc_items]\n title = f\"{chunk_pos=} {num_tokens=} {doc_items_refs=}\"\n console.print(Panel(ctx_text, title=title))\n</pre> from typing import Iterable, Optional from docling_core.transforms.chunker.base import BaseChunk from docling_core.transforms.chunker.hierarchical_chunker import DocChunk from docling_core.types.doc.labels import DocItemLabel from rich.console import Console from rich.panel import Panel console = Console( width=200, # for getting Markdown tables rendered nicely ) def find_n_th_chunk_with_label( iter: Iterable[BaseChunk], n: int, label: DocItemLabel ) -> Optional[DocChunk]: num_found = -1 for i, chunk in enumerate(iter): doc_chunk = DocChunk.model_validate(chunk) for it in doc_chunk.meta.doc_items: if it.label == label: num_found += 1 if num_found == n: return i, chunk return None, None def print_chunk(chunks, chunk_pos): chunk = chunks[chunk_pos] ctx_text = chunker.contextualize(chunk=chunk) num_tokens = tokenizer.count_tokens(text=ctx_text) doc_items_refs = [it.self_ref for it in chunk.meta.doc_items] title = f\"{chunk_pos=} {num_tokens=} {doc_items_refs=}\" console.print(Panel(ctx_text, title=title)) <p>Below we inspect the first chunk containing a table \u2014 using the default serialization strategy:</p> In\u00a0[5]: Copied! <pre>chunker = HybridChunker(tokenizer=tokenizer)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\n</pre> chunker = HybridChunker(tokenizer=tokenizer) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE) print_chunk( chunks=chunks, chunk_pos=i, ) <pre>Token indices sequence length is longer than the specified maximum sequence length for this model (652 > 512). 
Running this sequence through the model will result in indexing errors\n</pre> <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=13 num_tokens=426 doc_items_refs=['#/texts/72', '#/tables/0'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 4 Performance \u2502\n\u2502 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u2502\n\u2502 (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. \u2502\n\u2502 \u2502\n\u2502 Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max, \u2502\n\u2502 pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 \u2502\n\u2502 cores) Intel(R) Xeon E5-2690, native backend.TTS = 375 s 244 s. (16 cores) Intel(R) Xeon E5-2690, native backend.Pages/s = 0.60 0.92. (16 cores) Intel(R) Xeon E5-2690, native backend.Mem = 6.16 \u2502\n\u2502 GB. (16 cores) Intel(R) Xeon E5-2690, pypdfium backend.TTS = 239 s 143 s. (16 cores) Intel(R) Xeon E5-2690, pypdfium backend.Pages/s = 0.94 1.57. 
(16 cores) Intel(R) Xeon E5-2690, pypdfium \u2502\n\u2502 backend.Mem = 2.42 GB \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> INFO: As you see above, using the <code>HybridChunker</code> can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" \u2014 for details check here. <p>We can configure a different serialization strategy. In the example below, we specify a different table serializer that serializes tables to Markdown instead of the triplet notation used by default:</p> In\u00a0[6]: Copied! <pre>from docling_core.transforms.chunker.hierarchical_chunker import (\n ChunkingDocSerializer,\n ChunkingSerializerProvider,\n)\nfrom docling_core.transforms.serializer.markdown import MarkdownTableSerializer\n\n\nclass MDTableSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n table_serializer=MarkdownTableSerializer(), # configuring a different table serializer\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=MDTableSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\n</pre> from docling_core.transforms.chunker.hierarchical_chunker import ( ChunkingDocSerializer, ChunkingSerializerProvider, ) from docling_core.transforms.serializer.markdown import MarkdownTableSerializer class MDTableSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, table_serializer=MarkdownTableSerializer(), # configuring a different table serializer ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=MDTableSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE) print_chunk( chunks=chunks, chunk_pos=i, ) 
<pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=13 num_tokens=431 doc_items_refs=['#/texts/72', '#/tables/0'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 4 Performance \u2502\n\u2502 Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution \u2502\n\u2502 (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads. \u2502\n\u2502 \u2502\n\u2502 | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | \u2502\n\u2502 |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| \u2502\n\u2502 | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | \u2502\n\u2502 | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | \u2502\n\u2502 | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <p>Below we inspect the first chunk containing a picture.</p> <p>Even when using the default strategy, we can modify the relevant parameters, e.g. which placeholder is used for pictures:</p> In\u00a0[7]: Copied! 
<pre>from docling_core.transforms.serializer.markdown import MarkdownParams\n\n\nclass ImgPlaceholderSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc):\n return ChunkingDocSerializer(\n doc=doc,\n params=MarkdownParams(\n image_placeholder=\"<!-- image -->\",\n ),\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=ImgPlaceholderSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\n</pre> from docling_core.transforms.serializer.markdown import MarkdownParams class ImgPlaceholderSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc): return ChunkingDocSerializer( doc=doc, params=MarkdownParams( image_placeholder=\"\", ), ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=ImgPlaceholderSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE) print_chunk( chunks=chunks, chunk_pos=i, ) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=0 num_tokens=117 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 <!-- image --> \u2502\n\u2502 Version 1.0 \u2502\n\u2502 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta \u2502\n\u2502 Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. 
Staar \u2502\n\u2502 AI4K Group, IBM Research R\u00a8 uschlikon, Switzerland \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <p>Below we define and use our custom picture serialization strategy which leverages picture annotations:</p> In\u00a0[8]: Copied! <pre>from typing import Any\n\nfrom docling_core.transforms.serializer.base import (\n BaseDocSerializer,\n SerializationResult,\n)\nfrom docling_core.transforms.serializer.common import create_ser_result\nfrom docling_core.transforms.serializer.markdown import MarkdownPictureSerializer\nfrom docling_core.types.doc.document import (\n PictureClassificationData,\n PictureDescriptionData,\n PictureItem,\n PictureMoleculeData,\n)\nfrom typing_extensions import override\n\n\nclass AnnotationPictureSerializer(MarkdownPictureSerializer):\n @override\n def serialize(\n self,\n *,\n item: PictureItem,\n doc_serializer: BaseDocSerializer,\n doc: DoclingDocument,\n **kwargs: Any,\n ) -> SerializationResult:\n text_parts: list[str] = []\n for annotation in item.annotations:\n if isinstance(annotation, PictureClassificationData):\n predicted_class = (\n annotation.predicted_classes[0].class_name\n if annotation.predicted_classes\n else None\n )\n if predicted_class is not None:\n text_parts.append(f\"Picture type: {predicted_class}\")\n elif isinstance(annotation, PictureMoleculeData):\n text_parts.append(f\"SMILES: {annotation.smi}\")\n elif isinstance(annotation, PictureDescriptionData):\n text_parts.append(f\"Picture description: {annotation.text}\")\n\n text_res = \"\\n\".join(text_parts)\n text_res = doc_serializer.post_process(text=text_res)\n return create_ser_result(text=text_res, span_source=item)\n</pre> from typing import Any from docling_core.transforms.serializer.base import ( BaseDocSerializer, SerializationResult, ) from docling_core.transforms.serializer.common import create_ser_result from docling_core.transforms.serializer.markdown import MarkdownPictureSerializer from docling_core.types.doc.document import ( PictureClassificationData, PictureDescriptionData, PictureItem, PictureMoleculeData, ) from typing_extensions import override class AnnotationPictureSerializer(MarkdownPictureSerializer): @override def serialize( self, *, item: PictureItem, doc_serializer: BaseDocSerializer, doc: DoclingDocument, **kwargs: Any, ) -> SerializationResult: text_parts: list[str] = [] for annotation in 
item.annotations: if isinstance(annotation, PictureClassificationData): predicted_class = ( annotation.predicted_classes[0].class_name if annotation.predicted_classes else None ) if predicted_class is not None: text_parts.append(f\"Picture type: {predicted_class}\") elif isinstance(annotation, PictureMoleculeData): text_parts.append(f\"SMILES: {annotation.smi}\") elif isinstance(annotation, PictureDescriptionData): text_parts.append(f\"Picture description: {annotation.text}\") text_res = \"\\n\".join(text_parts) text_res = doc_serializer.post_process(text=text_res) return create_ser_result(text=text_res, span_source=item) In\u00a0[9]: Copied! <pre>class ImgAnnotationSerializerProvider(ChunkingSerializerProvider):\n def get_serializer(self, doc: DoclingDocument):\n return ChunkingDocSerializer(\n doc=doc,\n picture_serializer=AnnotationPictureSerializer(), # configuring a different picture serializer\n )\n\n\nchunker = HybridChunker(\n tokenizer=tokenizer,\n serializer_provider=ImgAnnotationSerializerProvider(),\n)\n\nchunk_iter = chunker.chunk(dl_doc=doc)\n\nchunks = list(chunk_iter)\ni, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)\nprint_chunk(\n chunks=chunks,\n chunk_pos=i,\n)\n</pre> class ImgAnnotationSerializerProvider(ChunkingSerializerProvider): def get_serializer(self, doc: DoclingDocument): return ChunkingDocSerializer( doc=doc, picture_serializer=AnnotationPictureSerializer(), # configuring a different picture serializer ) chunker = HybridChunker( tokenizer=tokenizer, serializer_provider=ImgAnnotationSerializerProvider(), ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE) print_chunk( chunks=chunks, chunk_pos=i, ) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 chunk_pos=0 num_tokens=128 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Docling Technical Report \u2502\n\u2502 Picture description: In this image we can see a cartoon image of a duck holding a paper. \u2502\n\u2502 Version 1.0 \u2502\n\u2502 Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta \u2502\n\u2502 Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. 
Staar \u2502\n\u2502 AI4K Group, IBM Research R\u00a8 uschlikon, Switzerland \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> In\u00a0[\u00a0]: Copied! <pre>\n</pre>"},{"location":"examples/advanced_chunking_and_serialization/#advanced-chunking-serialization","title":"Advanced chunking & serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#overview","title":"Overview\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#setup","title":"Setup\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#table-serialization","title":"Table serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-the-default-strategy","title":"Using the default strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#configuring-a-different-strategy","title":"Configuring a different strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#picture-serialization","title":"Picture serialization\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-the-default-strategy","title":"Using the default strategy\u00b6","text":""},{"location":"examples/advanced_chunking_and_serialization/#using-a-custom-strategy","title":"Using a custom strategy\u00b6","text":""},{"location":"examples/backend_csv/","title":"Conversion of CSV files","text":"In\u00a0[59]: Copied! 
<pre>from pathlib import Path\n\nfrom docling.document_converter import DocumentConverter\n\n# Convert CSV to Docling document\nconverter = DocumentConverter()\nresult = converter.convert(Path(\"../../tests/data/csv/csv-comma.csv\"))\noutput = result.document.export_to_markdown()\n</pre> from pathlib import Path from docling.document_converter import DocumentConverter # Convert CSV to Docling document converter = DocumentConverter() result = converter.convert(Path(\"../../tests/data/csv/csv-comma.csv\")) output = result.document.export_to_markdown() <p>This code generates the following output:</p> <pre>| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |\n|---|---|---|---|---|---|---|---|---|---|---|---|\n| 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |\n| 2 | 1Ef7b82A4CAAD10 | Preston | Lozano, Dr | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |\n| 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)978-3969x58947 | beckycarr@hogan.com | 2020-03-25 | http://www.lawrence.com/ |\n| 4 | 5Cef8BFA16c5e3c | Linda | Olsen | Dominguez, Mcmillan and Donovan | Bensonview | Dominican Republic | 001-808-617-6467x12895 | +1-813-324-8756 | stanleyblackwell@benson.org | 2020-06-02 | http://www.good-lyons.com/ |\n| 5 | 053d585Ab6b3159 | Joanna | Bender | Martin, Lang and Andrade | West Priscilla | Slovakia (Slovak Republic) | 001-234-203-0635x76146 | 001-199-446-3860x3486 | colinalvarado@miles.net | 2021-04-17 | https://goodwin-ingram.com/ |\n</pre>"},{"location":"examples/backend_csv/#conversion-of-csv-files","title":"Conversion of CSV files\u00b6","text":"<p>This example shows how to convert CSV files to a structured Docling Document.</p> <ul> <li>Multiple delimiters are supported: <code>,</code> <code>;</code> <code>|</code> <code>[tab]</code></li> <li>Additional CSV dialect settings are detected automatically (e.g. quotes, line separator, escape character)</li> </ul>"},{"location":"examples/backend_csv/#example-code","title":"Example Code\u00b6","text":""},{"location":"examples/backend_xml_rag/","title":"Conversion of custom XML","text":"<pre>| Step | Tech | Execution |\n|---|---|---|\n| Embedding | Hugging Face / Sentence Transformers | \ud83d\udcbb Local |\n| Vector store | Milvus | \ud83d\udcbb Local |\n| Gen AI | Hugging Face Inference API | \ud83c\udf10 Remote |\n</pre> <p>This is an example of using Docling for converting structured data (XML) into a unified document representation format, <code>DoclingDocument</code>, and leveraging its rich structured content for RAG applications.</p> <p>The data used in this example consists of patents from the United States Patent and Trademark Office (USPTO) and medical articles from PubMed Central\u00ae (PMC).</p> <p>In this notebook, we accomplish the following:</p> <ul> <li>Simple conversion of supported XML files in a nutshell</li> <li>An end-to-end application using public collections of XML files supported by Docling<ul> <li>Set up the API access for generative AI</li> <li>Fetch the data from USPTO and PubMed Central\u00ae sites, using Docling custom backends</li> <li>Parse, chunk, and index the documents in a vector database</li> <li>Perform RAG using LlamaIndex Docling extension</li> </ul> </li> </ul> <p>For more details on document chunking with Docling, refer to the Chunking documentation. For RAG with Docling and LlamaIndex, also check the example RAG with LlamaIndex.</p> In\u00a0[1]: Copied! 
<pre>from docling.document_converter import DocumentConverter\n\n# a sample PMC article:\nsource = \"../../tests/data/jats/elife-56337.nxml\"\nconverter = DocumentConverter()\nresult = converter.convert(source)\nprint(result.status)\n</pre> from docling.document_converter import DocumentConverter # a sample PMC article: source = \"../../tests/data/jats/elife-56337.nxml\" converter = DocumentConverter() result = converter.convert(source) print(result.status) <pre>ConversionStatus.SUCCESS\n</pre> <p>Once the document is converted, it can be exported to any format supported by Docling. For instance, to markdown (showing here the first lines only):</p> In\u00a0[2]: Copied! <pre>md_doc = result.document.export_to_markdown()\n\ndelim = \"\\n\"\nprint(delim.join(md_doc.split(delim)[:8]))\n</pre> md_doc = result.document.export_to_markdown() delim = \"\\n\" print(delim.join(md_doc.split(delim)[:8])) <pre># KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage\n\nGernot Wolf, Alberto de Iaco, Ming-An Sun, Melania Bruno, Matthew Tinkham, Don Hoang, Apratim Mitra, Sherry Ralls, Didier Trono, Todd S Macfarlan\n\nThe Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health, Bethesda, United States; School of Life Sciences, \u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Lausanne, Switzerland\n\n## Abstract\n\n</pre> <p>If the XML file is not supported, a <code>ConversionError</code> exception will be raised.</p> In\u00a0[3]: Copied! <pre>from io import BytesIO\n\nfrom docling.datamodel.base_models import DocumentStream\nfrom docling.exceptions import ConversionError\n\nxml_content = (\n b'<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE docling_test SYSTEM '\n b'\"test.dtd\"><docling>Random content</docling>'\n)\nstream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content))\ntry:\n result = converter.convert(stream)\nexcept ConversionError as ce:\n print(ce)\n</pre> from io import BytesIO from docling.datamodel.base_models import DocumentStream from docling.exceptions import ConversionError xml_content = ( b'<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE docling_test SYSTEM ' b'\"test.dtd\"><docling>Random content</docling>' ) stream = DocumentStream(name=\"docling_test.xml\", stream=BytesIO(xml_content)) try: result = converter.convert(stream) except ConversionError as ce: print(ce) <pre>Input document docling_test.xml does not match any allowed format.\n</pre> <pre>File format not allowed: docling_test.xml\n</pre> <p>You can always refer to the Usage documentation page for a list of supported formats.</p> <p>Requirements can be installed as shown below. The <code>--no-warn-conflicts</code> argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage.</p> In\u00a0[4]: Copied! <pre>%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv\n</pre> %pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> <p>This notebook uses HuggingFace's Inference API. 
For an increased LLM quota, a token can be provided via the environment variable <code>HF_TOKEN</code>.</p> <p>If you're running this notebook in Google Colab, make sure you add your API key as a secret.</p> In\u00a0[5]: Copied! <pre>import os\nfrom warnings import filterwarnings\n\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\nfilterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\n</pre> import os from warnings import filterwarnings from dotenv import load_dotenv def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\") <p>We can now define the main parameters:</p> In\u00a0[6]: Copied! <pre>from pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\nEMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\"\nEMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)\nTEMP_DIR = Path(mkdtemp())\nMILVUS_URI = str(TEMP_DIR / \"docling.db\")\nGEN_MODEL = HuggingFaceInferenceAPI(\n token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n)\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n</pre> from pathlib import Path from tempfile import mkdtemp from llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI EMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\" EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID) TEMP_DIR = Path(mkdtemp()) MILVUS_URI = str(TEMP_DIR / \"docling.db\") GEN_MODEL = HuggingFaceInferenceAPI( token=_get_env_from_colab_or_os(\"HF_TOKEN\"), model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\", ) embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\")) # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" <p>In this notebook we will use XML data from collections supported by Docling:</p> <ul> <li>Medical articles from PubMed Central\u00ae (PMC). They are available on an FTP server as <code>.tar.gz</code> files. Each file contains the full article data in XML format, among other supplementary files like images or spreadsheets.</li> <li>Patents from the United States Patent and Trademark Office. They are available in the Bulk Data Storage System (BDSS) as zip files. Each zip file may contain several patents in XML format.</li> </ul> <p>The raw files will be downloaded from the source and saved in a temporary directory.</p> In\u00a0[7]: Copied! 
<pre>import tarfile\nfrom io import BytesIO\n\nimport requests\n\n# PMC article PMC11703268\nurl: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\"\n\nprint(f\"Downloading {url}...\")\nbuf = BytesIO(requests.get(url).content)\nprint(\"Extracting and storing the XML file containing the article text...\")\nwith tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file:\n for tarinfo in tar_file:\n if tarinfo.isreg():\n file_path = Path(tarinfo.name)\n if file_path.suffix == \".nxml\":\n with open(TEMP_DIR / file_path.name, \"wb\") as file_obj:\n file_obj.write(tar_file.extractfile(tarinfo).read())\n print(f\"Stored XML file {file_path.name}\")\n</pre> import tarfile from io import BytesIO import requests # PMC article PMC11703268 url: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\" print(f\"Downloading {url}...\") buf = BytesIO(requests.get(url).content) print(\"Extracting and storing the XML file containing the article text...\") with tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file: for tarinfo in tar_file: if tarinfo.isreg(): file_path = Path(tarinfo.name) if file_path.suffix == \".nxml\": with open(TEMP_DIR / file_path.name, \"wb\") as file_obj: file_obj.write(tar_file.extractfile(tarinfo).read()) print(f\"Stored XML file {file_path.name}\") <pre>Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz...\nExtracting and storing the XML file containing the article text...\nStored XML file nihpp-2024.12.26.630351v1.nxml\n</pre> In\u00a0[8]: Copied! <pre>import zipfile\n\n# Patent grants from December 17-23, 2024\nurl: str = (\n \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\"\n)\nXML_SPLITTER: str = '<?xml version=\"1.0\"'\ndoc_num: int = 0\n\nprint(f\"Downloading {url}...\")\nbuf = BytesIO(requests.get(url).content)\nprint(\"Parsing zip file, splitting into XML sections, and exporting to files...\")\nwith zipfile.ZipFile(buf) as zf:\n res = zf.testzip()\n if res:\n print(\"Error validating zip file\")\n else:\n with zf.open(zf.namelist()[0]) as xf:\n is_patent = False\n patent_buffer = BytesIO()\n for xf_line in xf:\n decoded_line = xf_line.decode(errors=\"ignore\").rstrip()\n xml_index = decoded_line.find(XML_SPLITTER)\n if xml_index != -1:\n if (\n xml_index > 0\n ): # cases like </sequence-cwu><?xml version=\"1.0\"...\n patent_buffer.write(xf_line[:xml_index])\n patent_buffer.write(b\"\\r\\n\")\n xf_line = xf_line[xml_index:]\n if patent_buffer.getbuffer().nbytes > 0 and is_patent:\n doc_num += 1\n patent_id = f\"ipg241217-{doc_num}\"\n with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj:\n file_obj.write(patent_buffer.getbuffer())\n is_patent = False\n patent_buffer = BytesIO()\n elif decoded_line.startswith(\"<!DOCTYPE\"):\n is_patent = True\n patent_buffer.write(xf_line)\n</pre> import zipfile # Patent grants from December 17-23, 2024 url: str = ( \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\" ) XML_SPLITTER: str = '<?xml version=\"1.0\"' doc_num: int = 0 print(f\"Downloading {url}...\") buf = BytesIO(requests.get(url).content) print(\"Parsing zip file, splitting into XML sections, and exporting to files...\") with zipfile.ZipFile(buf) as zf: res = zf.testzip() if res: print(\"Error validating zip file\") else: with zf.open(zf.namelist()[0]) as xf: is_patent = False patent_buffer = BytesIO() for xf_line in xf: decoded_line = xf_line.decode(errors=\"ignore\").rstrip() xml_index = decoded_line.find(XML_SPLITTER) if xml_index != -1: if ( xml_index > 0 ): # cases like </sequence-cwu><?xml version=\"1.0\"... patent_buffer.write(xf_line[:xml_index]) patent_buffer.write(b\"\\r\\n\") xf_line = xf_line[xml_index:] if patent_buffer.getbuffer().nbytes > 0 and is_patent: doc_num += 1 patent_id = f\"ipg241217-{doc_num}\" with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj: file_obj.write(patent_buffer.getbuffer()) is_patent = False patent_buffer = BytesIO() elif decoded_line.startswith(\"<!DOCTYPE\"): is_patent = True patent_buffer.write(xf_line) <pre>Downloading https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip...\nParsing zip file, splitting into XML sections, and exporting to files...\n</pre> 
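<p>As noted in the section description, this splitting pipeline runs sequentially for simplicity but could be parallelized. A minimal sketch of one option, overlapping only the file writes with the standard library (assuming the scan above is refactored to collect <code>(patent_id, payload)</code> pairs into a list named <code>sections</code> instead of writing directly):</p> <pre>from concurrent.futures import ThreadPoolExecutor\n\n\ndef write_patent(patent_id: str, payload: bytes) -> None:\n    # hypothetical helper: dump one patent section to the temporary directory\n    with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj:\n        file_obj.write(payload)\n\n\n# the zip stream must still be scanned sequentially; only the writes overlap\nwith ThreadPoolExecutor(max_workers=8) as pool:\n    list(pool.map(lambda args: write_patent(*args), sections))\n</pre> In\u00a0[9]: Copied! 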
<pre>print(f\"Fetched and exported {doc_num} documents.\")\n</pre> print(f\"Fetched and exported {doc_num} documents.\") <pre>Fetched and exported 4014 documents.\n</pre> In\u00a0[11]: Copied! <pre>from tqdm.notebook import tqdm\n\nfrom docling.backend.xml.jats_backend import JatsDocumentBackend\nfrom docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import InputDocument\n\n# check PMC\nin_doc = InputDocument(\n path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\",\n format=InputFormat.XML_JATS,\n backend=JatsDocumentBackend,\n)\nbackend = JatsDocumentBackend(\n in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"\n)\nprint(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\")\n\n# check USPTO\nin_doc = InputDocument(\n path_or_stream=TEMP_DIR / \"ipg241217-1.xml\",\n format=InputFormat.XML_USPTO,\n backend=PatentUsptoDocumentBackend,\n)\nbackend = PatentUsptoDocumentBackend(\n in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\"\n)\nprint(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\")\n\npatent_valid = 0\npbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num)\nfor in_path in pbar:\n in_doc = InputDocument(\n path_or_stream=in_path,\n format=InputFormat.XML_USPTO,\n backend=PatentUsptoDocumentBackend,\n )\n backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)\n patent_valid += int(backend.is_valid())\n\nprint(f\"Found {patent_valid} patents out of {doc_num} XML files.\")\n</pre> from tqdm.notebook import tqdm from docling.backend.xml.jats_backend import JatsDocumentBackend from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend from docling.datamodel.base_models import InputFormat from docling.datamodel.document import InputDocument # check PMC in_doc = InputDocument( path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\", format=InputFormat.XML_JATS, backend=JatsDocumentBackend, ) backend = JatsDocumentBackend( in_doc=in_doc, path_or_stream=TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\" ) print(f\"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}\") # check USPTO in_doc = InputDocument( path_or_stream=TEMP_DIR / \"ipg241217-1.xml\", format=InputFormat.XML_USPTO, backend=PatentUsptoDocumentBackend, ) backend = PatentUsptoDocumentBackend( in_doc=in_doc, path_or_stream=TEMP_DIR / \"ipg241217-1.xml\" ) print(f\"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}\") patent_valid = 0 pbar = tqdm(TEMP_DIR.glob(\"*.xml\"), total=doc_num) for in_path in pbar: in_doc = InputDocument( path_or_stream=in_path, format=InputFormat.XML_USPTO, backend=PatentUsptoDocumentBackend, ) backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path) patent_valid += int(backend.is_valid()) print(f\"Found {patent_valid} patents out of {doc_num} XML files.\") <pre>Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True\nDocument ipg241217-1.xml is a valid patent? True\n</pre> <pre> 0%| | 0/4014 [00:00<?, ?it/s]</pre> <pre>Found 3928 patents out of 4014 XML files.\n</pre> <p>Calling the function <code>convert()</code> will convert the input document into a <code>DoclingDocument</code></p> In\u00a0[12]: Copied! 
<pre>doc = backend.convert()\n\nclaims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\")\nprint(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims')\n</pre> doc = backend.convert() claims_sec = next(item for item in doc.texts if item.text == \"CLAIMS\") print(f'Patent \"{doc.texts[0].text}\" has {len(claims_sec.children)} claims') <pre>Patent \"Semiconductor package\" has 19 claims\n</pre> <p>\u270f\ufe0f Tip: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic <code>DocumentConverter</code> object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in Simple Conversion is the recommended usage for the supported XML files.</p> <p>The <code>DoclingDocument</code> format of the converted patents has a rich hierarchical structure, inherited from the original XML document and preserved by the Docling custom backend. In this notebook, we will leverage:</p> <ul> <li>The <code>SimpleDirectoryReader</code> pattern to iterate over the exported XML files created in section Fetch the data.</li> <li>The LlamaIndex extensions, <code>DoclingReader</code> and <code>DoclingNodeParser</code>, to ingest the patent chunks into a Milvus vector store.</li> <li>The <code>HierarchicalChunker</code> implementation, which applies document-based hierarchical chunking to leverage patent structures such as sections and paragraphs within sections (a direct-usage sketch follows the node parser setup below).</li> </ul> <p>Refer to other possible implementations and usage patterns in the Chunking documentation and the RAG with LlamaIndex notebook.</p> In\u00a0[13]: Copied! <pre>from llama_index.core import SimpleDirectoryReader\nfrom llama_index.readers.docling import DoclingReader\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\ndir_reader = SimpleDirectoryReader(\n input_dir=TEMP_DIR,\n exclude=[\"docling.db\", \"*.nxml\"],\n file_extractor={\".xml\": reader},\n filename_as_id=True,\n num_files_limit=100,\n)\n</pre> from llama_index.core import SimpleDirectoryReader from llama_index.readers.docling import DoclingReader reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) dir_reader = SimpleDirectoryReader( input_dir=TEMP_DIR, exclude=[\"docling.db\", \"*.nxml\"], file_extractor={\".xml\": reader}, filename_as_id=True, num_files_limit=100, ) In\u00a0[14]: Copied! <pre>from llama_index.node_parser.docling import DoclingNodeParser\n\nnode_parser = DoclingNodeParser()\n</pre> from llama_index.node_parser.docling import DoclingNodeParser node_parser = DoclingNodeParser() 
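<p>Since <code>HierarchicalChunker</code> does the structural work behind the node parser, it can also be driven directly to inspect the chunks it produces. A minimal sketch, reusing the patent document <code>doc</code> converted above:</p> <pre>from docling.chunking import HierarchicalChunker\n\n# iterate over structure-aware chunks (sections, paragraphs, list items, ...)\nfor chunk in HierarchicalChunker().chunk(dl_doc=doc):\n    print(chunk.text[:80])\n</pre> In\u00a0[\u00a0]: Copied! 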
<pre>from llama_index.core import StorageContext, VectorStoreIndex\nfrom llama_index.vector_stores.milvus import MilvusVectorStore\n\nvector_store = MilvusVectorStore(\n uri=MILVUS_URI,\n dim=embed_dim,\n overwrite=True,\n)\n\nindex = VectorStoreIndex.from_documents(\n documents=dir_reader.load_data(show_progress=True),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n show_progress=True,\n)\n</pre> from llama_index.core import StorageContext, VectorStoreIndex from llama_index.vector_stores.milvus import MilvusVectorStore vector_store = MilvusVectorStore( uri=MILVUS_URI, dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=dir_reader.load_data(show_progress=True), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, show_progress=True, ) <p>Finally, add the PMC article to the vector store directly from the reader.</p> In\u00a0[14]: Copied! <pre>index.from_documents(\n documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\n</pre> index.from_documents( documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) Out[14]: <pre><llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x373a7f7d0></pre> <p>The retriever can be used to identify highly relevant documents:</p> In\u00a0[15]: Copied! <pre>retriever = index.as_retriever(similarity_top_k=3)\nresults = retriever.retrieve(\"What patents are related to fitness devices?\")\n\nfor item in results:\n print(item)\n</pre> retriever = index.as_retriever(similarity_top_k=3) results = retriever.retrieve(\"What patents are related to fitness devices?\") for item in results: print(item) <pre>Node ID: 5afd36c0-a739-4a88-a51c-6d0f75358db5\nText: The portable fitness monitoring device 102 may be a device such\nas, for example, a mobile phone, a personal digital assistant, a music\nfile player (e.g. and MP3 player), an intelligent article for wearing\n(e.g. a fitness monitoring garment, wrist band, or watch), a dongle\n(e.g. a small hardware device that protects software) that includes a\nfitn...\nScore: 0.772\n\nNode ID: f294b5fd-9089-43cb-8c4e-d1095a634ff1\nText: US Patent Application US 20120071306 entitled \u201cPortable\nMultipurpose Whole Body Exercise Device\u201d discloses a portable\nmultipurpose whole body exercise device which can be used for general\nfitness, Pilates-type, core strengthening, therapeutic, and\nrehabilitative exercises as well as stretching and physical therapy\nand which includes storable acc...\nScore: 0.749\n\nNode ID: 8251c7ef-1165-42e1-8c91-c99c8a711bf7\nText: Program products, methods, and systems for providing fitness\nmonitoring services of the present invention can include any software\napplication executed by one or more computing devices. A computing\ndevice can be any type of computing device having one or more\nprocessors. For example, a computing device can be a workstation,\nmobile device (e.g., ...\nScore: 0.744\n\n</pre> <p>With the query engine, we can run the question-answering with the RAG pattern on the set of indexed documents.</p> <p>First, we can prompt the LLM directly:</p> In\u00a0[16]: Copied! 
<pre>from llama_index.core.base.llms.types import ChatMessage, MessageRole\nfrom rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console()\nquery = \"Do mosquitoes in high altitude expand viruses over large distances?\"\n\nusr_msg = ChatMessage(role=MessageRole.USER, content=query)\nresponse = GEN_MODEL.chat(messages=[usr_msg])\n\nconsole.print(Panel(query, title=\"Prompt\", border_style=\"bold red\"))\nconsole.print(\n Panel(\n response.message.content.strip(),\n title=\"Generated Content\",\n border_style=\"bold green\",\n )\n)\n</pre> from llama_index.core.base.llms.types import ChatMessage, MessageRole from rich.console import Console from rich.panel import Panel console = Console() query = \"Do mosquitoes in high altitude expand viruses over large distances?\" usr_msg = ChatMessage(role=MessageRole.USER, content=query) response = GEN_MODEL.chat(messages=[usr_msg]) console.print(Panel(query, title=\"Prompt\", border_style=\"bold red\")) console.print( Panel( response.message.content.strip(), title=\"Generated Content\", border_style=\"bold green\", ) ) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Do mosquitoes in high altitude expand viruses over large distances? \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not \u2502\n\u2502 primarily dependent on altitude. 
Mosquitoes are vectors for various diseases, such as malaria, dengue fever, \u2502\n\u2502 and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, \u2502\n\u2502 and environmental conditions that support their survival and reproduction. \u2502\n\u2502 \u2502\n\u2502 At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder \u2502\n\u2502 temperatures, lower humidity, and stronger winds, which can limit their population size and distribution. \u2502\n\u2502 However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases \u2502\n\u2502 in these areas. \u2502\n\u2502 \u2502\n\u2502 It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is \u2502\n\u2502 not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance \u2502\n\u2502 transmission of viruses is more often associated with human travel and transportation, which can rapidly spread \u2502\n\u2502 infected mosquitoes or humans to new areas, leading to the spread of disease. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <p>Now, we can compare the response when the model is prompted with the indexed PMC article as supporting context:</p> In\u00a0[17]: Copied! 
<pre>from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n\nfilters = MetadataFilters(\n filters=[\n ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"),\n ]\n)\n\nquery_engine = index.as_query_engine(llm=GEN_MODEL, filters=filters, similarity_top_k=3)\nresult = query_engine.query(query)\n\nconsole.print(\n Panel(\n result.response.strip(),\n title=\"Generated Content with RAG\",\n border_style=\"bold green\",\n )\n)\n</pre> from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters filters = MetadataFilters( filters=[ ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"), ] ) query_engine = index.as_query_engine(llm=GEN_MODEL, filters=filters, similarity_top_k=3) result = query_engine.query(query) console.print( Panel( result.response.strip(), title=\"Generated Content with RAG\", border_style=\"bold green\", ) ) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content with RAG \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female \u2502\n\u2502 mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with \u2502\n\u2502 arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with \u2502\n\u2502 flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, \u2502\n\u2502 including three arboviruses that affect humans (dengue, West Nile, and M\u2019Poko viruses). The study provides \u2502\n\u2502 compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre>"},{"location":"examples/backend_xml_rag/#conversion-of-custom-xml","title":"Conversion of custom XML\u00b6","text":""},{"location":"examples/backend_xml_rag/#overview","title":"Overview\u00b6","text":""},{"location":"examples/backend_xml_rag/#simple-conversion","title":"Simple conversion\u00b6","text":"<p>The XML file format defines and stores data in a format that is both human-readable and machine-readable. 
Because of this flexibility, Docling requires custom backend processors to interpret XML definitions and convert them into <code>DoclingDocument</code> objects.</p> <p>Some public data collections in XML format are already supported by Docling (USPTO patents and PMC articles). In these cases, the document conversion is straightforward and the same as with any other supported format, such as PDF or HTML. The execution example in Simple Conversion is the recommended usage of Docling for a single file:</p>"},{"location":"examples/backend_xml_rag/#end-to-end-application","title":"End-to-end application\u00b6","text":"<p>This section describes a step-by-step application for processing XML files from supported public collections and using them for question-answering.</p>"},{"location":"examples/backend_xml_rag/#setup","title":"Setup\u00b6","text":""},{"location":"examples/backend_xml_rag/#fetch-the-data","title":"Fetch the data\u00b6","text":""},{"location":"examples/backend_xml_rag/#pmc-articles","title":"PMC articles\u00b6","text":"<p>The OA file is a manifest file of all the PMC articles, including the URL path to download the source files. In this notebook we will use as an example the article Pathogens spread by high-altitude windborne mosquitoes, which is available in the archive file PMC11703268.tar.gz.</p>"},{"location":"examples/backend_xml_rag/#uspto-patents","title":"USPTO patents\u00b6","text":"<p>Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, splits its content into sections, and dumps each section as an XML file. For simplicity, this pipeline is shown here in a sequential manner, but it could be parallelized, as sketched after the splitting code.</p>"},{"location":"examples/backend_xml_rag/#using-the-backend-converter-optional","title":"Using the backend converter (optional)\u00b6","text":"<ul> <li>The custom backend converters <code>JatsDocumentBackend</code> and <code>PatentUsptoDocumentBackend</code> handle the parsing of PMC articles and USPTO patents, respectively.</li> <li>As with any other backend, you can use the <code>is_valid()</code> function to check whether the input document is supported by this backend.</li> <li>Note that some XML sections in the original USPTO zip file may not represent patents (e.g., sequence listings), and these will therefore be reported as invalid by the backend.</li> </ul>"},{"location":"examples/backend_xml_rag/#parse-chunk-and-index","title":"Parse, chunk, and index\u00b6","text":""},{"location":"examples/backend_xml_rag/#set-the-docling-reader-and-the-directory-reader","title":"Set the Docling reader and the directory reader\u00b6","text":"<p>Note that <code>DoclingReader</code> uses Docling's <code>DocumentConverter</code> by default and therefore it will recognize the format of the XML files and leverage the <code>PatentUsptoDocumentBackend</code> automatically.</p> <p>For demonstration purposes, we limit the scope of the analysis to the first 100 patents.</p>"},{"location":"examples/backend_xml_rag/#set-the-node-parser","title":"Set the node parser\u00b6","text":"<p>Note that the <code>HierarchicalChunker</code> is the default chunking implementation of the <code>DoclingNodeParser</code>.</p>"},{"location":"examples/backend_xml_rag/#set-a-local-milvus-database-and-run-the-ingestion","title":"Set a local Milvus database and run the ingestion\u00b6","text":""},{"location":"examples/backend_xml_rag/#question-answering-with-rag","title":"Question-answering with 
RAG\u00b6","text":""},{"location":"examples/batch_convert/","title":"Batch conversion","text":"In\u00a0[\u00a0]: Copied! <pre>import json\nimport logging\nimport time\nfrom collections.abc import Iterable\nfrom pathlib import Path\n</pre> import json import logging import time from collections.abc import Iterable from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>import yaml\nfrom docling_core.types.doc import ImageRefMode\n</pre> import yaml from docling_core.types.doc import ImageRefMode In\u00a0[\u00a0]: Copied! <pre>from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend from docling.datamodel.base_models import ConversionStatus, InputFormat from docling.datamodel.document import ConversionResult from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied! <pre>_log = logging.getLogger(__name__)\n</pre> _log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied! <pre>USE_V2 = True\nUSE_LEGACY = False\n</pre> USE_V2 = True USE_LEGACY = False In\u00a0[\u00a0]: Copied! <pre>def export_documents(\n conv_results: Iterable[ConversionResult],\n output_dir: Path,\n):\n output_dir.mkdir(parents=True, exist_ok=True)\n\n success_count = 0\n failure_count = 0\n partial_success_count = 0\n\n for conv_res in conv_results:\n if conv_res.status == ConversionStatus.SUCCESS:\n success_count += 1\n doc_filename = conv_res.input.file.stem\n\n if USE_V2:\n conv_res.document.save_as_json(\n output_dir / f\"{doc_filename}.json\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n conv_res.document.save_as_html(\n output_dir / f\"{doc_filename}.html\",\n image_mode=ImageRefMode.EMBEDDED,\n )\n conv_res.document.save_as_document_tokens(\n output_dir / f\"{doc_filename}.doctags.txt\"\n )\n conv_res.document.save_as_markdown(\n output_dir / f\"{doc_filename}.md\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n conv_res.document.save_as_markdown(\n output_dir / f\"{doc_filename}.txt\",\n image_mode=ImageRefMode.PLACEHOLDER,\n strict_text=True,\n )\n\n # Export Docling document format to YAML:\n with (output_dir / f\"{doc_filename}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(conv_res.document.export_to_dict()))\n\n # Export Docling document format to doctags:\n with (output_dir / f\"{doc_filename}.doctags.txt\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_document_tokens())\n\n # Export Docling document format to markdown:\n with (output_dir / f\"{doc_filename}.md\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_markdown())\n\n # Export Docling document format to text:\n with (output_dir / f\"{doc_filename}.txt\").open(\"w\") as fp:\n fp.write(conv_res.document.export_to_markdown(strict_text=True))\n\n if USE_LEGACY:\n # Export Deep Search document JSON format:\n with (output_dir / f\"{doc_filename}.legacy.json\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(json.dumps(conv_res.legacy_document.export_to_dict()))\n\n # Export Text format:\n with (output_dir / f\"{doc_filename}.legacy.txt\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(\n 
conv_res.legacy_document.export_to_markdown(strict_text=True)\n )\n\n # Export Markdown format:\n with (output_dir / f\"{doc_filename}.legacy.md\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(conv_res.legacy_document.export_to_markdown())\n\n # Export Document Tags format:\n with (output_dir / f\"{doc_filename}.legacy.doctags.txt\").open(\n \"w\", encoding=\"utf-8\"\n ) as fp:\n fp.write(conv_res.legacy_document.export_to_document_tokens())\n\n elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS:\n _log.info(\n f\"Document {conv_res.input.file} was partially converted with the following errors:\"\n )\n for item in conv_res.errors:\n _log.info(f\"\\t{item.error_message}\")\n partial_success_count += 1\n else:\n _log.info(f\"Document {conv_res.input.file} failed to convert.\")\n failure_count += 1\n\n _log.info(\n f\"Processed {success_count + partial_success_count + failure_count} docs, \"\n f\"of which {failure_count} failed \"\n f\"and {partial_success_count} were partially converted.\"\n )\n return success_count, partial_success_count, failure_count\n</pre> def export_documents( conv_results: Iterable[ConversionResult], output_dir: Path, ): output_dir.mkdir(parents=True, exist_ok=True) success_count = 0 failure_count = 0 partial_success_count = 0 for conv_res in conv_results: if conv_res.status == ConversionStatus.SUCCESS: success_count += 1 doc_filename = conv_res.input.file.stem if USE_V2: conv_res.document.save_as_json( output_dir / f\"{doc_filename}.json\", image_mode=ImageRefMode.PLACEHOLDER, ) conv_res.document.save_as_html( output_dir / f\"{doc_filename}.html\", image_mode=ImageRefMode.EMBEDDED, ) conv_res.document.save_as_document_tokens( output_dir / f\"{doc_filename}.doctags.txt\" ) conv_res.document.save_as_markdown( output_dir / f\"{doc_filename}.md\", image_mode=ImageRefMode.PLACEHOLDER, ) conv_res.document.save_as_markdown( output_dir / f\"{doc_filename}.txt\", image_mode=ImageRefMode.PLACEHOLDER, strict_text=True, ) # Export Docling document format to YAML: with (output_dir / f\"{doc_filename}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(conv_res.document.export_to_dict())) # Export Docling document format to doctags: with (output_dir / f\"{doc_filename}.doctags.txt\").open(\"w\") as fp: fp.write(conv_res.document.export_to_document_tokens()) # Export Docling document format to markdown: with (output_dir / f\"{doc_filename}.md\").open(\"w\") as fp: fp.write(conv_res.document.export_to_markdown()) # Export Docling document format to text: with (output_dir / f\"{doc_filename}.txt\").open(\"w\") as fp: fp.write(conv_res.document.export_to_markdown(strict_text=True)) if USE_LEGACY: # Export Deep Search document JSON format: with (output_dir / f\"{doc_filename}.legacy.json\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(json.dumps(conv_res.legacy_document.export_to_dict())) # Export Text format: with (output_dir / f\"{doc_filename}.legacy.txt\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write( conv_res.legacy_document.export_to_markdown(strict_text=True) ) # Export Markdown format: with (output_dir / f\"{doc_filename}.legacy.md\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(conv_res.legacy_document.export_to_markdown()) # Export Document Tags format: with (output_dir / f\"{doc_filename}.legacy.doctags.txt\").open( \"w\", encoding=\"utf-8\" ) as fp: fp.write(conv_res.legacy_document.export_to_document_tokens()) elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS: _log.info( f\"Document {conv_res.input.file} was partially converted with 
the following errors:\" ) for item in conv_res.errors: _log.info(f\"\\t{item.error_message}\") partial_success_count += 1 else: _log.info(f\"Document {conv_res.input.file} failed to convert.\") failure_count += 1 _log.info( f\"Processed {success_count + partial_success_count + failure_count} docs, \" f\"of which {failure_count} failed \" f\"and {partial_success_count} were partially converted.\" ) return success_count, partial_success_count, failure_count In\u00a0[\u00a0]: Copied! <pre>def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_paths = [\n data_folder / \"pdf/2206.01062.pdf\",\n data_folder / \"pdf/2203.01017v2.pdf\",\n data_folder / \"pdf/2305.03393v1.pdf\",\n data_folder / \"pdf/redp5110_sampled.pdf\",\n ]\n\n # buf = BytesIO((data_folder / \"pdf/2206.01062.pdf\").open(\"rb\").read())\n # docs = [DocumentStream(name=\"my_doc.pdf\", stream=buf)]\n # input = DocumentConversionInput.from_streams(docs)\n\n # # Turn on inline debug visualizations:\n # settings.debug.visualize_layout = True\n # settings.debug.visualize_ocr = True\n # settings.debug.visualize_tables = True\n # settings.debug.visualize_cells = True\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.generate_page_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options, backend=DoclingParseV4DocumentBackend\n )\n }\n )\n\n start_time = time.time()\n\n conv_results = doc_converter.convert_all(\n input_doc_paths,\n raises_on_error=False, # to let conversion run through all and examine results at the end\n )\n success_count, partial_success_count, failure_count = export_documents(\n conv_results, output_dir=Path(\"scratch\")\n )\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document conversion complete in {end_time:.2f} seconds.\")\n\n if failure_count > 0:\n raise RuntimeError(\n f\"The example failed converting {failure_count} on {len(input_doc_paths)}.\"\n )\n</pre> def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_paths = [ data_folder / \"pdf/2206.01062.pdf\", data_folder / \"pdf/2203.01017v2.pdf\", data_folder / \"pdf/2305.03393v1.pdf\", data_folder / \"pdf/redp5110_sampled.pdf\", ] # buf = BytesIO((data_folder / \"pdf/2206.01062.pdf\").open(\"rb\").read()) # docs = [DocumentStream(name=\"my_doc.pdf\", stream=buf)] # input = DocumentConversionInput.from_streams(docs) # # Turn on inline debug visualizations: # settings.debug.visualize_layout = True # settings.debug.visualize_ocr = True # settings.debug.visualize_tables = True # settings.debug.visualize_cells = True pipeline_options = PdfPipelineOptions() pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, backend=DoclingParseV4DocumentBackend ) } ) start_time = time.time() conv_results = doc_converter.convert_all( input_doc_paths, raises_on_error=False, # to let conversion run through all and examine results at the end ) success_count, partial_success_count, failure_count = export_documents( conv_results, output_dir=Path(\"scratch\") ) end_time = time.time() - start_time _log.info(f\"Document conversion complete in {end_time:.2f} seconds.\") if failure_count > 0: raise RuntimeError( f\"The example failed converting {failure_count} on {len(input_doc_paths)}.\" ) In\u00a0[\u00a0]: Copied! 
<pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/compare_vlm_models/","title":"Compare VLM models","text":"In\u00a0[\u00a0]: Copied! <pre>import json\nimport sys\nimport time\nfrom pathlib import Path\n</pre> import json import sys import time from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>from docling_core.types.doc import DocItemLabel, ImageRefMode\nfrom docling_core.types.doc.document import DEFAULT_EXPORT_LABELS\nfrom tabulate import tabulate\n</pre> from docling_core.types.doc import DocItemLabel, ImageRefMode from docling_core.types.doc.document import DEFAULT_EXPORT_LABELS from tabulate import tabulate In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel import vlm_model_specs\nfrom docling.datamodel.accelerator_options import AcceleratorDevice\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import (\n InferenceFramework,\n InlineVlmOptions,\n ResponseFormat,\n TransformersModelType,\n TransformersPromptStyle,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n</pre> from docling.datamodel import vlm_model_specs from docling.datamodel.accelerator_options import AcceleratorDevice from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ( InferenceFramework, InlineVlmOptions, ResponseFormat, TransformersModelType, TransformersPromptStyle, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline In\u00a0[\u00a0]: Copied! 
<pre>def convert(sources: list[Path], converter: DocumentConverter):\n model_id = pipeline_options.vlm_options.repo_id.replace(\"/\", \"_\")\n framework = pipeline_options.vlm_options.inference_framework\n for source in sources:\n print(\"================================================\")\n print(\"Processing...\")\n print(f\"Source: {source}\")\n print(\"---\")\n print(f\"Model: {model_id}\")\n print(f\"Framework: {framework}\")\n print(\"================================================\")\n print(\"\")\n\n res = converter.convert(source)\n\n print(\"\")\n\n fname = f\"{res.input.file.stem}-{model_id}-{framework}\"\n\n inference_time = 0.0\n for i, page in enumerate(res.pages):\n inference_time += page.predictions.vlm_response.generation_time\n print(\"\")\n print(\n f\" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:\"\n )\n print(page.predictions.vlm_response.text)\n print(\" ---------- \")\n\n print(\"===== Final output of the converted document =======\")\n\n with (out_path / f\"{fname}.json\").open(\"w\") as fp:\n fp.write(json.dumps(res.document.export_to_dict()))\n\n res.document.save_as_json(\n out_path / f\"{fname}.json\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n print(f\" => produced {out_path / fname}.json\")\n\n res.document.save_as_markdown(\n out_path / f\"{fname}.md\",\n image_mode=ImageRefMode.PLACEHOLDER,\n )\n print(f\" => produced {out_path / fname}.md\")\n\n res.document.save_as_html(\n out_path / f\"{fname}.html\",\n image_mode=ImageRefMode.EMBEDDED,\n labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE],\n split_page_view=True,\n )\n print(f\" => produced {out_path / fname}.html\")\n\n pg_num = res.document.num_pages()\n print(\"\")\n print(\n f\"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}\"\n )\n print(\"====================================================\")\n\n return [\n source,\n model_id,\n str(framework),\n pg_num,\n inference_time,\n ]\n</pre> def convert(sources: list[Path], converter: DocumentConverter): model_id = pipeline_options.vlm_options.repo_id.replace(\"/\", \"_\") framework = pipeline_options.vlm_options.inference_framework for source in sources: print(\"================================================\") print(\"Processing...\") print(f\"Source: {source}\") print(\"---\") print(f\"Model: {model_id}\") print(f\"Framework: {framework}\") print(\"================================================\") print(\"\") res = converter.convert(source) print(\"\") fname = f\"{res.input.file.stem}-{model_id}-{framework}\" inference_time = 0.0 for i, page in enumerate(res.pages): inference_time += page.predictions.vlm_response.generation_time print(\"\") print( f\" ---------- Predicted page {i} in {pipeline_options.vlm_options.response_format} in {page.predictions.vlm_response.generation_time} [sec]:\" ) print(page.predictions.vlm_response.text) print(\" ---------- \") print(\"===== Final output of the converted document =======\") with (out_path / f\"{fname}.json\").open(\"w\") as fp: fp.write(json.dumps(res.document.export_to_dict())) res.document.save_as_json( out_path / f\"{fname}.json\", image_mode=ImageRefMode.PLACEHOLDER, ) print(f\" => produced {out_path / fname}.json\") res.document.save_as_markdown( out_path / f\"{fname}.md\", image_mode=ImageRefMode.PLACEHOLDER, ) print(f\" => produced {out_path / fname}.md\") res.document.save_as_html( out_path / f\"{fname}.html\", image_mode=ImageRefMode.EMBEDDED, 
labels=[*DEFAULT_EXPORT_LABELS, DocItemLabel.FOOTNOTE], split_page_view=True, ) print(f\" => produced {out_path / fname}.html\") pg_num = res.document.num_pages() print(\"\") print( f\"Total document prediction time: {inference_time:.2f} seconds, pages: {pg_num}\" ) print(\"====================================================\") return [ source, model_id, str(framework), pg_num, inference_time, ] In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n sources = [\n \"tests/data/pdf/2305.03393v1-pg9.pdf\",\n ]\n\n out_path = Path(\"scratch\")\n out_path.mkdir(parents=True, exist_ok=True)\n\n ## Definition of more inline models\n llava_qwen = InlineVlmOptions(\n repo_id=\"llava-hf/llava-interleave-qwen-0.5b-hf\",\n # prompt=\"Read text in the image.\",\n prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\",\n # prompt=\"Parse the reading order of this document.\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,\n supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU],\n scale=2.0,\n temperature=0.0,\n )\n\n # Note that this is not the expected way of using the Dolphin model, but it shows the usage of a raw prompt.\n dolphin_oneshot = InlineVlmOptions(\n repo_id=\"ByteDance/Dolphin\",\n prompt=\"<s>Read text in the image. <Answer/>\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,\n transformers_prompt_style=TransformersPromptStyle.RAW,\n supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU],\n scale=2.0,\n temperature=0.0,\n )\n\n ## Use VlmPipeline\n pipeline_options = VlmPipelineOptions()\n pipeline_options.generate_page_images = True\n\n ## On GPU systems, enable flash_attention_2 with CUDA:\n # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA\n # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True\n\n vlm_models = [\n ## DocTags / SmolDocling models\n vlm_model_specs.SMOLDOCLING_MLX,\n vlm_model_specs.SMOLDOCLING_TRANSFORMERS,\n ## Markdown models (using MLX framework)\n vlm_model_specs.QWEN25_VL_3B_MLX,\n vlm_model_specs.PIXTRAL_12B_MLX,\n vlm_model_specs.GEMMA3_12B_MLX,\n ## Markdown models (using Transformers framework)\n vlm_model_specs.GRANITE_VISION_TRANSFORMERS,\n vlm_model_specs.PHI4_TRANSFORMERS,\n vlm_model_specs.PIXTRAL_12B_TRANSFORMERS,\n ## More inline models\n dolphin_oneshot,\n llava_qwen,\n ]\n\n # Remove MLX models if not on Mac\n if sys.platform != \"darwin\":\n vlm_models = [\n m for m in vlm_models if m.inference_framework != InferenceFramework.MLX\n ]\n\n rows = []\n for vlm_options in vlm_models:\n pipeline_options.vlm_options = vlm_options\n\n ## Set up pipeline for PDF or image inputs\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n InputFormat.IMAGE: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n },\n )\n\n row = convert(sources=sources, converter=converter)\n rows.append(row)\n\n print(\n tabulate(\n rows, headers=[\"source\", \"model_id\", \"framework\", \"num_pages\", \"time\"]\n )\n )\n\n print(\"see if memory gets released ...\")\n time.sleep(10)\n</pre> if __name__ == \"__main__\": sources = [ \"tests/data/pdf/2305.03393v1-pg9.pdf\", ] out_path = 
Path(\"scratch\") out_path.mkdir(parents=True, exist_ok=True) ## Definiton of more inline models llava_qwen = InlineVlmOptions( repo_id=\"llava-hf/llava-interleave-qwen-0.5b-hf\", # prompt=\"Read text in the image.\", prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\", # prompt=\"Parse the reading order of this document.\", response_format=ResponseFormat.MARKDOWN, inference_framework=InferenceFramework.TRANSFORMERS, transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT, supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU], scale=2.0, temperature=0.0, ) # Note that this is not the expected way of using the Dolphin model, but it shows the usage of a raw prompt. dolphin_oneshot = InlineVlmOptions( repo_id=\"ByteDance/Dolphin\", prompt=\"Read text in the image. \", response_format=ResponseFormat.MARKDOWN, inference_framework=InferenceFramework.TRANSFORMERS, transformers_model_type=TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT, transformers_prompt_style=TransformersPromptStyle.RAW, supported_devices=[AcceleratorDevice.CUDA, AcceleratorDevice.CPU], scale=2.0, temperature=0.0, ) ## Use VlmPipeline pipeline_options = VlmPipelineOptions() pipeline_options.generate_page_images = True ## On GPU systems, enable flash_attention_2 with CUDA: # pipeline_options.accelerator_options.device = AcceleratorDevice.CUDA # pipeline_options.accelerator_options.cuda_use_flash_attention2 = True vlm_models = [ ## DocTags / SmolDocling models vlm_model_specs.SMOLDOCLING_MLX, vlm_model_specs.SMOLDOCLING_TRANSFORMERS, ## Markdown models (using MLX framework) vlm_model_specs.QWEN25_VL_3B_MLX, vlm_model_specs.PIXTRAL_12B_MLX, vlm_model_specs.GEMMA3_12B_MLX, ## Markdown models (using Transformers framework) vlm_model_specs.GRANITE_VISION_TRANSFORMERS, vlm_model_specs.PHI4_TRANSFORMERS, vlm_model_specs.PIXTRAL_12B_TRANSFORMERS, ## More inline models dolphin_oneshot, llava_qwen, ] # Remove MLX models if not on Mac if sys.platform != \"darwin\": vlm_models = [ m for m in vlm_models if m.inference_framework != InferenceFramework.MLX ] rows = [] for vlm_options in vlm_models: pipeline_options.vlm_options = vlm_options ## Set up pipeline for PDF or image inputs converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), InputFormat.IMAGE: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), }, ) row = convert(sources=sources, converter=converter) rows.append(row) print( tabulate( rows, headers=[\"source\", \"model_id\", \"framework\", \"num_pages\", \"time\"] ) ) print(\"see if memory gets released ...\") time.sleep(10)"},{"location":"examples/compare_vlm_models/#compare-vlm-models","title":"Compare VLM models\u00b6","text":"<p>This example runs the VLM pipeline with different vision-language models. Their runtime as well output quality is compared.</p>"},{"location":"examples/custom_convert/","title":"Custom conversion","text":"In\u00a0[\u00a0]: Copied! <pre>import json\nimport logging\nimport time\nfrom pathlib import Path\n</pre> import json import logging import time from pathlib import Path In\u00a0[\u00a0]: Copied! 
<pre>from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied! <pre>_log = logging.getLogger(__name__)\n</pre> _log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied! <pre>def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n ###########################################################################\n\n # The following sections contain a combination of PipelineOptions\n # and PDF Backends for various configurations.\n # Uncomment one section at a time to see the differences in the output.\n\n # PyPdfium without EasyOCR\n # --------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = False\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = False\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(\n # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend\n # )\n # }\n # )\n\n # PyPdfium with EasyOCR\n # -----------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(\n # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend\n # )\n # }\n # )\n\n # Docling Parse without EasyOCR\n # -------------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = False\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with EasyOCR\n # ----------------------\n pipeline_options = PdfPipelineOptions()\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n pipeline_options.ocr_options.lang = [\"es\"]\n pipeline_options.accelerator_options = AcceleratorOptions(\n num_threads=4, device=AcceleratorDevice.AUTO\n )\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n # Docling Parse with EasyOCR (CPU only)\n # ----------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.ocr_options.use_gpu = False # <-- set this.\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with Tesseract\n # ----------------------\n # 
pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = TesseractOcrOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with Tesseract CLI\n # ----------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = TesseractCliOcrOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n # Docling Parse with ocrmac (Mac only)\n # ----------------------\n # pipeline_options = PdfPipelineOptions()\n # pipeline_options.do_ocr = True\n # pipeline_options.do_table_structure = True\n # pipeline_options.table_structure_options.do_cell_matching = True\n # pipeline_options.ocr_options = OcrMacOptions()\n\n # doc_converter = DocumentConverter(\n # format_options={\n # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n # }\n # )\n\n ###########################################################################\n\n start_time = time.time()\n conv_result = doc_converter.convert(input_doc_path)\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted in {end_time:.2f} seconds.\")\n\n ## Export results\n output_dir = Path(\"scratch\")\n output_dir.mkdir(parents=True, exist_ok=True)\n doc_filename = conv_result.input.file.stem\n\n # Export Deep Search document JSON format:\n with (output_dir / f\"{doc_filename}.json\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(json.dumps(conv_result.document.export_to_dict()))\n\n # Export Text format:\n with (output_dir / f\"{doc_filename}.txt\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_text())\n\n # Export Markdown format:\n with (output_dir / f\"{doc_filename}.md\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_markdown())\n\n # Export Document Tags format:\n with (output_dir / f\"{doc_filename}.doctags\").open(\"w\", encoding=\"utf-8\") as fp:\n fp.write(conv_result.document.export_to_document_tokens())\n</pre> def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" ########################################################################### # The following sections contain a combination of PipelineOptions # and PDF Backends for various configurations. # Uncomment one section at a time to see the differences in the output. 
# PyPdfium without EasyOCR # -------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = False # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = False # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption( # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend # ) # } # ) # PyPdfium with EasyOCR # ----------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption( # pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend # ) # } # ) # Docling Parse without EasyOCR # ------------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = False # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with EasyOCR # ---------------------- pipeline_options = PdfPipelineOptions() pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True pipeline_options.ocr_options.lang = [\"es\"] pipeline_options.accelerator_options = AcceleratorOptions( num_threads=4, device=AcceleratorDevice.AUTO ) doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) # Docling Parse with EasyOCR (CPU only) # ---------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.ocr_options.use_gpu = False # <-- set this. 
# pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with Tesseract # ---------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = TesseractOcrOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with Tesseract CLI # ---------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = TesseractCliOcrOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) # Docling Parse with ocrmac (Mac only) # ---------------------- # pipeline_options = PdfPipelineOptions() # pipeline_options.do_ocr = True # pipeline_options.do_table_structure = True # pipeline_options.table_structure_options.do_cell_matching = True # pipeline_options.ocr_options = OcrMacOptions() # doc_converter = DocumentConverter( # format_options={ # InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) # } # ) ########################################################################### start_time = time.time() conv_result = doc_converter.convert(input_doc_path) end_time = time.time() - start_time _log.info(f\"Document converted in {end_time:.2f} seconds.\") ## Export results output_dir = Path(\"scratch\") output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_result.input.file.stem # Export Deep Search document JSON format: with (output_dir / f\"{doc_filename}.json\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(json.dumps(conv_result.document.export_to_dict())) # Export Text format: with (output_dir / f\"{doc_filename}.txt\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_text()) # Export Markdown format: with (output_dir / f\"{doc_filename}.md\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_markdown()) # Export Document Tags format: with (output_dir / f\"{doc_filename}.doctags\").open(\"w\", encoding=\"utf-8\") as fp: fp.write(conv_result.document.export_to_document_tokens()) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/develop_formula_understanding/","title":"Formula enrichment","text":"<p>WARNING This example demonstrates only how to develop a new enrichment model. It does not run the actual formula understanding model.</p> In\u00a0[\u00a0]: Copied! <pre>import logging\nfrom collections.abc import Iterable\nfrom pathlib import Path\n</pre> import logging from collections.abc import Iterable from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>from docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem\n</pre> from docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem In\u00a0[\u00a0]: Copied! 
<pre>from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.base_model import BaseItemAndImageEnrichmentModel\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n</pre> from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.base_model import BaseItemAndImageEnrichmentModel from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline In\u00a0[\u00a0]: Copied! <pre>class ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions):\n do_formula_understanding: bool = True\n</pre> class ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions): do_formula_understanding: bool = True In\u00a0[\u00a0]: Copied! <pre># A new enrichment model using both the document element and its image as input\nclass ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel):\n images_scale = 2.6\n\n def __init__(self, enabled: bool):\n self.enabled = enabled\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return (\n self.enabled\n and isinstance(element, TextItem)\n and element.label == DocItemLabel.FORMULA\n )\n\n def __call__(\n self,\n doc: DoclingDocument,\n element_batch: Iterable[ItemAndImageEnrichmentElement],\n ) -> Iterable[NodeItem]:\n if not self.enabled:\n return\n\n for enrich_element in element_batch:\n enrich_element.image.show()\n\n yield enrich_element.item\n</pre> # A new enrichment model using both the document element and its image as input class ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel): images_scale = 2.6 def __init__(self, enabled: bool): self.enabled = enabled def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return ( self.enabled and isinstance(element, TextItem) and element.label == DocItemLabel.FORMULA ) def __call__( self, doc: DoclingDocument, element_batch: Iterable[ItemAndImageEnrichmentElement], ) -> Iterable[NodeItem]: if not self.enabled: return for enrich_element in element_batch: enrich_element.image.show() yield enrich_element.item In\u00a0[\u00a0]: Copied! <pre># How the pipeline can be extended.\nclass ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):\n def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions\n\n self.enrichment_pipe = [\n ExampleFormulaUnderstandingEnrichmentModel(\n enabled=self.pipeline_options.do_formula_understanding\n )\n ]\n\n if self.pipeline_options.do_formula_understanding:\n self.keep_backend = True\n\n @classmethod\n def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions:\n return ExampleFormulaUnderstandingPipelineOptions()\n</pre> # How the pipeline can be extended. 
class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline): def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions self.enrichment_pipe = [ ExampleFormulaUnderstandingEnrichmentModel( enabled=self.pipeline_options.do_formula_understanding ) ] if self.pipeline_options.do_formula_understanding: self.keep_backend = True @classmethod def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions: return ExampleFormulaUnderstandingPipelineOptions() In\u00a0[\u00a0]: Copied! <pre># Example main. In the final version, we simply have to set do_formula_understanding to true.\ndef main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2203.01017v2.pdf\"\n\n pipeline_options = ExampleFormulaUnderstandingPipelineOptions()\n pipeline_options.do_formula_understanding = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ExampleFormulaUnderstandingPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n doc_converter.convert(input_doc_path)\n</pre> # Example main. In the final version, we simply have to set do_formula_understanding to true. def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2203.01017v2.pdf\" pipeline_options = ExampleFormulaUnderstandingPipelineOptions() pipeline_options.do_formula_understanding = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ExampleFormulaUnderstandingPipeline, pipeline_options=pipeline_options, ) } ) doc_converter.convert(input_doc_path) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/develop_picture_enrichment/","title":"Figure enrichment","text":"<p>WARNING This example demonstrates only how to develop a new enrichment model. It does not run the actual picture classifier model.</p> In\u00a0[\u00a0]: Copied! <pre>import logging\nfrom collections.abc import Iterable\nfrom pathlib import Path\nfrom typing import Any\n</pre> import logging from collections.abc import Iterable from pathlib import Path from typing import Any In\u00a0[\u00a0]: Copied! <pre>from docling_core.types.doc import (\n DoclingDocument,\n NodeItem,\n PictureClassificationClass,\n PictureClassificationData,\n PictureItem,\n)\n</pre> from docling_core.types.doc import ( DoclingDocument, NodeItem, PictureClassificationClass, PictureClassificationData, PictureItem, ) In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.models.base_model import BaseEnrichmentModel\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.models.base_model import BaseEnrichmentModel from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline In\u00a0[\u00a0]: Copied! 
<pre>class ExamplePictureClassifierPipelineOptions(PdfPipelineOptions):\n do_picture_classifier: bool = True\n</pre> class ExamplePictureClassifierPipelineOptions(PdfPipelineOptions): do_picture_classifier: bool = True In\u00a0[\u00a0]: Copied! <pre>class ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel):\n def __init__(self, enabled: bool):\n self.enabled = enabled\n\n def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:\n return self.enabled and isinstance(element, PictureItem)\n\n def __call__(\n self, doc: DoclingDocument, element_batch: Iterable[NodeItem]\n ) -> Iterable[Any]:\n if not self.enabled:\n return\n\n for element in element_batch:\n assert isinstance(element, PictureItem)\n\n # uncomment this to interactively visualize the image\n # element.get_image(doc).show()\n\n element.annotations.append(\n PictureClassificationData(\n provenance=\"example_classifier-0.0.1\",\n predicted_classes=[\n PictureClassificationClass(class_name=\"dummy\", confidence=0.42)\n ],\n )\n )\n\n yield element\n</pre> class ExamplePictureClassifierEnrichmentModel(BaseEnrichmentModel): def __init__(self, enabled: bool): self.enabled = enabled def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool: return self.enabled and isinstance(element, PictureItem) def __call__( self, doc: DoclingDocument, element_batch: Iterable[NodeItem] ) -> Iterable[Any]: if not self.enabled: return for element in element_batch: assert isinstance(element, PictureItem) # uncomment this to interactively visualize the image # element.get_image(doc).show() element.annotations.append( PictureClassificationData( provenance=\"example_classifier-0.0.1\", predicted_classes=[ PictureClassificationClass(class_name=\"dummy\", confidence=0.42) ], ) ) yield element In\u00a0[\u00a0]: Copied! <pre>class ExamplePictureClassifierPipeline(StandardPdfPipeline):\n def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions):\n super().__init__(pipeline_options)\n self.pipeline_options: ExamplePictureClassifierPipelineOptions\n\n self.enrichment_pipe = [\n ExamplePictureClassifierEnrichmentModel(\n enabled=pipeline_options.do_picture_classifier\n )\n ]\n\n @classmethod\n def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions:\n return ExamplePictureClassifierPipelineOptions()\n</pre> class ExamplePictureClassifierPipeline(StandardPdfPipeline): def __init__(self, pipeline_options: ExamplePictureClassifierPipelineOptions): super().__init__(pipeline_options) self.pipeline_options: ExamplePictureClassifierPipelineOptions self.enrichment_pipe = [ ExamplePictureClassifierEnrichmentModel( enabled=pipeline_options.do_picture_classifier ) ] @classmethod def get_default_options(cls) -> ExamplePictureClassifierPipelineOptions: return ExamplePictureClassifierPipelineOptions() In\u00a0[\u00a0]: Copied! 
<pre>def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = ExamplePictureClassifierPipelineOptions()\n pipeline_options.images_scale = 2.0\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=ExamplePictureClassifierPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n\n for element, _level in result.document.iterate_items():\n if isinstance(element, PictureItem):\n print(\n f\"The model populated the `data` portion of picture {element.self_ref}:\\n{element.annotations}\"\n )\n</pre> def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = ExamplePictureClassifierPipelineOptions() pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=ExamplePictureClassifierPipeline, pipeline_options=pipeline_options, ) } ) result = doc_converter.convert(input_doc_path) for element, _level in result.document.iterate_items(): if isinstance(element, PictureItem): print( f\"The model populated the `data` portion of picture {element.self_ref}:\\n{element.annotations}\" ) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/enrich_doclingdocument/","title":"Enrich DoclingDocument","text":"In\u00a0[\u00a0]: Copied! <pre>from pathlib import Path\nfrom typing import Iterable, Optional\n</pre> from pathlib import Path from typing import Iterable, Optional In\u00a0[\u00a0]: Copied! <pre>from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem\nfrom rich.pretty import pprint\n</pre> from docling_core.types.doc import BoundingBox, DocItem, DoclingDocument, NodeItem from rich.pretty import pprint In\u00a0[\u00a0]: Copied! <pre>from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\nfrom docling.datamodel.accelerator_options import AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement\nfrom docling.datamodel.document import InputDocument\nfrom docling.models.base_model import BaseItemAndImageEnrichmentModel\nfrom docling.models.document_picture_classifier import (\n DocumentPictureClassifier,\n DocumentPictureClassifierOptions,\n)\nfrom docling.utils.utils import chunkify\n</pre> from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend from docling.datamodel.accelerator_options import AcceleratorOptions from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement from docling.datamodel.document import InputDocument from docling.models.base_model import BaseItemAndImageEnrichmentModel from docling.models.document_picture_classifier import ( DocumentPictureClassifier, DocumentPictureClassifierOptions, ) from docling.utils.utils import chunkify In\u00a0[\u00a0]: Copied! <pre>BATCH_SIZE = 4\n</pre> BATCH_SIZE = 4 In\u00a0[\u00a0]: Copied! 
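<p>Before defining the pre-processing function in the next cell, it helps to see the cropping idea in isolation: the element's bounding box is grown by a small margin on every side before the page image is cropped. A quick numeric illustration with made-up values (the real margin comes from the enrichment model's <code>expansion_factor</code>; exact import paths may vary):</p> <pre>from docling_core.types.doc import BoundingBox, CoordOrigin\n\n# Hypothetical element box (bottom-left origin): 200 pt wide, 100 pt tall\nbbox = BoundingBox(l=100, t=200, r=300, b=100, coord_origin=CoordOrigin.BOTTOMLEFT)\nfactor = 0.05 # illustrative expansion factor\nwidth, height = bbox.r - bbox.l, bbox.t - bbox.b\nexpanded = BoundingBox(\n l=bbox.l - width * factor,\n t=bbox.t + height * factor,\n r=bbox.r + width * factor,\n b=bbox.b - height * factor,\n coord_origin=bbox.coord_origin,\n)\nprint(expanded) # 10 pt margin left/right, 5 pt margin top/bottom\n</pre> 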
<pre>def prepare_element(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n element: NodeItem,\n) -> Optional[ItemAndImageEnrichmentElement]:\n if not model.is_processable(doc=doc, element=element):\n return None\n\n assert isinstance(element, DocItem)\n element_prov = element.prov[0]\n\n bbox = element_prov.bbox\n width = bbox.r - bbox.l\n height = bbox.t - bbox.b\n\n expanded_bbox = BoundingBox(\n l=bbox.l - width * model.expansion_factor,\n t=bbox.t + height * model.expansion_factor,\n r=bbox.r + width * model.expansion_factor,\n b=bbox.b - height * model.expansion_factor,\n coord_origin=bbox.coord_origin,\n )\n\n page_ix = element_prov.page_no - 1\n page_backend = backend.load_page(page_no=page_ix)\n cropped_image = page_backend.get_page_image(\n scale=model.images_scale, cropbox=expanded_bbox\n )\n return ItemAndImageEnrichmentElement(item=element, image=cropped_image)\n</pre> def prepare_element( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, element: NodeItem, ) -> Optional[ItemAndImageEnrichmentElement]: if not model.is_processable(doc=doc, element=element): return None assert isinstance(element, DocItem) element_prov = element.prov[0] bbox = element_prov.bbox width = bbox.r - bbox.l height = bbox.t - bbox.b expanded_bbox = BoundingBox( l=bbox.l - width * model.expansion_factor, t=bbox.t + height * model.expansion_factor, r=bbox.r + width * model.expansion_factor, b=bbox.b - height * model.expansion_factor, coord_origin=bbox.coord_origin, ) page_ix = element_prov.page_no - 1 page_backend = backend.load_page(page_no=page_ix) cropped_image = page_backend.get_page_image( scale=model.images_scale, cropbox=expanded_bbox ) return ItemAndImageEnrichmentElement(item=element, image=cropped_image) In\u00a0[\u00a0]: Copied! <pre>def enrich_document(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n) -> DoclingDocument:\n def _prepare_elements(\n doc: DoclingDocument,\n backend: PyPdfiumDocumentBackend,\n model: BaseItemAndImageEnrichmentModel,\n ) -> Iterable[NodeItem]:\n for doc_element, _level in doc.iterate_items():\n prepared_element = prepare_element(\n doc=doc, backend=backend, model=model, element=doc_element\n )\n if prepared_element is not None:\n yield prepared_element\n\n for element_batch in chunkify(\n _prepare_elements(doc, backend, model),\n BATCH_SIZE,\n ):\n for element in model(doc=doc, element_batch=element_batch): # Must exhaust!\n pass\n\n return doc\n</pre> def enrich_document( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, ) -> DoclingDocument: def _prepare_elements( doc: DoclingDocument, backend: PyPdfiumDocumentBackend, model: BaseItemAndImageEnrichmentModel, ) -> Iterable[NodeItem]: for doc_element, _level in doc.iterate_items(): prepared_element = prepare_element( doc=doc, backend=backend, model=model, element=doc_element ) if prepared_element is not None: yield prepared_element for element_batch in chunkify( _prepare_elements(doc, backend, model), BATCH_SIZE, ): for element in model(doc=doc, element_batch=element_batch): # Must exhaust! pass return doc In\u00a0[\u00a0]: Copied! 
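<p>The batching in <code>enrich_document()</code> above relies on <code>chunkify()</code> from <code>docling.utils.utils</code>. As a rough sketch of what such a helper does (an illustrative stand-in, not the actual docling implementation), it groups an iterable into fixed-size batches:</p> <pre>from itertools import islice\nfrom typing import Iterable, Iterator, List, TypeVar\n\nT = TypeVar(\"T\")\n\ndef chunkify_sketch(iterable: Iterable[T], size: int) -> Iterator[List[T]]:\n # Yield successive lists of at most `size` items from `iterable`.\n it = iter(iterable)\n while batch := list(islice(it, size)):\n yield batch\n</pre> 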
<pre>def main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_pdf_path = data_folder / \"pdf/2206.01062.pdf\"\n\n input_doc_path = data_folder / \"groundtruth/docling_v2/2206.01062.json\"\n\n doc = DoclingDocument.load_from_json(input_doc_path)\n\n in_pdf_doc = InputDocument(\n input_pdf_path,\n format=InputFormat.PDF,\n backend=PyPdfiumDocumentBackend,\n filename=input_pdf_path.name,\n )\n backend = in_pdf_doc._backend\n\n model = DocumentPictureClassifier(\n enabled=True,\n artifacts_path=None,\n options=DocumentPictureClassifierOptions(),\n accelerator_options=AcceleratorOptions(),\n )\n\n doc = enrich_document(doc=doc, backend=backend, model=model)\n\n for pic in doc.pictures[:5]:\n print(pic.self_ref)\n pprint(pic.annotations)\n</pre> def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_pdf_path = data_folder / \"pdf/2206.01062.pdf\" input_doc_path = data_folder / \"groundtruth/docling_v2/2206.01062.json\" doc = DoclingDocument.load_from_json(input_doc_path) in_pdf_doc = InputDocument( input_pdf_path, format=InputFormat.PDF, backend=PyPdfiumDocumentBackend, filename=input_pdf_path.name, ) backend = in_pdf_doc._backend model = DocumentPictureClassifier( enabled=True, artifacts_path=None, options=DocumentPictureClassifierOptions(), accelerator_options=AcceleratorOptions(), ) doc = enrich_document(doc=doc, backend=backend, model=model) for pic in doc.pictures[:5]: print(pic.self_ref) pprint(pic.annotations) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/enrich_doclingdocument/#enrich-doclingdocument","title":"Enrich DoclingDocument\u00b6","text":"<p>This example shows how to run Docling enrichment models on documents that have already been converted and stored as serialized DoclingDocument JSON files.</p>"},{"location":"examples/enrich_doclingdocument/#load-modules","title":"Load modules\u00b6","text":""},{"location":"examples/enrich_doclingdocument/#define-batch-size-used-for-processing","title":"Define batch size used for processing\u00b6","text":""},{"location":"examples/enrich_doclingdocument/#from-docitem-to-the-model-inputs","title":"From DocItem to the model inputs\u00b6","text":"<p>The following function is responsible for taking an item and applying the required pre-processing for the model. In this case we generate a cropped image from the document backend.</p>"},{"location":"examples/enrich_doclingdocument/#iterate-through-the-document","title":"Iterate through the document\u00b6","text":"<p>This block defines <code>enrich_document()</code>, which iterates through the document and batches the selected document items for running through the model.</p>"},{"location":"examples/enrich_doclingdocument/#open-and-process","title":"Open and process\u00b6","text":"<p>The <code>main()</code> function initializes the document and model objects and calls <code>enrich_document()</code>.</p>"},{"location":"examples/export_figures/","title":"Figure export","text":"In\u00a0[\u00a0]: Copied! <pre>import logging\nimport time\nfrom pathlib import Path\n</pre> import logging import time from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>from docling_core.types.doc import ImageRefMode, PictureItem, TableItem\n</pre> from docling_core.types.doc import ImageRefMode, PictureItem, TableItem In\u00a0[\u00a0]: Copied! 
<pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied! <pre>_log = logging.getLogger(__name__)\n</pre> _log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied! <pre>IMAGE_RESOLUTION_SCALE = 2.0\n</pre> IMAGE_RESOLUTION_SCALE = 2.0 In\u00a0[\u00a0]: Copied! <pre>def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n # Important: For operating with page images, we must keep them, otherwise the DocumentConverter\n # will destroy them to clean up memory.\n # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.\n # scale=1 corresponds to a standard 72 DPI image\n # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched\n # with the image field\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n pipeline_options.generate_picture_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n doc_filename = conv_res.input.file.stem\n\n # Save page images\n for page_no, page in conv_res.document.pages.items():\n page_no = page.page_no\n page_image_filename = output_dir / f\"{doc_filename}-{page_no}.png\"\n with page_image_filename.open(\"wb\") as fp:\n page.image.pil_image.save(fp, format=\"PNG\")\n\n # Save images of figures and tables\n table_counter = 0\n picture_counter = 0\n for element, _level in conv_res.document.iterate_items():\n if isinstance(element, TableItem):\n table_counter += 1\n element_image_filename = (\n output_dir / f\"{doc_filename}-table-{table_counter}.png\"\n )\n with element_image_filename.open(\"wb\") as fp:\n element.get_image(conv_res.document).save(fp, \"PNG\")\n\n if isinstance(element, PictureItem):\n picture_counter += 1\n element_image_filename = (\n output_dir / f\"{doc_filename}-picture-{picture_counter}.png\"\n )\n with element_image_filename.open(\"wb\") as fp:\n element.get_image(conv_res.document).save(fp, \"PNG\")\n\n # Save markdown with embedded pictures\n md_filename = output_dir / f\"{doc_filename}-with-images.md\"\n conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n # Save markdown with externally referenced pictures\n md_filename = output_dir / f\"{doc_filename}-with-image-refs.md\"\n conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)\n\n # Save HTML with externally referenced pictures\n html_filename = output_dir / f\"{doc_filename}-with-image-refs.html\"\n conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED)\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted and figures exported in {end_time:.2f} seconds.\")\n</pre> def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / 
\"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # Important: For operating with page images, we must keep them, otherwise the DocumentConverter # will destroy them for cleaning up memory. # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images. # scale=1 correspond of a standard 72 DPI image # The PdfPipelineOptions.generate_* are the selectors for the document elements which will be enriched # with the image field pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_res.input.file.stem # Save page images for page_no, page in conv_res.document.pages.items(): page_no = page.page_no page_image_filename = output_dir / f\"{doc_filename}-{page_no}.png\" with page_image_filename.open(\"wb\") as fp: page.image.pil_image.save(fp, format=\"PNG\") # Save images of figures and tables table_counter = 0 picture_counter = 0 for element, _level in conv_res.document.iterate_items(): if isinstance(element, TableItem): table_counter += 1 element_image_filename = ( output_dir / f\"{doc_filename}-table-{table_counter}.png\" ) with element_image_filename.open(\"wb\") as fp: element.get_image(conv_res.document).save(fp, \"PNG\") if isinstance(element, PictureItem): picture_counter += 1 element_image_filename = ( output_dir / f\"{doc_filename}-picture-{picture_counter}.png\" ) with element_image_filename.open(\"wb\") as fp: element.get_image(conv_res.document).save(fp, \"PNG\") # Save markdown with embedded pictures md_filename = output_dir / f\"{doc_filename}-with-images.md\" conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) # Save markdown with externally referenced pictures md_filename = output_dir / f\"{doc_filename}-with-image-refs.md\" conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED) # Save HTML with externally referenced pictures html_filename = output_dir / f\"{doc_filename}-with-image-refs.html\" conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED) end_time = time.time() - start_time _log.info(f\"Document converted and figures exported in {end_time:.2f} seconds.\") In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/export_multimodal/","title":"Multimodal export","text":"In\u00a0[\u00a0]: Copied! <pre>import datetime\nimport logging\nimport time\nfrom pathlib import Path\n</pre> import datetime import logging import time from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>import pandas as pd\n</pre> import pandas as pd In\u00a0[\u00a0]: Copied! 
<pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.utils.export import generate_multimodal_pages\nfrom docling.utils.utils import create_hash\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption from docling.utils.export import generate_multimodal_pages from docling.utils.utils import create_hash In\u00a0[\u00a0]: Copied! <pre>_log = logging.getLogger(__name__)\n</pre> _log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied! <pre>IMAGE_RESOLUTION_SCALE = 2.0\n</pre> IMAGE_RESOLUTION_SCALE = 2.0 In\u00a0[\u00a0]: Copied! <pre>def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n # Important: For operating with page images, we must keep them, otherwise the DocumentConverter\n # will destroy them to clean up memory.\n # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.\n # scale=1 corresponds to a standard 72 DPI image\n pipeline_options = PdfPipelineOptions()\n pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n pipeline_options.generate_page_images = True\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n )\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n\n rows = []\n for (\n content_text,\n content_md,\n content_dt,\n page_cells,\n page_segments,\n page,\n ) in generate_multimodal_pages(conv_res):\n dpi = page._default_image_scale * 72\n\n rows.append(\n {\n \"document\": conv_res.input.file.name,\n \"hash\": conv_res.input.document_hash,\n \"page_hash\": create_hash(\n conv_res.input.document_hash + \":\" + str(page.page_no - 1)\n ),\n \"image\": {\n \"width\": page.image.width,\n \"height\": page.image.height,\n \"bytes\": page.image.tobytes(),\n },\n \"cells\": page_cells,\n \"contents\": content_text,\n \"contents_md\": content_md,\n \"contents_dt\": content_dt,\n \"segments\": page_segments,\n \"extra\": {\n \"page_num\": page.page_no + 1,\n \"width_in_points\": page.size.width,\n \"height_in_points\": page.size.height,\n \"dpi\": dpi,\n },\n }\n )\n\n # Generate one parquet from all documents\n df_result = pd.json_normalize(rows)\n now = datetime.datetime.now()\n output_filename = output_dir / f\"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet\"\n df_result.to_parquet(output_filename)\n\n end_time = time.time() - start_time\n\n _log.info(\n f\"Document converted and multimodal pages generated in {end_time:.2f} seconds.\"\n )\n\n # This block demonstrates how the file can be opened with the HF datasets library\n # from datasets import Dataset\n # from PIL import Image\n # multimodal_df = pd.read_parquet(output_filename)\n\n # # Convert pandas DataFrame to Hugging Face Dataset and load bytes into image\n # dataset = Dataset.from_pandas(multimodal_df)\n # def transforms(examples):\n # examples[\"image\"] = Image.frombytes('RGB', (examples[\"image.width\"], examples[\"image.height\"]), examples[\"image.bytes\"], 'raw')\n # return examples\n # dataset = dataset.map(transforms)\n</pre> def main(): 
logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") # Important: For operating with page images, we must keep them, otherwise the DocumentConverter # will destroy them to clean up memory. # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images. # scale=1 corresponds to a standard 72 DPI image pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) rows = [] for ( content_text, content_md, content_dt, page_cells, page_segments, page, ) in generate_multimodal_pages(conv_res): dpi = page._default_image_scale * 72 rows.append( { \"document\": conv_res.input.file.name, \"hash\": conv_res.input.document_hash, \"page_hash\": create_hash( conv_res.input.document_hash + \":\" + str(page.page_no - 1) ), \"image\": { \"width\": page.image.width, \"height\": page.image.height, \"bytes\": page.image.tobytes(), }, \"cells\": page_cells, \"contents\": content_text, \"contents_md\": content_md, \"contents_dt\": content_dt, \"segments\": page_segments, \"extra\": { \"page_num\": page.page_no + 1, \"width_in_points\": page.size.width, \"height_in_points\": page.size.height, \"dpi\": dpi, }, } ) # Generate one parquet from all documents df_result = pd.json_normalize(rows) now = datetime.datetime.now() output_filename = output_dir / f\"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet\" df_result.to_parquet(output_filename) end_time = time.time() - start_time _log.info( f\"Document converted and multimodal pages generated in {end_time:.2f} seconds.\" ) # This block demonstrates how the file can be opened with the HF datasets library # from datasets import Dataset # from PIL import Image # multimodal_df = pd.read_parquet(output_filename) # # Convert pandas DataFrame to Hugging Face Dataset and load bytes into image # dataset = Dataset.from_pandas(multimodal_df) # def transforms(examples): # examples[\"image\"] = Image.frombytes('RGB', (examples[\"image.width\"], examples[\"image.height\"]), examples[\"image.bytes\"], 'raw') # return examples # dataset = dataset.map(transforms) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/export_tables/","title":"Table export","text":"In\u00a0[\u00a0]: Copied! <pre>import logging\nimport time\nfrom pathlib import Path\n</pre> import logging import time from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>import pandas as pd\n</pre> import pandas as pd In\u00a0[\u00a0]: Copied! <pre>from docling.document_converter import DocumentConverter\n</pre> from docling.document_converter import DocumentConverter In\u00a0[\u00a0]: Copied! <pre>_log = logging.getLogger(__name__)\n</pre> _log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied! 
<pre>def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n output_dir = Path(\"scratch\")\n\n doc_converter = DocumentConverter()\n\n start_time = time.time()\n\n conv_res = doc_converter.convert(input_doc_path)\n\n output_dir.mkdir(parents=True, exist_ok=True)\n\n doc_filename = conv_res.input.file.stem\n\n # Export tables\n for table_ix, table in enumerate(conv_res.document.tables):\n table_df: pd.DataFrame = table.export_to_dataframe()\n print(f\"## Table {table_ix}\")\n print(table_df.to_markdown())\n\n # Save the table as csv\n element_csv_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.csv\"\n _log.info(f\"Saving CSV table to {element_csv_filename}\")\n table_df.to_csv(element_csv_filename)\n\n # Save the table as html\n element_html_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.html\"\n _log.info(f\"Saving HTML table to {element_html_filename}\")\n with element_html_filename.open(\"w\") as fp:\n fp.write(table.export_to_html(doc=conv_res.document))\n\n end_time = time.time() - start_time\n\n _log.info(f\"Document converted and tables exported in {end_time:.2f} seconds.\")\n</pre> def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") doc_converter = DocumentConverter() start_time = time.time() conv_res = doc_converter.convert(input_doc_path) output_dir.mkdir(parents=True, exist_ok=True) doc_filename = conv_res.input.file.stem # Export tables for table_ix, table in enumerate(conv_res.document.tables): table_df: pd.DataFrame = table.export_to_dataframe() print(f\"## Table {table_ix}\") print(table_df.to_markdown()) # Save the table as csv element_csv_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.csv\" _log.info(f\"Saving CSV table to {element_csv_filename}\") table_df.to_csv(element_csv_filename) # Save the table as html element_html_filename = output_dir / f\"{doc_filename}-table-{table_ix + 1}.html\" _log.info(f\"Saving HTML table to {element_html_filename}\") with element_html_filename.open(\"w\") as fp: fp.write(table.export_to_html(doc=conv_res.document)) end_time = time.time() - start_time _log.info(f\"Document converted and tables exported in {end_time:.2f} seconds.\") In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/full_page_ocr/","title":"Force full page OCR","text":"In\u00a0[\u00a0]: Copied! <pre>from pathlib import Path\n</pre> from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n TesseractCliOcrOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, TesseractCliOcrOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied! 
<pre>def main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n\n # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions\n # ocr_options = EasyOcrOptions(force_full_page_ocr=True)\n # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n # ocr_options = OcrMacOptions(force_full_page_ocr=True)\n # ocr_options = RapidOcrOptions(force_full_page_ocr=True)\n ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)\n pipeline_options.ocr_options = ocr_options\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n doc = converter.convert(input_doc_path).document\n md = doc.export_to_markdown()\n print(md)\n</pre> def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" pipeline_options = PdfPipelineOptions() pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions # ocr_options = EasyOcrOptions(force_full_page_ocr=True) # ocr_options = TesseractOcrOptions(force_full_page_ocr=True) # ocr_options = OcrMacOptions(force_full_page_ocr=True) # ocr_options = RapidOcrOptions(force_full_page_ocr=True) ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True) pipeline_options.ocr_options = ocr_options converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(input_doc_path).document md = doc.export_to_markdown() print(md) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/hybrid_chunking/","title":"Hybrid chunking","text":"<p>Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.</p> <p>For more details, see here.</p> In\u00a0[1]: Copied! <pre>%pip install -qU pip docling transformers\n</pre> %pip install -qU pip docling transformers <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> In\u00a0[2]: Copied! <pre>DOC_SOURCE = \"../../tests/data/md/wiki.md\"\n</pre> DOC_SOURCE = \"../../tests/data/md/wiki.md\" <p>We first convert the document:</p> In\u00a0[3]: Copied! <pre>from docling.document_converter import DocumentConverter\n\ndoc = DocumentConverter().convert(source=DOC_SOURCE).document\n</pre> from docling.document_converter import DocumentConverter doc = DocumentConverter().convert(source=DOC_SOURCE).document <p>For a basic chunking scenario, we can just instantiate a <code>HybridChunker</code>, which will use the default parameters.</p> In\u00a0[4]: Copied! <pre>from docling.chunking import HybridChunker\n\nchunker = HybridChunker()\nchunk_iter = chunker.chunk(dl_doc=doc)\n</pre> from docling.chunking import HybridChunker chunker = HybridChunker() chunk_iter = chunker.chunk(dl_doc=doc) <pre>Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). 
Running this sequence through the model will result in indexing errors\n</pre> <p>\ud83d\udc49 NOTE: As you see above, using the <code>HybridChunker</code> can sometimes lead to a warning from the transformers library; however, this is a \"false alarm\" \u2014 for details check here.</p> <p>Note that the text you would typically want to embed is the context-enriched one as returned by the <code>contextualize()</code> method:</p> In\u00a0[5]: Copied! <pre>for i, chunk in enumerate(chunk_iter):\n print(f\"=== {i} ===\")\n print(f\"chunk.text:\\n{f'{chunk.text[:300]}\u2026'!r}\")\n\n enriched_text = chunker.contextualize(chunk=chunk)\n print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}\u2026'!r}\")\n\n print()\n</pre> for i, chunk in enumerate(chunk_iter): print(f\"=== {i} ===\") print(f\"chunk.text:\\n{f'{chunk.text[:300]}\u2026'!r}\") enriched_text = chunker.contextualize(chunk=chunk) print(f\"chunker.contextualize(chunk):\\n{f'{enriched_text[:300]}\u2026'!r}\") print() <pre>=== 0 ===\nchunk.text:\n'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver\u2026'\nchunker.contextualize(chunk):\n'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial \u2026'\n\n=== 1 ===\nchunk.text:\n'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1910s\u20131950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889\u2026'\n\n=== 2 ===\nchunk.text:\n'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1910s\u20131950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John \u2026'\n\n=== 3 ===\nchunk.text:\n'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.\u2026'\nchunker.contextualize(chunk):\n'IBM\\n1960s\u20131980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.\u2026'\n\n</pre> In\u00a0[6]: Copied! 
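<p>To get a feel for the token budget the next cell configures, you can count tokens directly with the same embedding model's tokenizer; a minimal sketch (using the model ID from the next cell, counting sub-word tokens without special tokens, so the chunker's own count may differ slightly):</p> <pre>from transformers import AutoTokenizer\n\ntok = AutoTokenizer.from_pretrained(\"sentence-transformers/all-MiniLM-L6-v2\")\nsample = \"IBM was founded in 1911 as the Computing-Tabulating-Recording Company.\"\nprint(len(tok.tokenize(sample))) # tokens this sentence consumes against max_tokens\n</pre> 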
<pre>from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\nfrom transformers import AutoTokenizer\n\nfrom docling.chunking import HybridChunker\n\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nMAX_TOKENS = 64 # set to a small number for illustrative purposes\n\ntokenizer = HuggingFaceTokenizer(\n tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case\n)\n</pre> from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer from docling.chunking import HybridChunker EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" MAX_TOKENS = 64 # set to a small number for illustrative purposes tokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case ) <p>\ud83d\udc49 Alternatively, OpenAI tokenizers can be used as shown in the example below (uncomment to use \u2014 requires installing <code>docling-core[chunking-openai]</code>):</p> In\u00a0[7]: Copied! <pre># import tiktoken\n\n# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer\n\n# tokenizer = OpenAITokenizer(\n# tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"),\n# max_tokens=128 * 1024, # context window length required for OpenAI tokenizers\n# )\n</pre> # import tiktoken # from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer # tokenizer = OpenAITokenizer( # tokenizer=tiktoken.encoding_for_model(\"gpt-4o\"), # max_tokens=128 * 1024, # context window length required for OpenAI tokenizers # ) <p>We can now instantiate our chunker:</p> In\u00a0[8]: Copied! <pre>chunker = HybridChunker(\n tokenizer=tokenizer,\n merge_peers=True, # optional, defaults to True\n)\nchunk_iter = chunker.chunk(dl_doc=doc)\nchunks = list(chunk_iter)\n</pre> chunker = HybridChunker( tokenizer=tokenizer, merge_peers=True, # optional, defaults to True ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter) <p>Points to notice looking at the output chunks below:</p> <ul> <li>Where possible, we fit the limit of 64 tokens for the metadata-enriched serialization form (see chunk 2)</li> <li>Where needed, we stop before the limit, e.g. see cases of 63 as it would otherwise run into a comma (see chunk 6)</li> <li>Where possible, we merge undersized peer chunks (see chunk 0)</li> <li>\"Tail\" chunks trailing right after merges may still be undersized (see chunk 8)</li> </ul> In\u00a0[9]: Copied! 
<pre>for i, chunk in enumerate(chunks):\n print(f\"=== {i} ===\")\n txt_tokens = tokenizer.count_tokens(chunk.text)\n print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\")\n\n ser_txt = chunker.contextualize(chunk=chunk)\n ser_tokens = tokenizer.count_tokens(ser_txt)\n print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\")\n\n print()\n</pre> for i, chunk in enumerate(chunks): print(f\"=== {i} ===\") txt_tokens = tokenizer.count_tokens(chunk.text) print(f\"chunk.text ({txt_tokens} tokens):\\n{chunk.text!r}\") ser_txt = chunker.contextualize(chunk=chunk) ser_tokens = tokenizer.count_tokens(ser_txt) print(f\"chunker.contextualize(chunk) ({ser_tokens} tokens):\\n{ser_txt!r}\") print() <pre>=== 0 ===\nchunk.text (55 tokens):\n'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\nchunker.contextualize(chunk) (56 tokens):\n'IBM\\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'\n\n=== 1 ===\nchunk.text (45 tokens):\n'IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\nchunker.contextualize(chunk) (46 tokens):\n'IBM\\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n\n=== 2 ===\nchunk.text (63 tokens):\n'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n\n=== 3 ===\nchunk.text (44 tokens):\n\"IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\nchunker.contextualize(chunk) (45 tokens):\n\"IBM\\nIBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. 
and 70 percent of computers worldwide.[11]\"\n\n=== 4 ===\nchunk.text (63 tokens):\n'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, \u2014 its DOS software provided by Microsoft, \u2014 which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, \u2014 its DOS software provided by Microsoft, \u2014 which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n\n=== 5 ===\nchunk.text (61 tokens):\n'IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\nchunker.contextualize(chunk) (62 tokens):\n'IBM\\nIBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n\n=== 6 ===\nchunk.text (62 tokens):\n\"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\nchunker.contextualize(chunk) (63 tokens):\n\"IBM\\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n\n=== 7 ===\nchunk.text (63 tokens):\n'language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\nlanguage, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n\n=== 8 ===\nchunk.text (5 tokens):\n'Awards.[16]'\nchunker.contextualize(chunk) (6 tokens):\n'IBM\\nAwards.[16]'\n\n=== 9 ===\nchunk.text (56 tokens):\n'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\nchunker.contextualize(chunk) (60 tokens):\n'IBM\\n1910s\u20131950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. 
Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n\n=== 10 ===\nchunk.text (60 tokens):\n\"(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\nchunker.contextualize(chunk) (64 tokens):\n\"IBM\\n1910s\u20131950s\\n(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n\n=== 11 ===\nchunk.text (59 tokens):\n'Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\nchunker.contextualize(chunk) (63 tokens):\n'IBM\\n1910s\u20131950s\\nComputing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n\n=== 12 ===\nchunk.text (13 tokens):\n'D.C.; and Toronto, Canada.[22]'\nchunker.contextualize(chunk) (17 tokens):\n'IBM\\n1910s\u20131950s\\nD.C.; and Toronto, Canada.[22]'\n\n=== 13 ===\nchunk.text (60 tokens):\n'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\n1910s\u20131950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. 
Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n\n=== 14 ===\nchunk.text (59 tokens):\n\"on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\nchunker.contextualize(chunk) (63 tokens):\n\"IBM\\n1910s\u20131950s\\non Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n\n=== 15 ===\nchunk.text (23 tokens):\n\"practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\nchunker.contextualize(chunk) (27 tokens):\n\"IBM\\n1910s\u20131950s\\npractices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n\n=== 16 ===\nchunk.text (59 tokens):\n'He implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\nchunker.contextualize(chunk) (63 tokens):\n'IBM\\n1910s\u20131950s\\nHe implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n\n=== 17 ===\nchunk.text (60 tokens):\n'\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\nchunker.contextualize(chunk) (64 tokens):\n'IBM\\n1910s\u20131950s\\n\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n\n=== 18 ===\nchunk.text (57 tokens):\n'clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\nchunker.contextualize(chunk) (61 tokens):\n'IBM\\n1910s\u20131950s\\nclumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n\n=== 19 ===\nchunk.text (21 tokens):\n'1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\nchunker.contextualize(chunk) (25 tokens):\n'IBM\\n1910s\u20131950s\\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n\n=== 20 ===\nchunk.text (22 tokens):\n'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\nchunker.contextualize(chunk) (26 tokens):\n'IBM\\n1960s\u20131980s\\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the 
highly successful Selectric typewriter.'\n\n</pre>"},{"location":"examples/hybrid_chunking/#hybrid-chunking","title":"Hybrid chunking\u00b6","text":""},{"location":"examples/hybrid_chunking/#overview","title":"Overview\u00b6","text":""},{"location":"examples/hybrid_chunking/#setup","title":"Setup\u00b6","text":""},{"location":"examples/hybrid_chunking/#basic-usage","title":"Basic usage\u00b6","text":""},{"location":"examples/hybrid_chunking/#configuring-tokenization","title":"Configuring tokenization\u00b6","text":"<p>For more control over the chunking, we can parametrize tokenization as shown below.</p> <p>In a RAG / retrieval context, it is important to make sure that the chunker and embedding model are using the same tokenizer.</p> <p>\ud83d\udc49 HuggingFace transformers tokenizers can be used as shown in the following example:</p>"},{"location":"examples/inspect_picture_content/","title":"Inspect picture content","text":"In\u00a0[\u00a0]: <pre>from docling_core.types.doc import TextItem\n</pre> In\u00a0[\u00a0]: <pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> In\u00a0[\u00a0]: <pre>source = \"tests/data/pdf/amt_handbook_sample.pdf\"\n</pre> In\u00a0[\u00a0]: <pre>pipeline_options = PdfPipelineOptions()\npipeline_options.images_scale = 2\npipeline_options.generate_page_images = True\n</pre> In\u00a0[\u00a0]: <pre>doc_converter = DocumentConverter(\n format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\n</pre> In\u00a0[\u00a0]: <pre>result = doc_converter.convert(source)\n</pre> In\u00a0[\u00a0]: <pre>doc = result.document\n</pre> In\u00a0[\u00a0]: <pre>for picture in doc.pictures:\n # picture.get_image(doc).show() # display the picture\n print(picture.caption_text(doc), \" contains these elements:\")\n\n for item, level in doc.iterate_items(root=picture, traverse_pictures=True):\n if isinstance(item, TextItem):\n print(item.text)\n\n print(\"\\n\")\n</pre>"},{"location":"examples/minimal/","title":"Simple conversion","text":"In\u00a0[\u00a0]: <pre>from docling.document_converter import DocumentConverter\n</pre> In\u00a0[\u00a0]: <pre>source = \"https://arxiv.org/pdf/2408.09869\" # document from a local path or URL\n</pre> In\u00a0[\u00a0]:
<pre>converter = DocumentConverter()\ndoc = converter.convert(source).document\n</pre> In\u00a0[\u00a0]: <pre>print(doc.export_to_markdown())\n# output: ## Docling Technical Report [...]\n</pre>"},{"location":"examples/minimal_asr_pipeline/","title":"ASR pipeline with Whisper","text":"In\u00a0[\u00a0]: <pre>from pathlib import Path\n</pre> In\u00a0[\u00a0]: <pre>from docling_core.types.doc import DoclingDocument\n</pre> In\u00a0[\u00a0]: <pre>from docling.datamodel import asr_model_specs\nfrom docling.datamodel.base_models import ConversionStatus, InputFormat\nfrom docling.datamodel.document import ConversionResult\nfrom docling.datamodel.pipeline_options import AsrPipelineOptions\nfrom docling.document_converter import AudioFormatOption, DocumentConverter\nfrom docling.pipeline.asr_pipeline import AsrPipeline\n</pre> In\u00a0[\u00a0]: <pre>def get_asr_converter():\n \"\"\"Create a DocumentConverter configured for ASR with the whisper_turbo model.\"\"\"\n pipeline_options = AsrPipelineOptions()\n pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO\n\n converter = DocumentConverter(\n format_options={\n InputFormat.AUDIO: AudioFormatOption(\n pipeline_cls=AsrPipeline,\n pipeline_options=pipeline_options,\n )\n }\n )\n return converter\n</pre> In\u00a0[\u00a0]: <pre>def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:\n \"\"\"ASR pipeline conversion using whisper_turbo\"\"\"\n # Check if the test audio file exists\n assert audio_path.exists(), f\"Test audio file not found: {audio_path}\"\n\n converter = get_asr_converter()\n\n # Convert the audio file\n result: ConversionResult = converter.convert(audio_path)\n\n # Verify conversion was successful\n assert result.status == ConversionStatus.SUCCESS, (\n f\"Conversion failed with status: {result.status}\"\n )\n return result.document\n</pre> In\u00a0[\u00a0]:
<pre>if __name__ == \"__main__\":\n audio_path = Path(\"tests/data/audio/sample_10s.mp3\")\n\n doc = asr_pipeline_conversion(audio_path=audio_path)\n print(doc.export_to_markdown())\n\n # Expected output:\n #\n # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde\n #\n # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.\n</pre> if __name__ == \"__main__\": audio_path = Path(\"tests/data/audio/sample_10s.mp3\") doc = asr_pipeline_conversion(audio_path=audio_path) print(doc.export_to_markdown()) # Expected output: # # [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde # # [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain."},{"location":"examples/minimal_vlm_pipeline/","title":"VLM pipeline with SmolDocling","text":"In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel import vlm_model_specs\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n</pre> from docling.datamodel import vlm_model_specs from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline In\u00a0[\u00a0]: Copied! <pre>source = \"https://arxiv.org/pdf/2501.17887\"\n</pre> source = \"https://arxiv.org/pdf/2501.17887\" In\u00a0[\u00a0]: Copied! <pre>converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n ),\n }\n)\n</pre> converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, ), } ) In\u00a0[\u00a0]: Copied! <pre>doc = converter.convert(source=source).document\n</pre> doc = converter.convert(source=source).document In\u00a0[\u00a0]: Copied! <pre>print(doc.export_to_markdown())\n</pre> print(doc.export_to_markdown()) In\u00a0[\u00a0]: Copied! <pre>pipeline_options = VlmPipelineOptions(\n vlm_options=vlm_model_specs.SMOLDOCLING_MLX,\n)\n</pre> pipeline_options = VlmPipelineOptions( vlm_options=vlm_model_specs.SMOLDOCLING_MLX, ) In\u00a0[\u00a0]: Copied! <pre>converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n</pre> converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=VlmPipeline, pipeline_options=pipeline_options, ), } ) In\u00a0[\u00a0]: Copied! <pre>doc = converter.convert(source=source).document\n</pre> doc = converter.convert(source=source).document In\u00a0[\u00a0]: Copied! <pre>print(doc.export_to_markdown())\n</pre> print(doc.export_to_markdown())"},{"location":"examples/minimal_vlm_pipeline/#using-simple-default-values","title":"USING SIMPLE DEFAULT VALUES\u00b6","text":"<ul> <li>SmolDocling model</li> <li>Using the transformers framework</li> </ul>"},{"location":"examples/minimal_vlm_pipeline/#using-macos-mps-accelerator","title":"USING MACOS MPS ACCELERATOR\u00b6","text":"<p>For more options see the compare_vlm_models.py example.</p>"},{"location":"examples/pictures_description/","title":"Annotate picture with local VLM","text":"In\u00a0[\u00a0]: Copied! 
<pre>%pip install -q docling[vlm] ipython\n</pre> <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> In\u00a0[1]: <pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> In\u00a0[2]: <pre># The source document\nDOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\"\n</pre> In\u00a0[3]: <pre>from docling.datamodel.pipeline_options import granite_picture_description\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = (\n granite_picture_description # <-- the model choice\n)\npipeline_options.picture_description_options.prompt = (\n \"Describe the image in three sentences. Be concise and accurate.\"\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\ndoc = converter.convert(DOC_SOURCE).document\n</pre> <pre>Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.\n</pre> In\u00a0[4]:
<pre>from docling_core.types.doc.document import PictureDescriptionData\nfrom IPython import display\n\nhtml_buffer = []\n# display the first 5 pictures and their captions and annotations:\nfor pic in doc.pictures[:5]:\n html_item = (\n f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n f'<img src=\"{pic.image.uri!s}\" /><br />'\n f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n )\n for annotation in pic.annotations:\n if not isinstance(annotation, PictureDescriptionData):\n continue\n html_item += (\n f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n )\n html_buffer.append(html_item)\ndisplay.HTML(\"<hr />\".join(html_buffer))\n</pre> Out[4]: Picture <code>#/pictures/0</code>CaptionFigure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a poster with some text and images. Picture <code>#/pictures/1</code>CaptionFigure 2: Dataset categories and sample counts for documents and pages.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a pie chart. In the pie chart we can see the categories and the number of documents in each category. Picture <code>#/pictures/2</code>CaptionFigure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a graph. On the x-axis we can see the number of pages. On the y-axis we can see the seconds. Picture <code>#/pictures/3</code>CaptionFigure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a bar chart and a line chart. In the bar chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. In the line chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. Picture <code>#/pictures/4</code>CaptionFigure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).Annotations (ibm-granite/granite-vision-3.1-2b-preview)In this image we can see a bar chart. In the chart we can see the CPU, Max, GPU, and sec/page. In\u00a0[7]: <pre>from docling.datamodel.pipeline_options import smolvlm_picture_description\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = (\n smolvlm_picture_description # <-- the model choice\n)\npipeline_options.picture_description_options.prompt = (\n \"Describe the image in three sentences. Be concise and accurate.\"\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\ndoc = converter.convert(DOC_SOURCE).document\n</pre> In\u00a0[6]: <pre>from docling_core.types.doc.document import PictureDescriptionData\nfrom IPython import display\n\nhtml_buffer = []\n# display the first 5 pictures and their captions and annotations:\nfor pic in doc.pictures[:5]:\n html_item = (\n f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n f'<img src=\"{pic.image.uri!s}\" /><br />'\n f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n )\n for annotation in pic.annotations:\n if not isinstance(annotation, PictureDescriptionData):\n continue\n html_item += (\n f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n )\n html_buffer.append(html_item)\ndisplay.HTML(\"<hr />\".join(html_buffer))\n</pre> Out[6]: Picture <code>#/pictures/0</code>CaptionFigure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)This is a page that has different types of documents on it. 
Picture <code>#/pictures/1</code>CaptionFigure 2: Dataset categories and sample counts for documents and pages.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)Here is a page-by-page list of documents per category: - Science - Articles - Law and Regulations - Articles - Misc. Picture <code>#/pictures/2</code>CaptionFigure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)The image is a bar chart that shows the number of pages of a website as a function of the number of pages of the website. The x-axis represents the number of pages, ranging from 100 to 10,000. The y-axis represents the number of pages, ranging from 100 to 10,000. The chart is labeled \"Number of pages\" and has a legend at the top of the chart that indicates the number of pages. The chart shows a clear trend: as the number of pages increases, the number of pages decreases. This is evident from the following points: - The number of pages increases from 100 to 1000. - The number of pages decreases from 1000 to 10,000. - The number of pages increases from 10,000 to 10,000. Picture <code>#/pictures/3</code>CaptionFigure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)bar chart with different colored bars representing different data points. Picture <code>#/pictures/4</code>CaptionFigure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)A bar chart with the following information: - The x-axis represents the number of pages, ranging from 0 to 14. - The y-axis represents the page count, ranging from 0 to 14. - The chart has three categories: Marker, Unstructured, and Detailed. - The x-axis is labeled \"see/page.\" - The y-axis is labeled \"Page Count.\" - The chart shows that the Marker category has the highest number of pages, followed by the Unstructured category, and then the Detailed category. In\u00a0[8]: Copied! <pre>from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM\n prompt=\"Describe the image in three sentences. 
Be consise and accurate.\",\n)\npipeline_options.images_scale = 2.0\npipeline_options.generate_picture_images = True\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n)\n\n# Uncomment to run:\n# doc = converter.convert(DOC_SOURCE).document\n</pre> from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions pipeline_options = PdfPipelineOptions() pipeline_options.do_picture_description = True pipeline_options.picture_description_options = PictureDescriptionVlmOptions( repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM prompt=\"Describe the image in three sentences. Be consise and accurate.\", ) pipeline_options.images_scale = 2.0 pipeline_options.generate_picture_images = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) # Uncomment to run: # doc = converter.convert(DOC_SOURCE).document In\u00a0[\u00a0]: Copied! <pre>\n</pre>"},{"location":"examples/pictures_description/#describe-pictures-with-granite-vision","title":"Describe pictures with Granite Vision\u00b6","text":"<p>This section will run locally the ibm-granite/granite-vision-3.1-2b-preview model to describe the pictures of the document.</p>"},{"location":"examples/pictures_description/#describe-pictures-with-smolvlm","title":"Describe pictures with SmolVLM\u00b6","text":"<p>This section will run locally the HuggingFaceTB/SmolVLM-256M-Instruct model to describe the pictures of the document.</p>"},{"location":"examples/pictures_description/#use-other-vision-models","title":"Use other vision models\u00b6","text":"<p>The examples above can also be reproduced using other vision model. The Docling options <code>PictureDescriptionVlmOptions</code> allows to specify your favorite vision model from the Hugging Face Hub.</p>"},{"location":"examples/pictures_description_api/","title":"Annotate picture with remote VLM","text":"In\u00a0[\u00a0]: Copied! <pre>import logging\nimport os\nfrom pathlib import Path\n</pre> import logging import os from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>import requests\nfrom docling_core.types.doc import PictureItem\nfrom dotenv import load_dotenv\n</pre> import requests from docling_core.types.doc import PictureItem from dotenv import load_dotenv In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionApiOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PictureDescriptionApiOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied! <pre>def vllm_local_options(model: str):\n options = PictureDescriptionApiOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=model,\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=90,\n )\n return options\n</pre> def vllm_local_options(model: str): options = PictureDescriptionApiOptions( url=\"http://localhost:8000/v1/chat/completions\", params=dict( model=model, seed=42, max_completion_tokens=200, ), prompt=\"Describe the image in three sentences. 
Be consise and accurate.\", timeout=90, ) return options In\u00a0[\u00a0]: Copied! <pre>def lms_local_options(model: str):\n options = PictureDescriptionApiOptions(\n url=\"http://localhost:1234/v1/chat/completions\",\n params=dict(\n model=model,\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=90,\n )\n return options\n</pre> def lms_local_options(model: str): options = PictureDescriptionApiOptions( url=\"http://localhost:1234/v1/chat/completions\", params=dict( model=model, seed=42, max_completion_tokens=200, ), prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=90, ) return options In\u00a0[\u00a0]: Copied! <pre>def watsonx_vlm_options():\n load_dotenv()\n api_key = os.environ.get(\"WX_API_KEY\")\n project_id = os.environ.get(\"WX_PROJECT_ID\")\n\n def _get_iam_access_token(api_key: str) -> str:\n res = requests.post(\n url=\"https://iam.cloud.ibm.com/identity/token\",\n headers={\n \"Content-Type\": \"application/x-www-form-urlencoded\",\n },\n data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\",\n )\n res.raise_for_status()\n api_out = res.json()\n print(f\"{api_out=}\")\n return api_out[\"access_token\"]\n\n options = PictureDescriptionApiOptions(\n url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\",\n params=dict(\n model_id=\"ibm/granite-vision-3-2-2b\",\n project_id=project_id,\n parameters=dict(\n max_new_tokens=400,\n ),\n ),\n headers={\n \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key),\n },\n prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n timeout=60,\n )\n return options\n</pre> def watsonx_vlm_options(): load_dotenv() api_key = os.environ.get(\"WX_API_KEY\") project_id = os.environ.get(\"WX_PROJECT_ID\") def _get_iam_access_token(api_key: str) -> str: res = requests.post( url=\"https://iam.cloud.ibm.com/identity/token\", headers={ \"Content-Type\": \"application/x-www-form-urlencoded\", }, data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\", ) res.raise_for_status() api_out = res.json() print(f\"{api_out=}\") return api_out[\"access_token\"] options = PictureDescriptionApiOptions( url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\", params=dict( model_id=\"ibm/granite-vision-3-2-2b\", project_id=project_id, parameters=dict( max_new_tokens=400, ), ), headers={ \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key), }, prompt=\"Describe the image in three sentences. Be consise and accurate.\", timeout=60, ) return options In\u00a0[\u00a0]: Copied! <pre>def main():\n logging.basicConfig(level=logging.INFO)\n\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n pipeline_options = PdfPipelineOptions(\n enable_remote_services=True # <-- this is required!\n )\n pipeline_options.do_picture_description = True\n\n # The PictureDescriptionApiOptions() allows to interface with APIs supporting\n # the multi-modal chat interface. Here follow a few example on how to configure those.\n #\n # One possibility is self-hosting model, e.g. 
via vLLM.\n # $ vllm serve MODEL_NAME\n # Then PictureDescriptionApiOptions can point to the localhost endpoint.\n\n # Example for the Granite Vision model:\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = vllm_local_options(\n # model=\"ibm-granite/granite-vision-3.3-2b\"\n # )\n\n # Example for the SmolVLM model:\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = vllm_local_options(\n # model=\"HuggingFaceTB/SmolVLM-256M-Instruct\"\n # )\n\n # To use models on LM Studio with the built-in GGUF or MLX runtimes, e.g. the SmolVLM model:\n # (uncomment the following lines)\n pipeline_options.picture_description_options = lms_local_options(\n model=\"smolvlm-256m-instruct\"\n )\n\n # Another possibility is using online services, e.g. watsonx.ai.\n # Using it requires setting the env variables WX_API_KEY and WX_PROJECT_ID.\n # (uncomment the following lines)\n # pipeline_options.picture_description_options = watsonx_vlm_options()\n\n doc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n result = doc_converter.convert(input_doc_path)\n\n for element, _level in result.document.iterate_items():\n if isinstance(element, PictureItem):\n print(\n f\"Picture {element.self_ref}\\n\"\n f\"Caption: {element.caption_text(doc=result.document)}\\n\"\n f\"Annotations: {element.annotations}\"\n )\n</pre> In\u00a0[\u00a0]:
<pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/pictures_description_api/#example-of-picturedescriptionapioptions-definitions","title":"Example of PictureDescriptionApiOptions definitions\u00b6","text":""},{"location":"examples/pictures_description_api/#using-vllm","title":"Using vLLM\u00b6","text":"<p>Models can be launched via: $ vllm serve MODEL_NAME</p>"},{"location":"examples/pictures_description_api/#using-lm-studio","title":"Using LM Studio\u00b6","text":""},{"location":"examples/pictures_description_api/#using-a-cloud-service-like-ibm-watsonxai","title":"Using a cloud service like IBM watsonx.ai\u00b6","text":""},{"location":"examples/pictures_description_api/#usage-and-conversion","title":"Usage and conversion\u00b6","text":""},{"location":"examples/rag_azuresearch/","title":"RAG with Azure AI Search","text":"Step Tech Execution Embedding Azure OpenAI \ud83c\udf10 Remote Vector Store Azure AI Search \ud83c\udf10 Remote Gen AI Azure OpenAI \ud83c\udf10 Remote In\u00a0[\u00a0]: Copied! <pre># If running in a fresh environment (like Google Colab), uncomment and run this single command:\n%pip install \"docling~=2.12\" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv\n</pre> # If running in a fresh environment (like Google Colab), uncomment and run this single command: %pip install \"docling~=2.12\" azure-search-documents==11.5.2 azure-identity openai rich torch python-dotenv In\u00a0[1]: Copied! <pre>import os\n\nfrom dotenv import load_dotenv\n\nload_dotenv()\n\n\ndef _get_env(key, default=None):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key, default)\n\n\nAZURE_SEARCH_ENDPOINT = _get_env(\"AZURE_SEARCH_ENDPOINT\")\nAZURE_SEARCH_KEY = _get_env(\"AZURE_SEARCH_KEY\") # Ensure this is your Admin Key\nAZURE_SEARCH_INDEX_NAME = _get_env(\"AZURE_SEARCH_INDEX_NAME\", \"docling-rag-sample\")\nAZURE_OPENAI_ENDPOINT = _get_env(\"AZURE_OPENAI_ENDPOINT\")\nAZURE_OPENAI_API_KEY = _get_env(\"AZURE_OPENAI_API_KEY\")\nAZURE_OPENAI_API_VERSION = _get_env(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\")\nAZURE_OPENAI_CHAT_MODEL = _get_env(\n \"AZURE_OPENAI_CHAT_MODEL\"\n) # Using a deployed model named \"gpt-4o\"\nAZURE_OPENAI_EMBEDDINGS = _get_env(\n \"AZURE_OPENAI_EMBEDDINGS\", \"text-embedding-3-small\"\n) # Using a deployed model named \"text-embeddings-3-small\"\n</pre> import os from dotenv import load_dotenv load_dotenv() def _get_env(key, default=None): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key, default) AZURE_SEARCH_ENDPOINT = _get_env(\"AZURE_SEARCH_ENDPOINT\") AZURE_SEARCH_KEY = _get_env(\"AZURE_SEARCH_KEY\") # Ensure this is your Admin Key AZURE_SEARCH_INDEX_NAME = _get_env(\"AZURE_SEARCH_INDEX_NAME\", \"docling-rag-sample\") AZURE_OPENAI_ENDPOINT = _get_env(\"AZURE_OPENAI_ENDPOINT\") AZURE_OPENAI_API_KEY = _get_env(\"AZURE_OPENAI_API_KEY\") AZURE_OPENAI_API_VERSION = _get_env(\"AZURE_OPENAI_API_VERSION\", \"2024-10-21\") AZURE_OPENAI_CHAT_MODEL = _get_env( \"AZURE_OPENAI_CHAT_MODEL\" ) # Using a deployed model named \"gpt-4o\" AZURE_OPENAI_EMBEDDINGS = _get_env( \"AZURE_OPENAI_EMBEDDINGS\", \"text-embedding-3-small\" ) # Using a deployed model named \"text-embeddings-3-small\" In\u00a0[11]: Copied! 
<pre>from rich.console import Console\nfrom rich.panel import Panel\n\nfrom docling.document_converter import DocumentConverter\n\nconsole = Console()\n\n# This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages\nsource_url = \"https://arxiv.org/pdf/2404.16130\"\n\nconsole.print(\n \"[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]\"\n)\nconverter = DocumentConverter()\nresult = converter.convert(source_url)\n\n# Optional: preview the parsed Markdown\nmd_preview = result.document.export_to_markdown()\nconsole.print(Panel(md_preview[:500] + \"...\", title=\"Docling Markdown Preview\"))\n</pre> from rich.console import Console from rich.panel import Panel from docling.document_converter import DocumentConverter console = Console() # This URL points to the Microsoft GraphRAG Research Paper (arXiv: 2404.16130), ~15 pages source_url = \"https://arxiv.org/pdf/2404.16130\" console.print( \"[bold yellow]Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...[/bold yellow]\" ) converter = DocumentConverter() result = converter.convert(source_url) # Optional: preview the parsed Markdown md_preview = result.document.export_to_markdown() console.print(Panel(md_preview[:500] + \"...\", title=\"Docling Markdown Preview\")) <pre>Parsing a ~15-page PDF. The process should be relatively quick, even on CPU...\n</pre> <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Docling Markdown Preview \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 ## From Local to Global: A Graph RAG Approach to Query-Focused Summarization \u2502\n\u2502 \u2502\n\u2502 Darren Edge 1\u2020 \u2502\n\u2502 \u2502\n\u2502 Ha Trinh 1\u2020 \u2502\n\u2502 \u2502\n\u2502 Newman Cheng 2 \u2502\n\u2502 \u2502\n\u2502 Joshua Bradley 2 \u2502\n\u2502 \u2502\n\u2502 Alex Chao 3 \u2502\n\u2502 \u2502\n\u2502 Apurva Mody 3 \u2502\n\u2502 \u2502\n\u2502 Steven Truitt 2 \u2502\n\u2502 \u2502\n\u2502 ## Jonathan Larson 1 \u2502\n\u2502 \u2502\n\u2502 1 Microsoft Research 2 Microsoft Strategic Missions and Technologies 3 Microsoft Office of the CTO \u2502\n\u2502 \u2502\n\u2502 { daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso } @microsoft.com \u2502\n\u2502 \u2502\n\u2502 \u2020 These authors contributed equally to this work \u2502\n\u2502 \u2502\n\u2502 ## Abstract \u2502\n\u2502 \u2502\n\u2502 The use of retrieval-augmented gen... 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> In\u00a0[22]: Copied! <pre>from docling.chunking import HierarchicalChunker\n\nchunker = HierarchicalChunker()\ndoc_chunks = list(chunker.chunk(result.document))\n\nall_chunks = []\nfor idx, c in enumerate(doc_chunks):\n chunk_text = c.text\n all_chunks.append((f\"chunk_{idx}\", chunk_text))\n\nconsole.print(f\"Total chunks from PDF: {len(all_chunks)}\")\n</pre> from docling.chunking import HierarchicalChunker chunker = HierarchicalChunker() doc_chunks = list(chunker.chunk(result.document)) all_chunks = [] for idx, c in enumerate(doc_chunks): chunk_text = c.text all_chunks.append((f\"chunk_{idx}\", chunk_text)) console.print(f\"Total chunks from PDF: {len(all_chunks)}\") <pre>Total chunks from PDF: 106\n</pre> In\u00a0[\u00a0]: Copied! <pre>from azure.core.credentials import AzureKeyCredential\nfrom azure.search.documents.indexes import SearchIndexClient\nfrom azure.search.documents.indexes.models import (\n AzureOpenAIVectorizer,\n AzureOpenAIVectorizerParameters,\n HnswAlgorithmConfiguration,\n SearchableField,\n SearchField,\n SearchFieldDataType,\n SearchIndex,\n SimpleField,\n VectorSearch,\n VectorSearchProfile,\n)\nfrom rich.console import Console\n\nconsole = Console()\n\nVECTOR_DIM = 1536 # Adjust based on your chosen embeddings model\n\nindex_client = SearchIndexClient(\n AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY)\n)\n\n\ndef create_search_index(index_name: str):\n # Define fields\n fields = [\n SimpleField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True),\n SearchableField(name=\"content\", type=SearchFieldDataType.String),\n SearchField(\n name=\"content_vector\",\n type=SearchFieldDataType.Collection(SearchFieldDataType.Single),\n searchable=True,\n filterable=False,\n sortable=False,\n facetable=False,\n vector_search_dimensions=VECTOR_DIM,\n vector_search_profile_name=\"default\",\n ),\n ]\n # Vector search config with an AzureOpenAIVectorizer\n vector_search = VectorSearch(\n algorithms=[HnswAlgorithmConfiguration(name=\"default\")],\n profiles=[\n VectorSearchProfile(\n name=\"default\",\n algorithm_configuration_name=\"default\",\n vectorizer_name=\"default\",\n )\n ],\n vectorizers=[\n AzureOpenAIVectorizer(\n vectorizer_name=\"default\",\n parameters=AzureOpenAIVectorizerParameters(\n resource_url=AZURE_OPENAI_ENDPOINT,\n deployment_name=AZURE_OPENAI_EMBEDDINGS,\n model_name=\"text-embedding-3-small\",\n api_key=AZURE_OPENAI_API_KEY,\n ),\n )\n ],\n )\n\n # Create or update the index\n new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)\n try:\n index_client.delete_index(index_name)\n except Exception:\n pass\n\n index_client.create_or_update_index(new_index)\n console.print(f\"Index '{index_name}' created.\")\n\n\ncreate_search_index(AZURE_SEARCH_INDEX_NAME)\n</pre> from azure.core.credentials import 
AzureKeyCredential from azure.search.documents.indexes import SearchIndexClient from azure.search.documents.indexes.models import ( AzureOpenAIVectorizer, AzureOpenAIVectorizerParameters, HnswAlgorithmConfiguration, SearchableField, SearchField, SearchFieldDataType, SearchIndex, SimpleField, VectorSearch, VectorSearchProfile, ) from rich.console import Console console = Console() VECTOR_DIM = 1536 # Adjust based on your chosen embeddings model index_client = SearchIndexClient( AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY) ) def create_search_index(index_name: str): # Define fields fields = [ SimpleField(name=\"chunk_id\", type=SearchFieldDataType.String, key=True), SearchableField(name=\"content\", type=SearchFieldDataType.String), SearchField( name=\"content_vector\", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), searchable=True, filterable=False, sortable=False, facetable=False, vector_search_dimensions=VECTOR_DIM, vector_search_profile_name=\"default\", ), ] # Vector search config with an AzureOpenAIVectorizer vector_search = VectorSearch( algorithms=[HnswAlgorithmConfiguration(name=\"default\")], profiles=[ VectorSearchProfile( name=\"default\", algorithm_configuration_name=\"default\", vectorizer_name=\"default\", ) ], vectorizers=[ AzureOpenAIVectorizer( vectorizer_name=\"default\", parameters=AzureOpenAIVectorizerParameters( resource_url=AZURE_OPENAI_ENDPOINT, deployment_name=AZURE_OPENAI_EMBEDDINGS, model_name=\"text-embedding-3-small\", api_key=AZURE_OPENAI_API_KEY, ), ) ], ) # Create or update the index new_index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search) try: index_client.delete_index(index_name) except Exception: pass index_client.create_or_update_index(new_index) console.print(f\"Index '{index_name}' created.\") create_search_index(AZURE_SEARCH_INDEX_NAME) <pre>Index 'docling-rag-sample-2' created.\n</pre> In\u00a0[28]: Copied! 
<pre>from azure.search.documents import SearchClient\nfrom openai import AzureOpenAI\n\nsearch_client = SearchClient(\n AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY)\n)\nopenai_client = AzureOpenAI(\n api_key=AZURE_OPENAI_API_KEY,\n api_version=AZURE_OPENAI_API_VERSION,\n azure_endpoint=AZURE_OPENAI_ENDPOINT,\n)\n\n\ndef embed_text(text: str):\n \"\"\"\n Helper to generate embeddings with Azure OpenAI.\n \"\"\"\n response = openai_client.embeddings.create(\n input=text, model=AZURE_OPENAI_EMBEDDINGS\n )\n return response.data[0].embedding\n\n\nupload_docs = []\nfor chunk_id, chunk_text in all_chunks:\n embedding_vector = embed_text(chunk_text)\n upload_docs.append(\n {\n \"chunk_id\": chunk_id,\n \"content\": chunk_text,\n \"content_vector\": embedding_vector,\n }\n )\n\n\nBATCH_SIZE = 50\nfor i in range(0, len(upload_docs), BATCH_SIZE):\n subset = upload_docs[i : i + BATCH_SIZE]\n resp = search_client.upload_documents(documents=subset)\n\n all_succeeded = all(r.succeeded for r in resp)\n console.print(\n f\"Uploaded batch {i} -> {i + len(subset)}; all_succeeded: {all_succeeded}, \"\n f\"first_doc_status_code: {resp[0].status_code}\"\n )\n\nconsole.print(\"All chunks uploaded to Azure Search.\")\n</pre> <pre>Uploaded batch 0 -> 50; all_succeeded: True, first_doc_status_code: 201\n</pre> <pre>Uploaded batch 50 -> 100; all_succeeded: True, first_doc_status_code: 201\n</pre> <pre>Uploaded batch 100 -> 106; all_succeeded: True, first_doc_status_code: 201\n</pre> <pre>All chunks uploaded to Azure Search.\n</pre>
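<p>The index above was created with a fixed <code>VECTOR_DIM</code> of 1536, so the embeddings deployment must actually produce vectors of that dimension. A one-line consistency check (a sketch reusing the <code>embed_text</code> helper defined above) can catch a mismatched deployment early:</p> <pre># Consistency check (sketch): the embedding dimension must match the index definition.\nassert len(embed_text(\"dimension probe\")) == VECTOR_DIM\n</pre> In\u00a0[29]: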
<pre>from typing import Optional\n\nfrom azure.search.documents.models import VectorizableTextQuery\n\n\ndef generate_chat_response(prompt: str, system_message: Optional[str] = None):\n \"\"\"\n Generates a single-turn chat response using Azure OpenAI Chat.\n If you need multi-turn conversation or follow-up queries, you'll have to\n maintain the messages list externally.\n \"\"\"\n messages = []\n if system_message:\n messages.append({\"role\": \"system\", \"content\": system_message})\n messages.append({\"role\": \"user\", \"content\": prompt})\n\n completion = openai_client.chat.completions.create(\n model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7\n )\n return completion.choices[0].message.content\n\n\nuser_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\"\nuser_embed = embed_text(user_query)\n\nvector_query = VectorizableTextQuery(\n text=user_query, # passing in text for a hybrid search\n k_nearest_neighbors=5,\n fields=\"content_vector\",\n)\n\nsearch_results = search_client.search(\n search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10\n)\n\nretrieved_chunks = []\nfor result in search_results:\n snippet = result[\"content\"]\n retrieved_chunks.append(snippet)\n\ncontext_str = \"\\n---\\n\".join(retrieved_chunks)\nrag_prompt = f\"\"\"\nYou are an AI assistant helping answering questions about Microsoft GraphRAG.\nUse ONLY the text below to answer the user's question.\nIf the answer isn't in the text, say you don't know.\n\nContext:\n{context_str}\n\nQuestion: {user_query}\nAnswer:\n\"\"\"\n\nfinal_answer = generate_chat_response(rag_prompt)\n\nconsole.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\"))\nconsole.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\"))\n</pre> from typing import Optional from azure.search.documents.models import VectorizableTextQuery def generate_chat_response(prompt: str, system_message: Optional[str] = None): \"\"\" Generates a single-turn chat response using Azure OpenAI Chat. If you need multi-turn conversation or follow-up queries, you'll have to maintain the messages list externally. \"\"\" messages = [] if system_message: messages.append({\"role\": \"system\", \"content\": system_message}) messages.append({\"role\": \"user\", \"content\": prompt}) completion = openai_client.chat.completions.create( model=AZURE_OPENAI_CHAT_MODEL, messages=messages, temperature=0.7 ) return completion.choices[0].message.content user_query = \"What are the main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG methods?\" user_embed = embed_text(user_query) vector_query = VectorizableTextQuery( text=user_query, # passing in text for a hybrid search k_nearest_neighbors=5, fields=\"content_vector\", ) search_results = search_client.search( search_text=user_query, vector_queries=[vector_query], select=[\"content\"], top=10 ) retrieved_chunks = [] for result in search_results: snippet = result[\"content\"] retrieved_chunks.append(snippet) context_str = \"\\n---\\n\".join(retrieved_chunks) rag_prompt = f\"\"\" You are an AI assistant helping answering questions about Microsoft GraphRAG. Use ONLY the text below to answer the user's question. If the answer isn't in the text, say you don't know. 
Context: {context_str} Question: {user_query} Answer: \"\"\" final_answer = generate_chat_response(rag_prompt) console.print(Panel(rag_prompt, title=\"RAG Prompt\", style=\"bold red\")) console.print(Panel(final_answer, title=\"RAG Response\", style=\"bold green\")) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 RAG Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 \u2502\n\u2502 You are an AI assistant helping answering questions about Microsoft GraphRAG. \u2502\n\u2502 Use ONLY the text below to answer the user's question. \u2502\n\u2502 If the answer isn't in the text, say you don't know. \u2502\n\u2502 \u2502\n\u2502 Context: \u2502\n\u2502 Community summaries vs. source texts. When comparing community summaries to source texts using Graph RAG, \u2502\n\u2502 community summaries generally provided a small but consistent improvement in answer comprehensiveness and \u2502\n\u2502 diversity, except for root-level summaries. Intermediate-level summaries in the Podcast dataset and low-level \u2502\n\u2502 community summaries in the News dataset achieved comprehensiveness win rates of 57% and 64%, respectively. \u2502\n\u2502 Diversity win rates were 57% for Podcast intermediate-level summaries and 60% for News low-level community \u2502\n\u2502 summaries. Table 3 also illustrates the scalability advantages of Graph RAG compared to source text \u2502\n\u2502 summarization: for low-level community summaries ( C3 ), Graph RAG required 26-33% fewer context tokens, while \u2502\n\u2502 for root-level community summaries ( C0 ), it required over 97% fewer tokens. For a modest drop in performance \u2502\n\u2502 compared with other global methods, root-level Graph RAG offers a highly efficient method for the iterative \u2502\n\u2502 question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness \u2502\n\u2502 (72% win rate) and diversity (62% win rate) over na\u00a8\u0131ve RAG. \u2502\n\u2502 --- \u2502\n\u2502 We have presented a global approach to Graph RAG, combining knowledge graph generation, retrieval-augmented \u2502\n\u2502 generation (RAG), and query-focused summarization (QFS) to support human sensemaking over entire text corpora. \u2502\n\u2502 Initial evaluations show substantial improvements over a na\u00a8\u0131ve RAG baseline for both the comprehensiveness and \u2502\n\u2502 diversity of answers, as well as favorable comparisons to a global but graph-free approach using map-reduce \u2502\n\u2502 source text summarization. For situations requiring many global queries over the same dataset, summaries of \u2502\n\u2502 root-level communities in the entity-based graph index provide a data index that is both superior to na\u00a8\u0131ve RAG \u2502\n\u2502 and achieves competitive performance to other global methods at a fraction of the token cost. \u2502\n\u2502 --- \u2502\n\u2502 Trade-offs of building a graph index . 
We consistently observed Graph RAG achieve the best headto-head results \u2502\n\u2502 against other methods, but in many cases the graph-free approach to global summarization of source texts \u2502\n\u2502 performed competitively. The real-world decision about whether to invest in building a graph index depends on \u2502\n\u2502 multiple factors, including the compute budget, expected number of lifetime queries per dataset, and value \u2502\n\u2502 obtained from other aspects of the graph index (including the generic community summaries and the use of other \u2502\n\u2502 graph-related RAG approaches). \u2502\n\u2502 --- \u2502\n\u2502 Future work . The graph index, rich text annotations, and hierarchical community structure supporting the \u2502\n\u2502 current Graph RAG approach offer many possibilities for refinement and adaptation. This includes RAG approaches \u2502\n\u2502 that operate in a more local manner, via embedding-based matching of user queries and graph annotations, as \u2502\n\u2502 well as the possibility of hybrid RAG schemes that combine embedding-based matching against community reports \u2502\n\u2502 before employing our map-reduce summarization mechanisms. This 'roll-up' operation could also be extended \u2502\n\u2502 across more levels of the community hierarchy, as well as implemented as a more exploratory 'drill down' \u2502\n\u2502 mechanism that follows the information scent contained in higher-level community summaries. \u2502\n\u2502 --- \u2502\n\u2502 Advanced RAG systems include pre-retrieval, retrieval, post-retrieval strategies designed to overcome the \u2502\n\u2502 drawbacks of Na\u00a8\u0131ve RAG, while Modular RAG systems include patterns for iterative and dynamic cycles of \u2502\n\u2502 interleaved retrieval and generation (Gao et al., 2023). Our implementation of Graph RAG incorporates multiple \u2502\n\u2502 concepts related to other systems. For example, our community summaries are a kind of self-memory (Selfmem, \u2502\n\u2502 Cheng et al., 2024) for generation-augmented retrieval (GAR, Mao et al., 2020) that facilitates future \u2502\n\u2502 generation cycles, while our parallel generation of community answers from these summaries is a kind of \u2502\n\u2502 iterative (Iter-RetGen, Shao et al., 2023) or federated (FeB4RAG, Wang et al., 2024) retrieval-generation \u2502\n\u2502 strategy. Other systems have also combined these concepts for multi-document summarization (CAiRE-COVID, Su et \u2502\n\u2502 al., 2020) and multi-hop question answering (ITRG, Feng et al., 2023; IR-CoT, Trivedi et al., 2022; DSP, \u2502\n\u2502 Khattab et al., 2022). Our use of a hierarchical index and summarization also bears resemblance to further \u2502\n\u2502 approaches, such as generating a hierarchical index of text chunks by clustering the vectors of text embeddings \u2502\n\u2502 (RAPTOR, Sarthi et al., 2024) or generating a 'tree of clarifications' to answer multiple interpretations of \u2502\n\u2502 ambiguous questions (Kim et al., 2023). However, none of these iterative or hierarchical approaches use the \u2502\n\u2502 kind of self-generated graph index that enables Graph RAG. \u2502\n\u2502 --- \u2502\n\u2502 The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge \u2502\n\u2502 source enables large language models (LLMs) to answer questions over private and/or previously unseen document \u2502\n\u2502 collections. 
However, RAG fails on global questions directed at an entire text corpus, such as 'What are the \u2502\n\u2502 main themes in the dataset?', since this is inherently a queryfocused summarization (QFS) task, rather than an \u2502\n\u2502 explicit retrieval task. Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by \u2502\n\u2502 typical RAGsystems. To combine the strengths of these contrasting methods, we propose a Graph RAG approach to \u2502\n\u2502 question answering over private text corpora that scales with both the generality of user questions and the \u2502\n\u2502 quantity of source text to be indexed. Our approach uses an LLM to build a graph-based text index in two \u2502\n\u2502 stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community \u2502\n\u2502 summaries for all groups of closely-related entities. Given a question, each community summary is used to \u2502\n\u2502 generate a partial response, before all partial responses are again summarized in a final response to the user. \u2502\n\u2502 For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG \u2502\n\u2502 leads to substantial improvements over a na\u00a8\u0131ve RAG baseline for both the comprehensiveness and diversity of \u2502\n\u2502 generated answers. An open-source, Python-based implementation of both global and local Graph RAG approaches is \u2502\n\u2502 forthcoming at https://aka . ms/graphrag . \u2502\n\u2502 --- \u2502\n\u2502 Given the multi-stage nature of our Graph RAG mechanism, the multiple conditions we wanted to compare, and the \u2502\n\u2502 lack of gold standard answers to our activity-based sensemaking questions, we decided to adopt a head-to-head \u2502\n\u2502 comparison approach using an LLM evaluator. We selected three target metrics capturing qualities that are \u2502\n\u2502 desirable for sensemaking activities, as well as a control metric (directness) used as a indicator of validity. \u2502\n\u2502 Since directness is effectively in opposition to comprehensiveness and diversity, we would not expect any \u2502\n\u2502 method to win across all four metrics. \u2502\n\u2502 --- \u2502\n\u2502 Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text. This index spans nodes \u2502\n\u2502 (e.g., entities), edges (e.g., relationships), and covariates (e.g., claims) that have been detected, \u2502\n\u2502 extracted, and summarized by LLM prompts tailored to the domain of the dataset. Community detection (e.g., \u2502\n\u2502 Leiden, Traag et al., 2019) is used to partition the graph index into groups of elements (nodes, edges, \u2502\n\u2502 covariates) that the LLM can summarize in parallel at both indexing time and query time. The 'global answer' to \u2502\n\u2502 a given query is produced using a final round of query-focused summarization over all community summaries \u2502\n\u2502 reporting relevance to that query. \u2502\n\u2502 --- \u2502\n\u2502 Retrieval-augmented generation (RAG, Lewis et al., 2020) is an established approach to answering user questions \u2502\n\u2502 over entire datasets, but it is designed for situations where these answers are contained locally within \u2502\n\u2502 regions of text whose retrieval provides sufficient grounding for the generation task. 
Instead, a more \u2502\n\u2502 appropriate task framing is query-focused summarization (QFS, Dang, 2006), and in particular, query-focused \u2502\n\u2502 abstractive summarization that generates natural language summaries and not just concatenated excerpts (Baumel \u2502\n\u2502 et al., 2018; Laskar et al., 2020; Yao et al., 2017) . In recent years, however, such distinctions between \u2502\n\u2502 summarization tasks that are abstractive versus extractive, generic versus query-focused, and single-document \u2502\n\u2502 versus multi-document, have become less relevant. While early applications of the transformer architecture \u2502\n\u2502 showed substantial improvements on the state-of-the-art for all such summarization tasks (Goodwin et al., 2020; \u2502\n\u2502 Laskar et al., 2022; Liu and Lapata, 2019), these tasks are now trivialized by modern LLMs, including the GPT \u2502\n\u2502 (Achiam et al., 2023; Brown et al., 2020), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) series, \u2502\n\u2502 all of which can use in-context learning to summarize any content provided in their context window. \u2502\n\u2502 --- \u2502\n\u2502 community descriptions provide complete coverage of the underlying graph index and the input documents it \u2502\n\u2502 represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: \u2502\n\u2502 first using each community summary to answer the query independently and in parallel, then summarizing all \u2502\n\u2502 relevant partial answers into a final global answer. \u2502\n\u2502 \u2502\n\u2502 Question: What are the main advantages of using the Graph RAG approach for query-focused summarization compared \u2502\n\u2502 to traditional RAG methods? \u2502\n\u2502 Answer: \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 RAG Response \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 The main advantages of using the Graph RAG approach for query-focused summarization compared to traditional RAG \u2502\n\u2502 methods include: \u2502\n\u2502 \u2502\n\u2502 1. **Improved Comprehensiveness and Diversity**: Graph RAG shows substantial improvements over a na\u00efve RAG \u2502\n\u2502 baseline in terms of the comprehensiveness and diversity of answers. 
This is particularly beneficial for global \u2502\n\u2502 sensemaking questions over large datasets. \u2502\n\u2502 \u2502\n\u2502 2. **Scalability**: Graph RAG provides scalability advantages, achieving efficient summarization with \u2502\n\u2502 significantly fewer context tokens required. For instance, it requires 26-33% fewer tokens for low-level \u2502\n\u2502 community summaries and over 97% fewer tokens for root-level summaries compared to source text summarization. \u2502\n\u2502 \u2502\n\u2502 3. **Efficiency in Iterative Question Answering**: Root-level Graph RAG offers a highly efficient method for \u2502\n\u2502 iterative question answering, which is crucial for sensemaking activities, with only a modest drop in \u2502\n\u2502 performance compared to other global methods. \u2502\n\u2502 \u2502\n\u2502 4. **Global Query Handling**: It supports handling global queries effectively, as it combines knowledge graph \u2502\n\u2502 generation, retrieval-augmented generation, and query-focused summarization, making it suitable for sensemaking \u2502\n\u2502 over entire text corpora. \u2502\n\u2502 \u2502\n\u2502 5. **Hierarchical Indexing and Summarization**: The use of a hierarchical index and summarization allows for \u2502\n\u2502 efficient processing and summarizing of community summaries into a final global answer, facilitating a \u2502\n\u2502 comprehensive coverage of the underlying graph index and input documents. \u2502\n\u2502 \u2502\n\u2502 6. **Reduced Token Cost**: For situations requiring many global queries over the same dataset, Graph RAG \u2502\n\u2502 achieves competitive performance to other global methods at a fraction of the token cost. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre>"},{"location":"examples/rag_azuresearch/#rag-with-azure-ai-search","title":"RAG with Azure AI Search\u00b6","text":""},{"location":"examples/rag_azuresearch/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"<p>This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using:</p> <ul> <li>Docling for document parsing and chunking</li> <li>Azure AI Search for vector indexing and retrieval</li> <li>Azure OpenAI for embeddings and chat completion</li> </ul> <p>This sample demonstrates how to:</p> <ol> <li>Parse a PDF with Docling.</li> <li>Chunk the parsed text.</li> <li>Use Azure OpenAI for embeddings.</li> <li>Index and search in Azure AI Search.</li> <li>Run a retrieval-augmented generation (RAG) query with Azure OpenAI GPT-4o.</li> </ol>"},{"location":"examples/rag_azuresearch/#part-0-prerequisites","title":"Part 0: Prerequisites\u00b6","text":"<ul> <li><p>Azure AI Search resource</p> </li> <li><p>Azure OpenAI resource with a deployed embedding and chat completion model (e.g. 
<code>text-embedding-3-small</code> and <code>gpt-4o</code>)</p> </li> <li><p>Docling 2.12+ (installs <code>docling_core</code> automatically; requires a Python 3.8+ environment)</p> </li> <li><p>A GPU-enabled environment is preferred for faster parsing. Docling 2.12 automatically detects the GPU if present.</p> <ul> <li>If you only have a CPU, parsing large PDFs can be slower.</li> </ul> </li> </ul>"},{"location":"examples/rag_azuresearch/#part-1-parse-the-pdf-with-docling","title":"Part 1: Parse the PDF with Docling\u00b6","text":"<p>We\u2019ll parse the Microsoft GraphRAG Research Paper (~15 pages). Parsing should be relatively quick, even on CPU, but it will be faster on a GPU or MPS device if available.</p> <p>(If you prefer a different document, simply provide a different URL or local file path.)</p>"},{"location":"examples/rag_azuresearch/#part-2-hierarchical-chunking","title":"Part 2: Hierarchical Chunking\u00b6","text":"<p>We convert the <code>Document</code> into smaller chunks for embedding and indexing. The built-in <code>HierarchicalChunker</code> preserves the document structure.</p>"},{"location":"examples/rag_azuresearch/#part-3-create-azure-ai-search-index-and-push-chunk-embeddings","title":"Part 3: Create Azure AI Search Index and Push Chunk Embeddings\u00b6","text":"<p>We\u2019ll define a vector index in Azure AI Search, then embed each chunk using Azure OpenAI and upload in batches.</p>"},{"location":"examples/rag_azuresearch/#generate-embeddings-and-upload-to-azure-ai-search","title":"Generate Embeddings and Upload to Azure AI Search\u00b6","text":""},{"location":"examples/rag_azuresearch/#part-4-perform-rag-over-pdf","title":"Part 4: Perform RAG over PDF\u00b6","text":"<p>Combine retrieval from Azure AI Search with Azure OpenAI Chat Completions (a.k.a. grounding your LLM).</p>"},{"location":"examples/rag_haystack/","title":"RAG with Haystack","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote <p>This example leverages the Haystack Docling extension, along with Milvus-based document store and retriever instances, as well as sentence-transformers embeddings.</p> <p>The presented <code>DoclingConverter</code> component enables you to:</p> <ul> <li>use various document types in your LLM applications with ease and speed, and</li> <li>leverage Docling's rich format for advanced, document-native grounding.</li> </ul> <p><code>DoclingConverter</code> supports two different export modes:</p> <ul> <li><code>ExportType.MARKDOWN</code>: if you want to capture each input document as a separate Haystack document, or</li> <li><code>ExportType.DOC_CHUNKS</code> (default): if you want to have each input document chunked and then capture each individual chunk as a separate Haystack document downstream.</li> </ul> <p>The example allows exploring both modes via the <code>EXPORT_TYPE</code> parameter; depending on the value set, the ingestion and RAG pipelines are then set up accordingly.</p> <ul> <li>\ud83d\udc49 For best conversion speed, use GPU acceleration whenever available; e.g., if running on Colab, use a GPU-enabled runtime.</li> <li>The notebook uses Hugging Face's Inference API; for an increased LLM quota, a token can be provided via the env var <code>HF_TOKEN</code>.</li> <li>Requirements can be installed as shown below (<code>--no-warn-conflicts</code> is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):</li> </ul> In\u00a0[1]: Copied!
<pre>%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling pymilvus milvus-haystack sentence-transformers python-dotenv\n</pre> <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> In\u00a0[2]: Copied! <pre>import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom docling_haystack.converter import ExportType\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nPATHS = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGENERATION_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nEXPORT_TYPE = ExportType.DOC_CHUNKS\nQUESTION = \"Which are the main AI models in Docling?\"\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n</pre>
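<p>If you want to see what the chunker will feed into the pipeline before running the full ingestion, here is a minimal standalone sketch (assuming the <code>PATHS</code> and <code>EMBED_MODEL_ID</code> values defined above):</p> <pre>from docling.chunking import HybridChunker\nfrom docling.document_converter import DocumentConverter\n\n# Convert the source document and preview the first few chunks\ndoc = DocumentConverter().convert(PATHS[0]).document\nchunker = HybridChunker(tokenizer=EMBED_MODEL_ID)\nfor chunk in list(chunker.chunk(dl_doc=doc))[:3]:\n print(repr(chunk.text[:80]))\n</pre> In\u00a0[3]: Copied!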
<pre>from docling_haystack.converter import DoclingConverter\nfrom haystack import Pipeline\nfrom haystack.components.embedders import (\n SentenceTransformersDocumentEmbedder,\n SentenceTransformersTextEmbedder,\n)\nfrom haystack.components.preprocessors import DocumentSplitter\nfrom haystack.components.writers import DocumentWriter\nfrom milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever\n\nfrom docling.chunking import HybridChunker\n\ndocument_store = MilvusDocumentStore(\n connection_args={\"uri\": MILVUS_URI},\n drop_old=True,\n text_field=\"txt\", # set for preventing conflict with same-name metadata field\n)\n\nidx_pipe = Pipeline()\nidx_pipe.add_component(\n \"converter\",\n DoclingConverter(\n export_type=EXPORT_TYPE,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n ),\n)\nidx_pipe.add_component(\n \"embedder\",\n SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),\n)\nidx_pipe.add_component(\"writer\", DocumentWriter(document_store=document_store))\nif EXPORT_TYPE == ExportType.DOC_CHUNKS:\n idx_pipe.connect(\"converter\", \"embedder\")\nelif EXPORT_TYPE == ExportType.MARKDOWN:\n idx_pipe.add_component(\n \"splitter\",\n DocumentSplitter(split_by=\"sentence\", split_length=1),\n )\n idx_pipe.connect(\"converter.documents\", \"splitter.documents\")\n idx_pipe.connect(\"splitter.documents\", \"embedder.documents\")\nelse:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\nidx_pipe.connect(\"embedder\", \"writer\")\nidx_pipe.run({\"converter\": {\"paths\": PATHS}})\n</pre> <pre>Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n</pre> <pre>Batches: 0%| | 0/2 [00:00<?, ?it/s]</pre> Out[3]: <pre>{'writer': {'documents_written': 54}}</pre>
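<p>As an optional check that ingestion produced what the writer reported, the document store can be queried directly; a minimal sketch (the count should match the <code>documents_written</code> value above):</p> <pre># Number of Haystack documents now stored in Milvus\nprint(document_store.count_documents())\n</pre> In\u00a0[4]: Copied!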
<pre>from haystack.components.builders import AnswerBuilder\nfrom haystack.components.builders.prompt_builder import PromptBuilder\nfrom haystack.components.generators import HuggingFaceAPIGenerator\nfrom haystack.utils import Secret\n\nprompt_template = \"\"\"\n Given these documents, answer the question.\n Documents:\n {% for doc in documents %}\n {{ doc.content }}\n {% endfor %}\n Question: {{query}}\n Answer:\n \"\"\"\n\nrag_pipe = Pipeline()\nrag_pipe.add_component(\n \"embedder\",\n SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),\n)\nrag_pipe.add_component(\n \"retriever\",\n MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),\n)\nrag_pipe.add_component(\"prompt_builder\", PromptBuilder(template=prompt_template))\nrag_pipe.add_component(\n \"llm\",\n HuggingFaceAPIGenerator(\n api_type=\"serverless_inference_api\",\n api_params={\"model\": GENERATION_MODEL_ID},\n token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,\n ),\n)\nrag_pipe.add_component(\"answer_builder\", AnswerBuilder())\nrag_pipe.connect(\"embedder.embedding\", \"retriever\")\nrag_pipe.connect(\"retriever\", \"prompt_builder.documents\")\nrag_pipe.connect(\"prompt_builder\", \"llm\")\nrag_pipe.connect(\"llm.replies\", \"answer_builder.replies\")\nrag_pipe.connect(\"llm.meta\", \"answer_builder.meta\")\nrag_pipe.connect(\"retriever\", \"answer_builder.documents\")\nrag_res = rag_pipe.run(\n {\n \"embedder\": {\"text\": QUESTION},\n \"prompt_builder\": {\"query\": QUESTION},\n \"answer_builder\": {\"query\": QUESTION},\n }\n)\n</pre> <pre>Batches: 0%| | 0/1 [00:00<?, ?it/s]</pre> <pre>/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead.\n warnings.warn(\n</pre> <p>Below we print out the RAG results. If you have used <code>ExportType.DOC_CHUNKS</code>, notice how the sources contain document-level grounding (e.g.
page number or bounding box information):</p> In\u00a0[5]: Copied! <pre>from docling.chunking import DocChunk\n\nprint(f\"Question:\\n{QUESTION}\\n\")\nprint(f\"Answer:\\n{rag_res['answer_builder']['answers'][0].data.strip()}\\n\")\nprint(\"Sources:\")\nsources = rag_res[\"answer_builder\"][\"answers\"][0].documents\nfor source in sources:\n if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n doc_chunk = DocChunk.model_validate(source.meta[\"dl_meta\"])\n print(f\"- text: {doc_chunk.text!r}\")\n if doc_chunk.meta.origin:\n print(f\" file: {doc_chunk.meta.origin.filename}\")\n if doc_chunk.meta.headings:\n print(f\" section: {' / '.join(doc_chunk.meta.headings)}\")\n bbox = doc_chunk.meta.doc_items[0].prov[0].bbox\n print(\n f\" page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, \"\n f\"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]\"\n )\n elif EXPORT_TYPE == ExportType.MARKDOWN:\n print(repr(source.content))\n else:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\n</pre> <pre>Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nThe main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, while TableFormer is a state-of-the-art table structure recognition model. These models are provided with pre-trained weights and a separate package for the inference code as docling-ibm-models. They are also used in the open-access deepsearch-experience, a cloud-native service for knowledge exploration tasks. Additionally, Docling plans to extend its model library with a figure-classifier model, an equation-recognition model, a code-recognition model, and more in the future.\n\nSources:\n- text: 'As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.'\n file: 2408.09869v5.pdf\n section: 3.2 AI models\n page: 3, bounding box: [107, 406, 504, 330]\n- text: 'Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1).
Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.'\n file: 2408.09869v5.pdf\n section: 3 Processing pipeline\n page: 2, bounding box: [107, 273, 504, 176]\n- text: 'Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.\\nWe encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.'\n section: 6 Future work and contributions\n page: 5, bounding box: [106, 323, 504, 258]\n</pre> In\u00a0[\u00a0]: Copied! 
<pre>\n</pre>"},{"location":"examples/rag_haystack/#rag-with-haystack","title":"RAG with Haystack\u00b6","text":""},{"location":"examples/rag_haystack/#overview","title":"Overview\u00b6","text":""},{"location":"examples/rag_haystack/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_haystack/#indexing-pipeline","title":"Indexing pipeline\u00b6","text":""},{"location":"examples/rag_haystack/#rag-pipeline","title":"RAG pipeline\u00b6","text":""},{"location":"examples/rag_langchain/","title":"RAG with LangChain","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote <p>This example leverages the LangChain Docling integration, along with a Milvus vector store, as well as sentence-transformers embeddings.</p> <p>The presented <code>DoclingLoader</code> component enables you to:</p> <ul> <li>use various document types in your LLM applications with ease and speed, and</li> <li>leverage Docling's rich format for advanced, document-native grounding.</li> </ul> <p><code>DoclingLoader</code> supports two different export modes:</p> <ul> <li><code>ExportType.MARKDOWN</code>: if you want to capture each input document as a separate LangChain document, or</li> <li><code>ExportType.DOC_CHUNKS</code> (default): if you want to have each input document chunked and then capture each individual chunk as a separate LangChain document downstream.</li> </ul> <p>The example allows exploring both modes via the <code>EXPORT_TYPE</code> parameter; depending on the value set, the example pipeline is then set up accordingly.</p> <ul> <li>\ud83d\udc49 For best conversion speed, use GPU acceleration whenever available; e.g., if running on Colab, use a GPU-enabled runtime.</li> <li>The notebook uses Hugging Face's Inference API; for an increased LLM quota, a token can be provided via the env var <code>HF_TOKEN</code>.</li> <li>Requirements can be installed as shown below (<code>--no-warn-conflicts</code> is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):</li> </ul> In\u00a0[1]: Copied! <pre>%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv\n</pre> <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> In\u00a0[2]: Copied!
<pre>import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom dotenv import load_dotenv\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_docling.loader import ExportType\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nFILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nEXPORT_TYPE = ExportType.DOC_CHUNKS\nQUESTION = \"Which are the main AI models in Docling?\"\nPROMPT = PromptTemplate.from_template(\n \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n)\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n</pre> In\u00a0[3]: Copied! <pre>from langchain_docling import DoclingLoader\n\nfrom docling.chunking import HybridChunker\n\nloader = DoclingLoader(\n file_path=FILE_PATH,\n export_type=EXPORT_TYPE,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n)\n\ndocs = loader.load()\n</pre> <pre>Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors\n</pre> <p>Note: a message saying <code>\"Token indices sequence length is longer than the specified maximum sequence length...\"</code> can be ignored in this case \u2014 details here.</p>
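<p>Before splitting, it can be useful to peek at what <code>DoclingLoader</code> returned; a minimal sketch (LangChain documents expose <code>page_content</code> and <code>metadata</code>):</p> <pre># Show a short preview of the first loaded document/chunk and its source\nprint(docs[0].page_content[:120])\nprint(docs[0].metadata.get(\"source\"))\n</pre> <p>Determining the splits:</p> In\u00a0[4]: Copied!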
<pre>if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n splits = docs\nelif EXPORT_TYPE == ExportType.MARKDOWN:\n from langchain_text_splitters import MarkdownHeaderTextSplitter\n\n splitter = MarkdownHeaderTextSplitter(\n headers_to_split_on=[\n (\"#\", \"Header_1\"),\n (\"##\", \"Header_2\"),\n (\"###\", \"Header_3\"),\n ],\n )\n splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]\nelse:\n raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")\n</pre> <p>Inspecting some sample splits:</p> In\u00a0[5]: Copied! <pre>for d in splits[:3]:\n print(f\"- {d.page_content=}\")\nprint(\"...\")\n</pre> <pre>- d.page_content='arXiv:2408.09869v5 [cs.CL] 9 Dec 2024'\n- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R\u00a8uschlikon, Switzerland'\n- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n...\n</pre> In\u00a0[6]: Copied! <pre>import json\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom langchain_huggingface.embeddings import HuggingFaceEmbeddings\nfrom langchain_milvus import Milvus\n\nembedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n\n\nmilvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\nvectorstore = Milvus.from_documents(\n documents=splits,\n embedding=embedding,\n collection_name=\"docling_demo\",\n connection_args={\"uri\": milvus_uri},\n index_params={\"index_type\": \"FLAT\"},\n drop_old=True,\n)\n</pre>
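<p>At this point the vector store can already be queried on its own; a minimal retrieval smoke test before assembling the chain, reusing <code>QUESTION</code> and <code>TOP_K</code> from above:</p> <pre># Retrieve the top matches for the question directly from Milvus\nfor doc in vectorstore.similarity_search(QUESTION, k=TOP_K):\n print(doc.page_content[:100])\n</pre> In\u00a0[7]: Copied!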
<pre>from langchain.chains import create_retrieval_chain\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain_huggingface import HuggingFaceEndpoint\n\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\nllm = HuggingFaceEndpoint(\n repo_id=GEN_MODEL_ID,\n huggingfacehub_api_token=HF_TOKEN,\n)\n\n\ndef clip_text(text, threshold=100):\n return f\"{text[:threshold]}...\" if len(text) > threshold else text\n</pre> <pre>Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\n</pre> In\u00a0[8]: Copied! <pre>question_answer_chain = create_stuff_documents_chain(llm, PROMPT)\nrag_chain = create_retrieval_chain(retriever, question_answer_chain)\nresp_dict = rag_chain.invoke({\"input\": QUESTION})\n\nclipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\nprint(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\nfor i, doc in enumerate(resp_dict[\"context\"]):\n print()\n print(f\"Source {i + 1}:\")\n print(f\" text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n for key in doc.metadata:\n if key != \"pk\":\n val = doc.metadata.get(key)\n clipped_val = clip_text(val) if isinstance(val, str) else val\n print(f\" {key}: {clipped_val}\")\n</pre> <pre>Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nDocling initially releases two AI models, a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art tab...\n\nSource 1:\n text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13].
The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n\nSource 2:\n text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n\nSource 3:\n text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n source: https://arxiv.org/pdf/2408.09869\n</pre> In\u00a0[\u00a0]: Copied! 
<pre>\n</pre>"},{"location":"examples/rag_langchain/#rag-with-langchain","title":"RAG with LangChain\u00b6","text":""},{"location":"examples/rag_langchain/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_langchain/#document-loading","title":"Document loading\u00b6","text":"<p>Now we can instantiate our loader and load documents.</p>"},{"location":"examples/rag_langchain/#ingestion","title":"Ingestion\u00b6","text":""},{"location":"examples/rag_langchain/#rag","title":"RAG\u00b6","text":""},{"location":"examples/rag_llamaindex/","title":"RAG with LlamaIndex","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote <p>This example leverages the official LlamaIndex Docling extension.</p> <p>The presented extensions <code>DoclingReader</code> and <code>DoclingNodeParser</code> enable you to:</p> <ul> <li>use various document types in your LLM applications with ease and speed, and</li> <li>leverage Docling's rich format for advanced, document-native grounding.</li> </ul> <ul> <li>\ud83d\udc49 For best conversion speed, use GPU acceleration whenever available; e.g., if running on Colab, use a GPU-enabled runtime.</li> <li>The notebook uses Hugging Face's Inference API; for an increased LLM quota, a token can be provided via the env var <code>HF_TOKEN</code>.</li> <li>Requirements can be installed as shown below (<code>--no-warn-conflicts</code> is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):</li> </ul> In\u00a0[1]: Copied! <pre>%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv\n</pre> <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> In\u00a0[2]: Copied!
<pre>import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\nfrom warnings import filterwarnings\n\nfrom dotenv import load_dotenv\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\nfilterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")\nfilterwarnings(action=\"ignore\", category=FutureWarning, module=\"easyocr\")\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n</pre> <p>We can now define the main parameters:</p> In\u00a0[3]: Copied! <pre>from llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n\nEMBED_MODEL = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\nGEN_MODEL = HuggingFaceInferenceAPI(\n token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n)\nSOURCE = \"https://arxiv.org/pdf/2408.09869\" # Docling Technical Report\nQUERY = \"Which are the main AI models in Docling?\"\n\nembed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n</pre> <p>To create a simple RAG pipeline, we can:</p> <ul> <li>define a <code>DoclingReader</code>, which by default exports to Markdown, and</li> <li>use a standard node parser for these Markdown-based docs, e.g. a <code>MarkdownNodeParser</code></li> </ul>
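<p>For instance, to peek at the reader's Markdown output on its own before building the index, a minimal sketch (reusing the <code>SOURCE</code> defined above):</p> <pre>from llama_index.readers.docling import DoclingReader\n\n# Load the source and preview the beginning of the exported Markdown\npreview_docs = DoclingReader().load_data(SOURCE)\nprint(preview_docs[0].text[:200])\n</pre> In\u00a0[4]: Copied!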
<pre>from llama_index.core import StorageContext, VectorStoreIndex\nfrom llama_index.core.node_parser import MarkdownNodeParser\nfrom llama_index.readers.docling import DoclingReader\nfrom llama_index.vector_stores.milvus import MilvusVectorStore\n\nreader = DoclingReader()\nnode_parser = MarkdownNodeParser()\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n</pre> from llama_index.core import StorageContext, VectorStoreIndex from llama_index.core.node_parser import MarkdownNodeParser from llama_index.readers.docling import DoclingReader from llama_index.vector_stores.milvus import MilvusVectorStore reader = DoclingReader() node_parser = MarkdownNodeParser() vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) <pre>Q: Which are the main AI models in Docling?\nA: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.\n\nSources:\n</pre> <pre>[('3.2 AI models\\n\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'Header_2': '3.2 AI models'}),\n (\"5 Applications\\n\\nThanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. 
With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.\",\n {'Header_2': '5 Applications'})]</pre> <p>To leverage Docling's rich native format, we:</p> <ul> <li>create a <code>DoclingReader</code> with JSON export type, and</li> <li>employ a <code>DoclingNodeParser</code> in order to appropriately parse that Docling format.</li> </ul> <p>Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):</p> In\u00a0[5]: Copied! <pre>from llama_index.node_parser.docling import DoclingNodeParser\n\nreader = DoclingReader(export_type=DoclingReader.ExportType.JSON)\nnode_parser = DoclingNodeParser()\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=reader.load_data(SOURCE),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n</pre> from llama_index.node_parser.docling import DoclingNodeParser reader = DoclingReader(export_type=DoclingReader.ExportType.JSON) node_parser = DoclingNodeParser() vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=reader.load_data(SOURCE), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) <pre>Q: Which are the main AI models in Docling?\nA: The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table structure recognition model.\n\nSources:\n</pre> <pre>[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . 
Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/34',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 3,\n 'bbox': {'l': 107.07593536376953,\n 't': 406.1695251464844,\n 'r': 504.1148681640625,\n 'b': 330.2677307128906,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 608]}]}],\n 'headings': ['3.2 AI models'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869v3.pdf'}}),\n ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n {'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/9',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 1,\n 'bbox': {'l': 107.0031967163086,\n 't': 136.7283935546875,\n 'r': 504.04998779296875,\n 'b': 83.30133056640625,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 488]}]}],\n 'headings': ['1 Introduction'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869v3.pdf'}})]</pre> <p>To demonstrate this usage pattern, we first set up a test document directory.</p> In\u00a0[6]: Copied! <pre>from pathlib import Path\nfrom tempfile import mkdtemp\n\nimport requests\n\ntmp_dir_path = Path(mkdtemp())\nr = requests.get(SOURCE)\nwith open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file:\n out_file.write(r.content)\n</pre> from pathlib import Path from tempfile import mkdtemp import requests tmp_dir_path = Path(mkdtemp()) r = requests.get(SOURCE) with open(tmp_dir_path / f\"{Path(SOURCE).name}.pdf\", \"wb\") as out_file: out_file.write(r.content) <p>Using the <code>reader</code> and <code>node_parser</code> definitions from any of the above variants, usage with <code>SimpleDirectoryReader</code> then looks as follows:</p> In\u00a0[7]: Copied! 
<pre>from llama_index.core import SimpleDirectoryReader\n\ndir_reader = SimpleDirectoryReader(\n input_dir=tmp_dir_path,\n file_extractor={\".pdf\": reader},\n)\n\nvector_store = MilvusVectorStore(\n uri=str(Path(mkdtemp()) / \"docling.db\"),  # or set as needed\n dim=embed_dim,\n overwrite=True,\n)\nindex = VectorStoreIndex.from_documents(\n documents=dir_reader.load_data(),\n transformations=[node_parser],\n storage_context=StorageContext.from_defaults(vector_store=vector_store),\n embed_model=EMBED_MODEL,\n)\nresult = index.as_query_engine(llm=GEN_MODEL).query(QUERY)\nprint(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\")\ndisplay([(n.text, n.metadata) for n in result.source_nodes])\n</pre> from llama_index.core import SimpleDirectoryReader dir_reader = SimpleDirectoryReader( input_dir=tmp_dir_path, file_extractor={\".pdf\": reader}, ) vector_store = MilvusVectorStore( uri=str(Path(mkdtemp()) / \"docling.db\"), # or set as needed dim=embed_dim, overwrite=True, ) index = VectorStoreIndex.from_documents( documents=dir_reader.load_data(), transformations=[node_parser], storage_context=StorageContext.from_defaults(vector_store=vector_store), embed_model=EMBED_MODEL, ) result = index.as_query_engine(llm=GEN_MODEL).query(QUERY) print(f\"Q: {QUERY}\\nA: {result.response.strip()}\\n\\nSources:\") display([(n.text, n.metadata) for n in result.source_nodes]) <pre>Loading files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:11<00:00, 11.27s/file]\n</pre> <pre>Q: Which are the main AI models in Docling?\nA: 1. A layout analysis model, an accurate object-detector for page elements. 2. TableFormer, a state-of-the-art table structure recognition model.\n\nSources:\n</pre> <pre>[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.',\n {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n 'file_name': '2408.09869.pdf',\n 'file_type': 'application/pdf',\n 'file_size': 5566574,\n 'creation_date': '2024-10-28',\n 'last_modified_date': '2024-10-28',\n 'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/34',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 3,\n 'bbox': {'l': 107.07593536376953,\n 't': 406.1695251464844,\n 'r': 504.1148681640625,\n 'b': 330.2677307128906,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 608]}]}],\n 'headings': ['3.2 AI models'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869.pdf'}}),\n ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. 
Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.',\n {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf',\n 'file_name': '2408.09869.pdf',\n 'file_type': 'application/pdf',\n 'file_size': 5566574,\n 'creation_date': '2024-10-28',\n 'last_modified_date': '2024-10-28',\n 'schema_name': 'docling_core.transforms.chunker.DocMeta',\n 'version': '1.0.0',\n 'doc_items': [{'self_ref': '#/texts/9',\n 'parent': {'$ref': '#/body'},\n 'children': [],\n 'label': 'text',\n 'prov': [{'page_no': 1,\n 'bbox': {'l': 107.0031967163086,\n 't': 136.7283935546875,\n 'r': 504.04998779296875,\n 'b': 83.30133056640625,\n 'coord_origin': 'BOTTOMLEFT'},\n 'charspan': [0, 488]}]}],\n 'headings': ['1 Introduction'],\n 'origin': {'mimetype': 'application/pdf',\n 'binary_hash': 14981478401387673002,\n 'filename': '2408.09869.pdf'}})]</pre> In\u00a0[\u00a0]: Copied! <pre>\n</pre>"},{"location":"examples/rag_llamaindex/#rag-with-llamaindex","title":"RAG with LlamaIndex\u00b6","text":""},{"location":"examples/rag_llamaindex/#overview","title":"Overview\u00b6","text":""},{"location":"examples/rag_llamaindex/#setup","title":"Setup\u00b6","text":""},{"location":"examples/rag_llamaindex/#using-markdown-export","title":"Using Markdown export\u00b6","text":""},{"location":"examples/rag_llamaindex/#using-docling-format","title":"Using Docling format\u00b6","text":""},{"location":"examples/rag_llamaindex/#with-simple-directory-reader","title":"With Simple Directory Reader\u00b6","text":""},{"location":"examples/rag_milvus/","title":"RAG with Milvus","text":"In\u00a0[\u00a0]: Copied! <pre>! pip install --upgrade pymilvus docling openai torch\n</pre> ! pip install --upgrade pymilvus docling openai torch <p>If you are using Google Colab, to enable dependencies just installed, you may need to restart the runtime (click on the \"Runtime\" menu at the top of the screen, and select \"Restart session\" from the dropdown menu).</p> <p>Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.</p> <p>The code below checks to see if a GPU is available, either via CUDA or MPS.</p> In\u00a0[1]: Copied! <pre>import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n</pre> import torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. 
Please check your environment and ensure GPU or MPS support is configured.\" ) <pre>MPS GPU is enabled.\n</pre> In\u00a0[2]: Copied! <pre>import os\n\nos.environ[\"OPENAI_API_KEY\"] = \"sk-***********\"\n</pre> import os os.environ[\"OPENAI_API_KEY\"] = \"sk-***********\" In\u00a0[3]: Copied! <pre>from openai import OpenAI\n\nopenai_client = OpenAI()\n</pre> from openai import OpenAI openai_client = OpenAI() <p>Define a function to generate text embeddings using the OpenAI client. We use the text-embedding-3-small model as an example.</p> In\u00a0[4]: Copied! <pre>def emb_text(text):\n return (\n openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\")\n .data[0]\n .embedding\n )\n</pre> def emb_text(text): return ( openai_client.embeddings.create(input=text, model=\"text-embedding-3-small\") .data[0] .embedding ) <p>Generate a test embedding and print its dimension and first few elements.</p> In\u00a0[5]: Copied! <pre>test_embedding = emb_text(\"This is a test\")\nembedding_dim = len(test_embedding)\nprint(embedding_dim)\nprint(test_embedding[:10])\n</pre> test_embedding = emb_text(\"This is a test\") embedding_dim = len(test_embedding) print(embedding_dim) print(test_embedding[:10]) <pre>1536\n[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]\n</pre> <p>In this tutorial, we will use a Markdown file (source) as the input. We will process the document using a HierarchicalChunker provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks.</p> In\u00a0[6]: Copied! <pre>from docling_core.transforms.chunker import HierarchicalChunker\n\nfrom docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\nchunker = HierarchicalChunker()\n\n# Convert the input file to Docling Document\nsource = \"https://milvus.io/docs/overview.md\"\ndoc = converter.convert(source).document\n\n# Perform hierarchical chunking\ntexts = [chunk.text for chunk in chunker.chunk(doc)]\n</pre> from docling_core.transforms.chunker import HierarchicalChunker from docling.document_converter import DocumentConverter converter = DocumentConverter() chunker = HierarchicalChunker() # Convert the input file to Docling Document source = \"https://milvus.io/docs/overview.md\" doc = converter.convert(source).document # Perform hierarchical chunking texts = [chunk.text for chunk in chunker.chunk(doc)] In\u00a0[7]: Copied! <pre>from pymilvus import MilvusClient\n\nmilvus_client = MilvusClient(uri=\"./milvus_demo.db\")\ncollection_name = \"my_rag_collection\"\n</pre> from pymilvus import MilvusClient milvus_client = MilvusClient(uri=\"./milvus_demo.db\") collection_name = \"my_rag_collection\" <p>As for the <code>uri</code> argument of <code>MilvusClient</code>:</p> <ul> <li>Setting the <code>uri</code> as a local file, e.g. <code>./milvus.db</code>, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.</li> <li>If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, please use the server URI, e.g. <code>http://localhost:19530</code>, as your <code>uri</code>.</li> <li>If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the <code>uri</code> and <code>token</code>, which correspond to the Public Endpoint and API key in Zilliz Cloud.</li> </ul>
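<p>For illustration, the three variants map onto <code>MilvusClient</code> as in the following minimal sketch (the server address, endpoint, and API key are placeholders, not values from this tutorial):</p> <pre>from pymilvus import MilvusClient\n\n# Milvus Lite: all data is stored in a local file (the variant used in this tutorial)\nclient = MilvusClient(uri=\"./milvus_demo.db\")\n\n# Self-hosted Milvus server, e.g. deployed via Docker or Kubernetes:\n# client = MilvusClient(uri=\"http://localhost:19530\")\n\n# Zilliz Cloud; YOUR_PUBLIC_ENDPOINT and YOUR_API_KEY are placeholders:\n# client = MilvusClient(uri=\"YOUR_PUBLIC_ENDPOINT\", token=\"YOUR_API_KEY\")\n</pre> from pymilvus import MilvusClient # Milvus Lite: all data is stored in a local file (the variant used in this tutorial) client = MilvusClient(uri=\"./milvus_demo.db\") # Self-hosted Milvus server, e.g. deployed via Docker or Kubernetes: # client = MilvusClient(uri=\"http://localhost:19530\") # Zilliz Cloud; YOUR_PUBLIC_ENDPOINT and YOUR_API_KEY are placeholders: # client = MilvusClient(uri=\"YOUR_PUBLIC_ENDPOINT\", token=\"YOUR_API_KEY\")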
<p>Check if the collection already exists and drop it if it does.</p> In\u00a0[8]: Copied! <pre>if milvus_client.has_collection(collection_name):\n milvus_client.drop_collection(collection_name)\n</pre> if milvus_client.has_collection(collection_name): milvus_client.drop_collection(collection_name) <p>Create a new collection with the specified parameters.</p> <p>If we don\u2019t specify any field information, Milvus will automatically create a default <code>id</code> field for the primary key, and a <code>vector</code> field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.</p> In\u00a0[9]: Copied! <pre>milvus_client.create_collection(\n collection_name=collection_name,\n dimension=embedding_dim,\n metric_type=\"IP\", # Inner product distance\n consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.\n)\n</pre> milvus_client.create_collection( collection_name=collection_name, dimension=embedding_dim, metric_type=\"IP\", # Inner product distance consistency_level=\"Strong\", # Supported values are (`\"Strong\"`, `\"Session\"`, `\"Bounded\"`, `\"Eventually\"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details. ) In\u00a0[10]: Copied! <pre>from tqdm import tqdm\n\ndata = []\n\nfor i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")):\n embedding = emb_text(chunk)\n data.append({\"id\": i, \"vector\": embedding, \"text\": chunk})\n\nmilvus_client.insert(collection_name=collection_name, data=data)\n</pre> from tqdm import tqdm data = [] for i, chunk in enumerate(tqdm(texts, desc=\"Processing chunks\")): embedding = emb_text(chunk) data.append({\"id\": i, \"vector\": embedding, \"text\": chunk}) milvus_client.insert(collection_name=collection_name, data=data) <pre>Processing chunks: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 38/38 [00:14<00:00, 2.59it/s]\n</pre> Out[10]: <pre>{'insert_count': 38, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], 'cost': 0}</pre> In\u00a0[11]: Copied! <pre>question = (\n \"What are the three deployment modes of Milvus, and what are their differences?\"\n)\n</pre> question = ( \"What are the three deployment modes of Milvus, and what are their differences?\" ) <p>Search for the question in the collection and retrieve the semantic top-3 matches.</p> In\u00a0[12]: Copied! <pre>search_res = milvus_client.search(\n collection_name=collection_name,\n data=[emb_text(question)],\n limit=3,\n search_params={\"metric_type\": \"IP\", \"params\": {}},\n output_fields=[\"text\"],\n)\n</pre> search_res = milvus_client.search( collection_name=collection_name, data=[emb_text(question)], limit=3, search_params={\"metric_type\": \"IP\", \"params\": {}}, output_fields=[\"text\"], ) <p>Let\u2019s take a look at the search results of the query.</p> In\u00a0[13]: Copied! 
<pre>import json\n\nretrieved_lines_with_distances = [\n (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n]\nprint(json.dumps(retrieved_lines_with_distances, indent=4))\n</pre> import json retrieved_lines_with_distances = [ (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0] ] print(json.dumps(retrieved_lines_with_distances, indent=4)) <pre>[\n [\n \"Milvus offers three deployment modes, covering a wide range of data scales\\u2014from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors:\",\n 0.6503315567970276\n ],\n [\n \"Milvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it\\u2019s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\\nMilvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\\nMilvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more.\",\n 0.6281915903091431\n ],\n [\n \"What is Milvus?\\nUnstructured Data, Embeddings, and Milvus\\nWhat Makes Milvus so Fast\\uff1f\\nWhat Makes Milvus so Scalable\\nTypes of Searches Supported by Milvus\\nComprehensive Feature Set\",\n 0.6117826700210571\n ]\n]\n</pre> In\u00a0[14]: Copied! <pre>context = \"\\n\".join(\n [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n)\n</pre> context = \"\\n\".join( [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances] ) <p>Define system and user prompts for the Language Model. This prompt is assembled with the retrieved documents from Milvus.</p> In\u00a0[16]: Copied! <pre>SYSTEM_PROMPT = \"\"\"\nHuman: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.\n\"\"\"\nUSER_PROMPT = f\"\"\"\nUse the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n<context>\n{context}\n</context>\n<question>\n{question}\n</question>\n\"\"\"\n</pre> SYSTEM_PROMPT = \"\"\" Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided. \"\"\" USER_PROMPT = f\"\"\" Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags. <context> {context} </context> <question> {question} </question> \"\"\" <p>Use the OpenAI chat completions API to generate a response based on the prompts.</p> In\u00a0[17]: Copied! <pre>response = openai_client.chat.completions.create(\n model=\"gpt-4o\",\n messages=[\n {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n {\"role\": \"user\", \"content\": USER_PROMPT},\n ],\n)\nprint(response.choices[0].message.content)\n</pre> response = openai_client.chat.completions.create( model=\"gpt-4o\", messages=[ {\"role\": \"system\", \"content\": SYSTEM_PROMPT}, {\"role\": \"user\", \"content\": USER_PROMPT}, ], ) print(response.choices[0].message.content) <pre>The three deployment modes of Milvus are:\n\n1. **Milvus Lite**: This is a Python library that integrates easily into your applications. It's a lightweight version ideal for quick prototyping in Jupyter Notebooks or for running on edge devices with limited resources.\n\n2. 
**Milvus Standalone**: This mode is a single-machine server deployment where all components are bundled into a single Docker image, making it convenient to deploy.\n\n3. **Milvus Distributed**: This mode is designed for deployment on Kubernetes clusters. It features a cloud-native architecture suited for managing scenarios at a billion-scale or larger, ensuring redundancy in critical components.\n</pre>"},{"location":"examples/rag_milvus/#rag-with-milvus","title":"RAG with Milvus\u00b6","text":"Step Tech Execution Embedding OpenAI (text-embedding-3-small) \ud83c\udf10 Remote Vector store Milvus \ud83d\udcbb Local Gen AI OpenAI (gpt-4o) \ud83c\udf10 Remote"},{"location":"examples/rag_milvus/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"<p>This is a code recipe that uses Milvus, the world's most advanced open-source vector database, to perform RAG over documents parsed by Docling.</p> <p>In this notebook, we accomplish the following:</p> <ul> <li>Parse documents using Docling's document conversion capabilities</li> <li>Perform hierarchical chunking of the documents using Docling</li> <li>Generate text embeddings with OpenAI</li> <li>Perform RAG using Milvus, the world's most advanced open-source vector database</li> </ul> <p>Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running this notebook:</p> <ol> <li>Locally on a MacBook with an Apple Silicon chip. Converting all documents in the notebook takes ~2 minutes on a MacBook M2 due to Docling's usage of MPS accelerators.</li> <li>Run this notebook on Google Colab. Converting all documents in the notebook takes ~8 minutes on a Google Colab T4 GPU.</li> </ol>"},{"location":"examples/rag_milvus/#preparation","title":"Preparation\u00b6","text":""},{"location":"examples/rag_milvus/#dependencies-and-environment","title":"Dependencies and Environment\u00b6","text":"<p>To start, install the required dependencies by running the following command:</p>"},{"location":"examples/rag_milvus/#gpu-checking","title":"GPU Checking\u00b6","text":""},{"location":"examples/rag_milvus/#setting-up-api-keys","title":"Setting Up API Keys\u00b6","text":"<p>We will use OpenAI as the LLM in this example. You should prepare the OPENAI_API_KEY as an environment variable.</p>"},{"location":"examples/rag_milvus/#prepare-the-llm-and-embedding-model","title":"Prepare the LLM and Embedding Model\u00b6","text":"<p>We initialize the OpenAI client to prepare the embedding model.</p>"},{"location":"examples/rag_milvus/#process-data-using-docling","title":"Process Data Using Docling\u00b6","text":"<p>Docling can parse various document formats into a unified representation (Docling Document), which can then be exported to different output formats. 
For a full list of supported input and output formats, please refer to the official documentation.</p>"},{"location":"examples/rag_milvus/#load-data-into-milvus","title":"Load Data into Milvus\u00b6","text":""},{"location":"examples/rag_milvus/#create-the-collection","title":"Create the collection\u00b6","text":"<p>With data in hand, we can create a <code>MilvusClient</code> instance and insert the data into a Milvus collection.</p>"},{"location":"examples/rag_milvus/#insert-data","title":"Insert data\u00b6","text":""},{"location":"examples/rag_milvus/#build-rag","title":"Build RAG\u00b6","text":""},{"location":"examples/rag_milvus/#retrieve-data-for-a-query","title":"Retrieve data for a query\u00b6","text":"<p>Let\u2019s specify a query question about the website we just scraped.</p>"},{"location":"examples/rag_milvus/#use-llm-to-get-a-rag-response","title":"Use LLM to get a RAG response\u00b6","text":"<p>Convert the retrieved documents into a string format.</p>"},{"location":"examples/rag_weaviate/","title":"RAG with Weaviate","text":"Step Tech Execution Embedding OpenAI \ud83c\udf10 Remote Vector store Weaviate \ud83d\udcbb Local Gen AI OpenAI \ud83c\udf10 Remote In\u00a0[\u00a0]: Copied! <pre>%%capture\n%pip install docling~=\"2.7.0\"\n%pip install -U weaviate-client~=\"4.9.4\"\n%pip install rich\n%pip install torch\n\nimport logging\nimport warnings\n\nwarnings.filterwarnings(\"ignore\")\n\n# Suppress Weaviate client logs\nlogging.getLogger(\"weaviate\").setLevel(logging.ERROR)\n</pre> %%capture %pip install docling~=\"2.7.0\" %pip install -U weaviate-client~=\"4.9.4\" %pip install rich %pip install torch import logging import warnings warnings.filterwarnings(\"ignore\") # Suppress Weaviate client logs logging.getLogger(\"weaviate\").setLevel(logging.ERROR) In\u00a0[2]: Copied! <pre>import torch\n\n# Check if GPU or MPS is available\nif torch.cuda.is_available():\n device = torch.device(\"cuda\")\n print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\")\nelif torch.backends.mps.is_available():\n device = torch.device(\"mps\")\n print(\"MPS GPU is enabled.\")\nelse:\n raise OSError(\n \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\"\n )\n</pre> import torch # Check if GPU or MPS is available if torch.cuda.is_available(): device = torch.device(\"cuda\") print(f\"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}\") elif torch.backends.mps.is_available(): device = torch.device(\"mps\") print(\"MPS GPU is enabled.\") else: raise OSError( \"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured.\" ) <pre>MPS GPU is enabled.\n</pre> <p>Here, we've collected 10 influential machine learning papers published as PDFs on arXiv. Because Docling does not yet have title extraction for PDFs, we manually add the titles in a corresponding list.</p> <p>Note: Converting all 10 papers should take around 8 minutes with a T4 GPU.</p> In\u00a0[3]: Copied! 
<pre># Influential machine learning papers\nsource_urls = [\n \"https://arxiv.org/pdf/1706.03762\",\n \"https://arxiv.org/pdf/1810.04805\",\n \"https://arxiv.org/pdf/1406.2661\",\n \"https://arxiv.org/pdf/1409.0473\",\n \"https://arxiv.org/pdf/1412.6980\",\n \"https://arxiv.org/pdf/1312.6114\",\n \"https://arxiv.org/pdf/1312.5602\",\n \"https://arxiv.org/pdf/1512.03385\",\n \"https://arxiv.org/pdf/1409.3215\",\n \"https://arxiv.org/pdf/1301.3781\",\n]\n\n# And their corresponding titles (because Docling doesn't have title extraction yet!)\nsource_titles = [\n \"Attention Is All You Need\",\n \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\",\n \"Generative Adversarial Nets\",\n \"Neural Machine Translation by Jointly Learning to Align and Translate\",\n \"Adam: A Method for Stochastic Optimization\",\n \"Auto-Encoding Variational Bayes\",\n \"Playing Atari with Deep Reinforcement Learning\",\n \"Deep Residual Learning for Image Recognition\",\n \"Sequence to Sequence Learning with Neural Networks\",\n \"Efficient Estimation of Word Representations in Vector Space\",\n]\n</pre> # Influential machine learning papers source_urls = [ \"https://arxiv.org/pdf/1706.03762\", \"https://arxiv.org/pdf/1810.04805\", \"https://arxiv.org/pdf/1406.2661\", \"https://arxiv.org/pdf/1409.0473\", \"https://arxiv.org/pdf/1412.6980\", \"https://arxiv.org/pdf/1312.6114\", \"https://arxiv.org/pdf/1312.5602\", \"https://arxiv.org/pdf/1512.03385\", \"https://arxiv.org/pdf/1409.3215\", \"https://arxiv.org/pdf/1301.3781\", ] # And their corresponding titles (because Docling doesn't have title extraction yet!) source_titles = [ \"Attention Is All You Need\", \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\", \"Generative Adversarial Nets\", \"Neural Machine Translation by Jointly Learning to Align and Translate\", \"Adam: A Method for Stochastic Optimization\", \"Auto-Encoding Variational Bayes\", \"Playing Atari with Deep Reinforcement Learning\", \"Deep Residual Learning for Image Recognition\", \"Sequence to Sequence Learning with Neural Networks\", \"Efficient Estimation of Word Representations in Vector Space\", ] In\u00a0[4]: Copied! <pre>from docling.document_converter import DocumentConverter\n\n# Instantiate the doc converter\ndoc_converter = DocumentConverter()\n\n# Directly pass list of files or streams to `convert_all`\nconv_results_iter = doc_converter.convert_all(source_urls) # previously `convert`\n\n# Iterate over the generator to get a list of Docling documents\ndocs = [result.document for result in conv_results_iter]\n</pre> from docling.document_converter import DocumentConverter # Instantiate the doc converter doc_converter = DocumentConverter() # Directly pass list of files or streams to `convert_all` conv_results_iter = doc_converter.convert_all(source_urls) # previously `convert` # Iterate over the generator to get a list of Docling documents docs = [result.document for result in conv_results_iter] <pre>Fetching 9 files: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 9/9 [00:00<00:00, 84072.91it/s]\n</pre> <pre>ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS\n</pre>
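<p>Optionally, as a quick sanity check, one could preview the beginning of the first converted document's Markdown export before chunking:</p> <pre># Optional sanity check: peek at the first converted document as Markdown\nprint(docs[0].export_to_markdown()[:300])\n</pre> # Optional sanity check: peek at the first converted document as Markdown print(docs[0].export_to_markdown()[:300])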
In\u00a0[5]: Copied! <pre>from docling_core.transforms.chunker import HierarchicalChunker\n\n# Initialize lists for text and titles\ntexts, titles = [], []\n\nchunker = HierarchicalChunker()\n\n# Process each document in the list\nfor doc, title in zip(docs, source_titles): # Pair each document with its title\n chunks = list(\n chunker.chunk(doc)\n ) # Perform hierarchical chunking and get text from chunks\n for chunk in chunks:\n texts.append(chunk.text)\n titles.append(title)\n</pre> from docling_core.transforms.chunker import HierarchicalChunker # Initialize lists for text and titles texts, titles = [], [] chunker = HierarchicalChunker() # Process each document in the list for doc, title in zip(docs, source_titles): # Pair each document with its title chunks = list( chunker.chunk(doc) ) # Perform hierarchical chunking and get text from chunks for chunk in chunks: texts.append(chunk.text) titles.append(title) <p>Because we're splitting the documents into chunks, we'll concatenate the article title to the beginning of each chunk for additional context.</p> In\u00a0[6]: Copied! <pre># Concatenate title and text\nfor i in range(len(texts)):\n texts[i] = f\"{titles[i]} {texts[i]}\"\n</pre> # Concatenate title and text for i in range(len(texts)): texts[i] = f\"{titles[i]} {texts[i]}\" <p>We'll be using the OpenAI API both for generating the text embeddings and for the generative model in our RAG pipeline. The code below dynamically fetches your API key based on whether you're running this notebook in Google Colab or as a regular Jupyter notebook. All you need to do is replace <code>openai_api_key_var</code> with the name of your environment variable or Colab secret for the API key.</p> <p>If you're running this notebook in Google Colab, make sure you add your API key as a secret.</p> In\u00a0[7]: Copied! <pre># OpenAI API key variable name\nopenai_api_key_var = \"OPENAI_API_KEY\" # Replace with the name of your secret/env var\n\n# Fetch OpenAI API key\ntry:\n # If running in Colab, fetch API key from Secrets\n import google.colab\n from google.colab import userdata\n\n openai_api_key = userdata.get(openai_api_key_var)\n if not openai_api_key:\n raise ValueError(f\"Secret '{openai_api_key_var}' not found in Colab secrets.\")\nexcept ImportError:\n # If not running in Colab, fetch API key from environment variable\n import os\n\n openai_api_key = os.getenv(openai_api_key_var)\n if not openai_api_key:\n raise OSError(\n f\"Environment variable '{openai_api_key_var}' is not set. \"\n \"Please define it before running this script.\"\n )\n</pre> # OpenAI API key variable name openai_api_key_var = \"OPENAI_API_KEY\" # Replace with the name of your secret/env var # Fetch OpenAI API key try: # If running in Colab, fetch API key from Secrets import google.colab from google.colab import userdata openai_api_key = userdata.get(openai_api_key_var) if not openai_api_key: raise ValueError(f\"Secret '{openai_api_key_var}' not found in Colab secrets.\") except ImportError: # If not running in Colab, fetch API key from environment variable import os openai_api_key = os.getenv(openai_api_key_var) if not openai_api_key: raise OSError( f\"Environment variable '{openai_api_key_var}' is not set. \" \"Please define it before running this script.\" ) <p>Embedded Weaviate allows you to spin up a Weaviate instance directly from your application code, without having to use a Docker container. If you're interested in other deployment methods, like using Docker-Compose or Kubernetes, check out this page in the Weaviate docs.</p>
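<p>For comparison, connecting to a Weaviate instance running locally in Docker instead of the embedded one would look roughly like the following sketch (assuming the default ports and the v4 client's <code>connect_to_local</code> helper; this notebook itself uses the embedded variant shown in the next cell):</p> <pre>import weaviate\n\n# Sketch: connect to a locally running Weaviate (e.g. started via Docker) on default ports\n# client = weaviate.connect_to_local(headers={\"X-OpenAI-Api-Key\": openai_api_key})\n</pre> import weaviate # Sketch: connect to a locally running Weaviate (e.g. started via Docker) on default ports # client = weaviate.connect_to_local(headers={\"X-OpenAI-Api-Key\": openai_api_key})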
In\u00a0[\u00a0]: Copied! <pre>import weaviate\n\n# Connect to Weaviate embedded\nclient = weaviate.connect_to_embedded(headers={\"X-OpenAI-Api-Key\": openai_api_key})\n</pre> import weaviate # Connect to Weaviate embedded client = weaviate.connect_to_embedded(headers={\"X-OpenAI-Api-Key\": openai_api_key}) In\u00a0[\u00a0]: Copied! <pre>import weaviate.classes.config as wc\n\n# Define the collection name\ncollection_name = \"docling\"\n\n# Delete the collection if it already exists\nif client.collections.exists(collection_name):\n client.collections.delete(collection_name)\n\n# Create the collection\ncollection = client.collections.create(\n name=collection_name,\n vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(\n model=\"text-embedding-3-large\", # Specify your embedding model here\n ),\n # Enable generative model from OpenAI\n generative_config=wc.Configure.Generative.openai(\n model=\"gpt-4o\" # Specify your generative model for RAG here\n ),\n # Define properties of metadata\n properties=[\n wc.Property(name=\"text\", data_type=wc.DataType.TEXT),\n wc.Property(name=\"title\", data_type=wc.DataType.TEXT, skip_vectorization=True),\n ],\n)\n</pre> import weaviate.classes.config as wc # Define the collection name collection_name = \"docling\" # Delete the collection if it already exists if client.collections.exists(collection_name): client.collections.delete(collection_name) # Create the collection collection = client.collections.create( name=collection_name, vectorizer_config=wc.Configure.Vectorizer.text2vec_openai( model=\"text-embedding-3-large\", # Specify your embedding model here ), # Enable generative model from OpenAI generative_config=wc.Configure.Generative.openai( model=\"gpt-4o\" # Specify your generative model for RAG here ), # Define properties of metadata properties=[ wc.Property(name=\"text\", data_type=wc.DataType.TEXT), wc.Property(name=\"title\", data_type=wc.DataType.TEXT, skip_vectorization=True), ], ) In\u00a0[10]: Copied! <pre># Initialize the data object\ndata = []\n\n# Create a dictionary for each row by iterating through the corresponding lists\nfor text, title in zip(texts, titles):\n data_point = {\n \"text\": text,\n \"title\": title,\n }\n data.append(data_point)\n</pre> # Initialize the data object data = [] # Create a dictionary for each row by iterating through the corresponding lists for text, title in zip(texts, titles): data_point = { \"text\": text, \"title\": title, } data.append(data_point) In\u00a0[\u00a0]: Copied! <pre># Insert text chunks and metadata into vector DB collection\nresponse = collection.data.insert_many(data)\n\nif response.has_errors:\n print(response.errors)\nelse:\n print(\"Insert complete.\")\n</pre> # Insert text chunks and metadata into vector DB collection response = collection.data.insert_many(data) if response.has_errors: print(response.errors) else: print(\"Insert complete.\") In\u00a0[12]: Copied! 
<pre>from weaviate.classes.query import MetadataQuery\n\nresponse = collection.query.near_text(\n query=\"bert\",\n limit=2,\n return_metadata=MetadataQuery(distance=True),\n return_properties=[\"text\", \"title\"],\n)\n\nfor o in response.objects:\n print(o.properties)\n print(o.metadata.distance)\n</pre> from weaviate.classes.query import MetadataQuery response = collection.query.near_text( query=\"bert\", limit=2, return_metadata=MetadataQuery(distance=True), return_properties=[\"text\", \"title\"], ) for o in response.objects: print(o.properties) print(o.metadata.distance) <pre>{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding A distinctive feature of BERT is its unified architecture across different tasks. There is mini-', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}\n0.6578550338745117\n{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding We introduce a new language representation model called BERT , which stands for B idirectional E ncoder R epresentations from T ransformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}\n0.6696287989616394\n</pre> In\u00a0[13]: Copied! 
<pre>from rich.console import Console\nfrom rich.panel import Panel\n\n# Create a prompt where context from the Weaviate collection will be injected\nprompt = \"Explain how {text} works, using only the retrieved context.\"\nquery = \"bert\"\n\nresponse = collection.generate.near_text(\n query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"]\n)\n\n# Prettify the output using Rich\nconsole = Console()\n\nconsole.print(\n Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\")\n)\nconsole.print(\n Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\")\n)\n</pre> from rich.console import Console from rich.panel import Panel # Create a prompt where context from the Weaviate collection will be injected prompt = \"Explain how {text} works, using only the retrieved context.\" query = \"bert\" response = collection.generate.near_text( query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"] ) # Prettify the output using Rich console = Console() console.print( Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\") ) console.print( Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\") ) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Explain how bert works, using only the retrieved context. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation \u2502\n\u2502 model designed to pretrain deep bidirectional representations from unlabeled text. 
It conditions on both left \u2502\n\u2502 and right context in all layers, unlike traditional left-to-right or right-to-left language models. This \u2502\n\u2502 pre-training involves two unsupervised tasks. The pre-trained BERT model can then be fine-tuned with just one \u2502\n\u2502 additional output layer to create state-of-the-art models for various tasks, such as question answering and \u2502\n\u2502 language inference, without needing substantial task-specific architecture modifications. A distinctive feature \u2502\n\u2502 of BERT is its unified architecture across different tasks. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> In\u00a0[14]: Copied! <pre># Create a prompt where context from the Weaviate collection will be injected\nprompt = \"Explain how {text} works, using only the retrieved context.\"\nquery = \"a generative adversarial net\"\n\nresponse = collection.generate.near_text(\n query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"]\n)\n\n# Prettify the output using Rich\nconsole = Console()\n\nconsole.print(\n Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\")\n)\nconsole.print(\n Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\")\n)\n</pre> # Create a prompt where context from the Weaviate collection will be injected prompt = \"Explain how {text} works, using only the retrieved context.\" query = \"a generative adversarial net\" response = collection.generate.near_text( query=query, limit=3, grouped_task=prompt, return_properties=[\"text\", \"title\"] ) # Prettify the output using Rich console = Console() console.print( Panel(f\"{prompt}\".replace(\"{text}\", query), title=\"Prompt\", border_style=\"bold red\") ) console.print( Panel(response.generated, title=\"Generated Content\", border_style=\"bold green\") ) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Prompt \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Explain how a generative adversarial net works, using only the retrieved context. 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Generated Content \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Generative Adversarial Nets (GANs) operate within an adversarial framework where two models are trained \u2502\n\u2502 simultaneously: a generative model (G) and a discriminative model (D). The generative model aims to capture the \u2502\n\u2502 data distribution and generate samples that mimic real data, while the discriminative model's task is to \u2502\n\u2502 distinguish between samples from the real data and those generated by G. This setup is akin to a game where the \u2502\n\u2502 generative model acts like counterfeiters trying to produce indistinguishable fake currency, and the \u2502\n\u2502 discriminative model acts like the police trying to detect these counterfeits. \u2502\n\u2502 \u2502\n\u2502 The training process involves a minimax two-player game where G tries to maximize the probability of D making a \u2502\n\u2502 mistake, while D tries to minimize it. When both models are defined by multilayer perceptrons, they can be \u2502\n\u2502 trained using backpropagation without the need for Markov chains or approximate inference networks. The \u2502\n\u2502 ultimate goal is for G to perfectly replicate the training data distribution, making D's output equal to 1/2 \u2502\n\u2502 everywhere, indicating it cannot distinguish between real and generated data. This framework allows for \u2502\n\u2502 specific training algorithms and optimization techniques, such as backpropagation and dropout, to be \u2502\n\u2502 effectively utilized. 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <p>We can see that our RAG pipeline performs relatively well for simple queries, especially given the small size of the dataset. Scaling this method for converting a larger sample of PDFs would require more compute (GPUs) and a more advanced deployment of Weaviate (like Docker, Kubernetes, or Weaviate Cloud). For more information on available Weaviate configurations, check out the documentation.</p>"},{"location":"examples/rag_weaviate/#rag-with-weaviate","title":"RAG with Weaviate\u00b6","text":""},{"location":"examples/rag_weaviate/#a-recipe","title":"A recipe \ud83e\uddd1\u200d\ud83c\udf73 \ud83d\udc25 \ud83d\udc9a\u00b6","text":"<p>This is a code recipe that uses Weaviate to perform RAG over PDF documents parsed by Docling.</p> <p>In this notebook, we accomplish the following:</p> <ul> <li>Parse the top machine learning papers on arXiv using Docling</li> <li>Perform hierarchical chunking of the documents using Docling</li> <li>Generate text embeddings with OpenAI</li> <li>Perform RAG using Weaviate</li> </ul> <p>To run this notebook, you'll need:</p> <ul> <li>An OpenAI API key</li> <li>Access to GPU/s</li> </ul> <p>Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running this notebook:</p> <ol> <li>Locally on a MacBook with an Apple Silicon chip. Converting all documents in the notebook takes ~2 minutes on a MacBook M2 due to Docling's usage of MPS accelerators.</li> <li>Run this notebook on Google Colab. Converting all documents in the notebook takes ~8 minutes on a Google Colab T4 GPU.</li> </ol>"},{"location":"examples/rag_weaviate/#install-docling-and-weaviate-client","title":"Install Docling and Weaviate client\u00b6","text":"<p>Note: If Colab prompts you to restart the session after running the cell below, click \"restart\" and proceed with running the rest of the notebook.</p>"},{"location":"examples/rag_weaviate/#part-1-docling","title":"\ud83d\udc25 Part 1: Docling\u00b6","text":"<p>Part of what makes Docling so remarkable is the fact that it can run on commodity hardware. This means that this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with a silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, seamlessly integrating with PyTorch and TensorFlow, offering energy-efficient performance on Apple Silicon, and broad compatibility with all Metal-supported GPUs.</p> <p>The code below checks to see if a GPU is available, either via CUDA or MPS.</p>"},{"location":"examples/rag_weaviate/#convert-pdfs-to-docling-documents","title":"Convert PDFs to Docling documents\u00b6","text":"<p>Here we use Docling's <code>.convert_all()</code> to parse a batch of PDFs. 
The result is a list of Docling documents that we can use for text extraction.</p> <p>Note: Please ignore the <code>ERR#</code> message.</p>"},{"location":"examples/rag_weaviate/#post-process-extracted-document-data","title":"Post-process extracted document data\u00b6","text":""},{"location":"examples/rag_weaviate/#perform-hierarchical-chunking-on-documents","title":"Perform hierarchical chunking on documents\u00b6","text":"<p>We use Docling's <code>HierarchicalChunker()</code> to perform hierarchy-aware chunking of our list of documents. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.</p>"},{"location":"examples/rag_weaviate/#part-2-weaviate","title":"\ud83d\udc9a Part 2: Weaviate\u00b6","text":""},{"location":"examples/rag_weaviate/#create-and-configure-an-embedded-weaviate-collection","title":"Create and configure an embedded Weaviate collection\u00b6","text":""},{"location":"examples/rag_weaviate/#wrangle-data-into-an-acceptable-format-for-weaviate","title":"Wrangle data into an acceptable format for Weaviate\u00b6","text":"<p>Transform our data from lists to a list of dictionaries for insertion into our Weaviate collection.</p>"},{"location":"examples/rag_weaviate/#insert-data-into-weaviate-and-generate-embeddings","title":"Insert data into Weaviate and generate embeddings\u00b6","text":"<p>Embeddings will be generated upon insertion to our Weaviate collection.</p>"},{"location":"examples/rag_weaviate/#query-the-data","title":"Query the data\u00b6","text":"<p>Here, we perform a simple similarity search to return the most similar embedded chunks to our search query.</p>"},{"location":"examples/rag_weaviate/#perform-rag-on-parsed-articles","title":"Perform RAG on parsed articles\u00b6","text":"<p>Weaviate's <code>generate</code> module allows you to perform RAG over your embedded data without having to use a separate framework.</p> <p>We specify a prompt that includes the field we want to search through in the database (in this case it's <code>text</code>), a query that includes our search term, and the number of retrieved results to use in the generation.</p>"},{"location":"examples/rapidocr_with_custom_models/","title":"RapidOCR with custom OCR models","text":"In\u00a0[\u00a0]: Copied! <pre>import os\n</pre> import os In\u00a0[\u00a0]: Copied! <pre>from huggingface_hub import snapshot_download\n</pre> from huggingface_hub import snapshot_download In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions\nfrom docling.document_converter import (\n ConversionResult,\n DocumentConverter,\n InputFormat,\n PdfFormatOption,\n)\n</pre> from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions from docling.document_converter import ( ConversionResult, DocumentConverter, InputFormat, PdfFormatOption, ) In\u00a0[\u00a0]: Copied! 
<pre>def main():\n # Source document to convert\n source = \"https://arxiv.org/pdf/2408.09869v4\"\n\n # Download RapidOCR models from HuggingFace\n print(\"Downloading RapidOCR models\")\n download_path = snapshot_download(repo_id=\"SWHL/RapidOCR\")\n\n # Set up RapidOcrOptions for English detection\n det_model_path = os.path.join(\n download_path, \"PP-OCRv4\", \"en_PP-OCRv3_det_infer.onnx\"\n )\n rec_model_path = os.path.join(\n download_path, \"PP-OCRv4\", \"ch_PP-OCRv4_rec_server_infer.onnx\"\n )\n cls_model_path = os.path.join(\n download_path, \"PP-OCRv3\", \"ch_ppocr_mobile_v2.0_cls_train.onnx\"\n )\n ocr_options = RapidOcrOptions(\n det_model_path=det_model_path,\n rec_model_path=rec_model_path,\n cls_model_path=cls_model_path,\n )\n\n pipeline_options = PdfPipelineOptions(\n ocr_options=ocr_options,\n )\n\n # Convert the document\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n ),\n },\n )\n\n conversion_result: ConversionResult = converter.convert(source=source)\n doc = conversion_result.document\n md = doc.export_to_markdown()\n print(md)\n</pre> def main(): # Source document to convert source = \"https://arxiv.org/pdf/2408.09869v4\" # Download RapidOCR models from HuggingFace print(\"Downloading RapidOCR models\") download_path = snapshot_download(repo_id=\"SWHL/RapidOCR\") # Set up RapidOcrOptions for English detection det_model_path = os.path.join( download_path, \"PP-OCRv4\", \"en_PP-OCRv3_det_infer.onnx\" ) rec_model_path = os.path.join( download_path, \"PP-OCRv4\", \"ch_PP-OCRv4_rec_server_infer.onnx\" ) cls_model_path = os.path.join( download_path, \"PP-OCRv3\", \"ch_ppocr_mobile_v2.0_cls_train.onnx\" ) ocr_options = RapidOcrOptions( det_model_path=det_model_path, rec_model_path=rec_model_path, cls_model_path=cls_model_path, ) pipeline_options = PdfPipelineOptions( ocr_options=ocr_options, ) # Convert the document converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ), }, ) conversion_result: ConversionResult = converter.convert(source=source) doc = conversion_result.document md = doc.export_to_markdown() print(md) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/retrieval_qdrant/","title":"Retrieval with Qdrant","text":"<table><tbody><tr><th>Step</th><th>Tech</th><th>Execution</th></tr><tr><td>Embedding</td><td>FastEmbed</td><td>\ud83d\udcbb Local</td></tr><tr><td>Vector store</td><td>Qdrant</td><td>\ud83d\udcbb Local</td></tr></tbody></table> <p>This example demonstrates using Docling with Qdrant to perform a hybrid search across your documents using dense and sparse vectors.</p> <p>We'll chunk the documents using Docling before adding them to a Qdrant collection. By limiting the length of the chunks, we can preserve the meaning in each vector embedding.</p> <ul> <li>\ud83d\udc49 Qdrant client uses FastEmbed to generate vector embeddings. You can install the <code>fastembed-gpu</code> package if you've got the hardware to support it.</li> </ul> In\u00a0[1]: Copied! <pre>%pip install --no-warn-conflicts -q qdrant-client docling fastembed\n</pre> %pip install --no-warn-conflicts -q qdrant-client docling fastembed <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> <p>Let's import all the classes we'll be working with.</p> In\u00a0[2]: Copied! 
<pre>from qdrant_client import QdrantClient\n\nfrom docling.chunking import HybridChunker\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter\n</pre> from qdrant_client import QdrantClient from docling.chunking import HybridChunker from docling.datamodel.base_models import InputFormat from docling.document_converter import DocumentConverter <ul> <li>For Docling, we'll set the allowed formats to HTML since we'll only be working with webpages in this tutorial.</li> <li>If we set a sparse model, Qdrant client will fuse the dense and sparse results using RRF. Reference.</li> </ul> In\u00a0[3]: Copied! <pre>COLLECTION_NAME = \"docling\"\n\ndoc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML])\nclient = QdrantClient(location=\":memory:\")\n# The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI.\n# For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant\n# client = QdrantClient(location=\"http://localhost:6333\")\n\nclient.set_model(\"sentence-transformers/all-MiniLM-L6-v2\")\nclient.set_sparse_model(\"Qdrant/bm25\")\n</pre> COLLECTION_NAME = \"docling\" doc_converter = DocumentConverter(allowed_formats=[InputFormat.HTML]) client = QdrantClient(location=\":memory:\") # The :memory: mode is a Python imitation of Qdrant's APIs for prototyping and CI. # For production deployments, use the Docker image: docker run -p 6333:6333 qdrant/qdrant # client = QdrantClient(location=\"http://localhost:6333\") client.set_model(\"sentence-transformers/all-MiniLM-L6-v2\") client.set_sparse_model(\"Qdrant/bm25\") <pre>/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/tqdm.py:155: UserWarning: Cannot enable progress bars: environment variable `HF_HUB_DISABLE_PROGRESS_BARS=1` is set and has priority.\n warnings.warn(\n</pre> <p>We can now download and chunk the document using Docling. For demonstration, we'll use an article about chunking strategies :)</p> In\u00a0[4]: Copied! <pre>result = doc_converter.convert(\n \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\"\n)\ndocuments, metadatas = [], []\nfor chunk in HybridChunker().chunk(result.document):\n documents.append(chunk.text)\n metadatas.append(chunk.meta.export_json_dict())\n</pre> result = doc_converter.convert( \"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag\" ) documents, metadatas = [], [] for chunk in HybridChunker().chunk(result.document): documents.append(chunk.text) metadatas.append(chunk.meta.export_json_dict()) <p>Let's now upload the documents to Qdrant.</p> <ul> <li>The <code>add()</code> method batches the documents and uses FastEmbed to generate vector embeddings on our machine.</li> </ul> In\u00a0[5]: Copied! <pre>_ = client.add(\n collection_name=COLLECTION_NAME,\n documents=documents,\n metadata=metadatas,\n batch_size=64,\n)\n</pre> _ = client.add( collection_name=COLLECTION_NAME, documents=documents, metadata=metadatas, batch_size=64, ) In\u00a0[6]: Copied! <pre>points = client.query(\n collection_name=COLLECTION_NAME,\n query_text=\"Can I split documents?\",\n limit=10,\n)\n</pre> points = client.query( collection_name=COLLECTION_NAME, query_text=\"Can I split documents?\", limit=10, ) In\u00a0[7]: Copied! 
<pre>for i, point in enumerate(points):\n print(f\"=== {i} ===\")\n print(point.document)\n print()\n</pre> for i, point in enumerate(points): print(f\"=== {i} ===\") print(point.document) print() <pre>=== 0 ===\nHave you ever wondered how we, humans, would chunk? Here's a breakdown of a possible way a human would process a new document:\n1. We start at the top of the document, treating the first part as a chunk.\n\u00a0\u00a0\u00a02. We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.\n \u00a0\u00a0\u00a03. We keep this up until we reach the end of the document.\nThe ultimate dream? Having an agent do this for you. But slow down! This approach is still being tested and isn't quite ready for the big leagues due to the time it takes to process multiple LLM calls and the cost of those calls. There's no implementation available in public libraries just yet. However, Greg Kamradt has his version available here.\n\n=== 1 ===\nDocument Specific Chunking is a strategy that respects the document's structure. Rather than using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, like paragraphs or subsections. This approach maintains the original author's organization of content and helps keep the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections.\nDocument Specific Chunking can handle a variety of document formats, such as:\nMarkdown\nHTML\nPython\netc\nHere we\u2019ll take Markdown as our example and use a modified version of our first sample text:\n\u200d\nThe result is the following:\nYou can see here that with a chunk size of 105, the Markdown structure of the document is taken into account, and the chunks thus preserve the semantics of the text!\n\n=== 2 ===\nAnd there you have it! These chunking strategies are like a personal toolbox when it comes to implementing Retrieval Augmented Generation. They're a ton of ways to slice and dice text, each with its unique features and quirks. This variety gives you the freedom to pick the strategy that suits your project best, allowing you to tailor your approach to perfectly fit the unique needs of your work.\nTo put these strategies into action, there's a whole array of tools and libraries at your disposal. For example, llama_index is a fantastic tool that lets you create document indices and retrieve chunked documents. Let's not forget LangChain, another remarkable tool that makes implementing chunking strategies a breeze, particularly when dealing with multi-language data. Diving into these tools and understanding how they can work in harmony with the chunking strategies we've discussed is a crucial part of mastering Retrieval Augmented Generation.\nBy the way, if you're eager to experiment with your own examples using the chunking visualisation tool featured in this blog, feel free to give it a try! You can access it right here. Enjoy, and happy chunking! \ud83d\ude09\n\n=== 3 ===\nRetrieval Augmented Generation (RAG) has been a hot topic in understanding, interpreting, and generating text with AI for the last few months. 
It's like a wonderful union of retrieval-based and generative models, creating a playground for researchers, data scientists, and natural language processing enthusiasts, like you and me.\nTo truly control the results produced by our RAG, we need to understand chunking strategies and their role in the process of retrieving and generating text. Indeed, each chunking strategy enhances RAG's effectiveness in its unique way.\nThe goal of chunking is, as its name says, to chunk the information into multiple smaller pieces in order to store it in a more efficient and meaningful way. This allows the retrieval to capture pieces of information that are more related to the question at hand, and the generation to be more precise, but also less costly, as only a part of a document will be included in the LLM prompt, instead of the whole document.\nLet's explore some chunking strategies together.\nThe methods mentioned in the article you're about to read usually make use of two key parameters. First, we have [chunk_size]\u2014 which controls the size of your text chunks. Then there's [chunk_overlap], which takes care of how much text overlaps between one chunk and the next.\n\n=== 4 ===\nSemantic Chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach ensures the information's integrity during retrieval, leading to a more accurate and contextually appropriate outcome.\nSemantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.\nBy focusing on the text's meaning and context, Semantic Chunking significantly enhances the quality of retrieval. It's a top-notch choice when maintaining the semantic integrity of the text is vital.\nHowever, this method does require more effort and is notably slower than the previous ones.\nOn our example text, since it is quite short and does not expose varied subjects, this method would only generate a single chunk.\n\n=== 5 ===\nLanguage models used in the rest of your possible RAG pipeline have a token limit, which should not be exceeded. When dividing your text into chunks, it's advisable to count the number of tokens. Plenty of tokenizers are available. To ensure accuracy, use the same tokenizer for counting tokens as the one used in the language model.\nConsequently, there are also splitters available for this purpose.\nFor instance, by using the [SpacyTextSplitter] from LangChain, the following chunks are created:\n\u200d\n\n=== 6 ===\nFirst things first, we have Character Chunking. This strategy divides the text into chunks based on a fixed number of characters. Its simplicity makes it a great starting point, but it can sometimes disrupt the text's flow, breaking sentences or words in unexpected places. Despite its limitations, it's a great stepping stone towards more advanced methods.\nNow let\u2019s see that in action with an example. Imagine a text that reads:\nIf we decide to set our chunk size to 100 and no chunk overlap, we'd end up with the following chunks. 
As you can see, Character Chunking can lead to some intriguing, albeit sometimes nonsensical, results, cutting some of the sentences in their middle.\nBy choosing a smaller chunk size, \u00a0we would obtain more chunks, and by setting a bigger chunk overlap, we could obtain something like this:\n\u200d\nAlso, by default this method creates chunks character by character based on the empty character [\u2019 \u2019]. But you can specify a different one in order to chunk on something else, even a complete word! For instance, by specifying [' '] as the separator, you can avoid cutting words in their middle.\n\n=== 7 ===\nNext, let's take a look at Recursive Character Chunking. Based on the basic concept of Character Chunking, this advanced version takes it up a notch by dividing the text into chunks until a certain condition is met, such as reaching a minimum chunk size. This method ensures that the chunking process aligns with the text's structure, preserving more meaning. Its adaptability makes Recursive Character Chunking great for texts with varied structures.\nAgain, let\u2019s use the same example in order to illustrate this method. With a chunk size of 100, and the default settings for the other parameters, we obtain the following chunks:\n\n</pre> In\u00a0[\u00a0]: Copied! <pre>\n</pre>"},{"location":"examples/retrieval_qdrant/#retrieval-with-qdrant","title":"Retrieval with Qdrant\u00b6","text":""},{"location":"examples/retrieval_qdrant/#overview","title":"Overview\u00b6","text":""},{"location":"examples/retrieval_qdrant/#setup","title":"Setup\u00b6","text":""},{"location":"examples/retrieval_qdrant/#retrieval","title":"Retrieval\u00b6","text":""},{"location":"examples/run_md/","title":"Run md","text":"In\u00a0[\u00a0]: Copied! <pre>import json\nimport logging\nimport os\nfrom pathlib import Path\n</pre> import json import logging import os from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>import yaml\n</pre> import yaml In\u00a0[\u00a0]: Copied! <pre>from docling.backend.md_backend import MarkdownDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import InputDocument\n</pre> from docling.backend.md_backend import MarkdownDocumentBackend from docling.datamodel.base_models import InputFormat from docling.datamodel.document import InputDocument In\u00a0[\u00a0]: Copied! <pre>_log = logging.getLogger(__name__)\n</pre> _log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied! 
<pre>def main():\n input_paths = [Path(\"README.md\")]\n\n for path in input_paths:\n in_doc = InputDocument(\n path_or_stream=path,\n format=InputFormat.MD,\n backend=MarkdownDocumentBackend,\n )\n mdb = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=path)\n document = mdb.convert()\n\n out_path = Path(\"scratch\")\n out_path.mkdir(parents=True, exist_ok=True)\n print(f\"Document {path} converted.\\nSaved markdown output to: {out_path!s}\")\n\n # Export Docling document format to Markdown, JSON, and YAML:\n fn = os.path.basename(path)\n\n with (out_path / f\"{fn}.md\").open(\"w\") as fp:\n fp.write(document.export_to_markdown())\n\n with (out_path / f\"{fn}.json\").open(\"w\") as fp:\n fp.write(json.dumps(document.export_to_dict()))\n\n with (out_path / f\"{fn}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(document.export_to_dict()))\n</pre> def main(): input_paths = [Path(\"README.md\")] for path in input_paths: in_doc = InputDocument( path_or_stream=path, format=InputFormat.MD, backend=MarkdownDocumentBackend, ) mdb = MarkdownDocumentBackend(in_doc=in_doc, path_or_stream=path) document = mdb.convert() out_path = Path(\"scratch\") out_path.mkdir(parents=True, exist_ok=True) print(f\"Document {path} converted.\\nSaved markdown output to: {out_path!s}\") # Export Docling document format to Markdown, JSON, and YAML: fn = os.path.basename(path) with (out_path / f\"{fn}.md\").open(\"w\") as fp: fp.write(document.export_to_markdown()) with (out_path / f\"{fn}.json\").open(\"w\") as fp: fp.write(json.dumps(document.export_to_dict())) with (out_path / f\"{fn}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(document.export_to_dict())) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/run_with_accelerator/","title":"Accelerator options","text":"In\u00a0[\u00a0]: Copied! <pre>from pathlib import Path\n</pre> from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n)\nfrom docling.datamodel.settings import settings\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, ) from docling.datamodel.settings import settings from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied! 
<pre>def main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n # Explicitly set the accelerator\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.AUTO\n # )\n accelerator_options = AcceleratorOptions(\n num_threads=8, device=AcceleratorDevice.CPU\n )\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.MPS\n # )\n # accelerator_options = AcceleratorOptions(\n # num_threads=8, device=AcceleratorDevice.CUDA\n # )\n\n # easyocr doesn't support cuda:N allocation; it defaults to cuda:0\n # accelerator_options = AcceleratorOptions(num_threads=8, device=\"cuda:1\")\n\n pipeline_options = PdfPipelineOptions()\n pipeline_options.accelerator_options = accelerator_options\n pipeline_options.do_ocr = True\n pipeline_options.do_table_structure = True\n pipeline_options.table_structure_options.do_cell_matching = True\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n # Enable profiling to measure the time spent\n settings.debug.profile_pipeline_timings = True\n\n # Convert the document\n conversion_result = converter.convert(input_doc_path)\n doc = conversion_result.document\n\n # List with total time per document\n doc_conversion_secs = conversion_result.timings[\"pipeline_total\"].times\n\n md = doc.export_to_markdown()\n print(md)\n print(f\"Conversion secs: {doc_conversion_secs}\")\n</pre> def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" # Explicitly set the accelerator # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.AUTO # ) accelerator_options = AcceleratorOptions( num_threads=8, device=AcceleratorDevice.CPU ) # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.MPS # ) # accelerator_options = AcceleratorOptions( # num_threads=8, device=AcceleratorDevice.CUDA # ) # easyocr doesn't support cuda:N allocation; it defaults to cuda:0 # accelerator_options = AcceleratorOptions(num_threads=8, device=\"cuda:1\") pipeline_options = PdfPipelineOptions() pipeline_options.accelerator_options = accelerator_options pipeline_options.do_ocr = True pipeline_options.do_table_structure = True pipeline_options.table_structure_options.do_cell_matching = True converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) # Enable profiling to measure the time spent settings.debug.profile_pipeline_timings = True # Convert the document conversion_result = converter.convert(input_doc_path) doc = conversion_result.document # List with total time per document doc_conversion_secs = conversion_result.timings[\"pipeline_total\"].times md = doc.export_to_markdown() print(md) print(f\"Conversion secs: {doc_conversion_secs}\") In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/run_with_formats/","title":"Multi-format conversion","text":"In\u00a0[\u00a0]: Copied! <pre>import json\nimport logging\nfrom pathlib import Path\n</pre> import json import logging from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>import yaml\n</pre> import yaml In\u00a0[\u00a0]: Copied! 
<pre>from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import (\n DocumentConverter,\n PdfFormatOption,\n WordFormatOption,\n)\nfrom docling.pipeline.simple_pipeline import SimplePipeline\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n</pre> from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend from docling.datamodel.base_models import InputFormat from docling.document_converter import ( DocumentConverter, PdfFormatOption, WordFormatOption, ) from docling.pipeline.simple_pipeline import SimplePipeline from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline In\u00a0[\u00a0]: Copied! <pre>_log = logging.getLogger(__name__)\n</pre> _log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied! <pre>def main():\n input_paths = [\n Path(\"README.md\"),\n Path(\"tests/data/html/wiki_duck.html\"),\n Path(\"tests/data/docx/word_sample.docx\"),\n Path(\"tests/data/docx/lorem_ipsum.docx\"),\n Path(\"tests/data/pptx/powerpoint_sample.pptx\"),\n Path(\"tests/data/2305.03393v1-pg9-img.png\"),\n Path(\"tests/data/pdf/2206.01062.pdf\"),\n Path(\"tests/data/asciidoc/test_01.asciidoc\"),\n ]\n\n ## for defaults use:\n # doc_converter = DocumentConverter()\n\n ## to customize use:\n\n doc_converter = (\n DocumentConverter( # all of the below is optional, has internal defaults.\n allowed_formats=[\n InputFormat.PDF,\n InputFormat.IMAGE,\n InputFormat.DOCX,\n InputFormat.HTML,\n InputFormat.PPTX,\n InputFormat.ASCIIDOC,\n InputFormat.CSV,\n InputFormat.MD,\n ], # whitelist formats, non-matching files are ignored.\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend\n ),\n InputFormat.DOCX: WordFormatOption(\n pipeline_cls=SimplePipeline # , backend=MsWordDocumentBackend\n ),\n },\n )\n )\n\n conv_results = doc_converter.convert_all(input_paths)\n\n for res in conv_results:\n out_path = Path(\"scratch\")\n out_path.mkdir(parents=True, exist_ok=True)\n print(\n f\"Document {res.input.file.name} converted.\"\n f\"\\nSaved markdown output to: {out_path!s}\"\n )\n _log.debug(res.document._export_to_indented_text(max_text_len=16))\n # Export Docling document format to Markdown, JSON, and YAML:\n with (out_path / f\"{res.input.file.stem}.md\").open(\"w\") as fp:\n fp.write(res.document.export_to_markdown())\n\n with (out_path / f\"{res.input.file.stem}.json\").open(\"w\") as fp:\n fp.write(json.dumps(res.document.export_to_dict()))\n\n with (out_path / f\"{res.input.file.stem}.yaml\").open(\"w\") as fp:\n fp.write(yaml.safe_dump(res.document.export_to_dict()))\n</pre> def main(): input_paths = [ Path(\"README.md\"), Path(\"tests/data/html/wiki_duck.html\"), Path(\"tests/data/docx/word_sample.docx\"), Path(\"tests/data/docx/lorem_ipsum.docx\"), Path(\"tests/data/pptx/powerpoint_sample.pptx\"), Path(\"tests/data/2305.03393v1-pg9-img.png\"), Path(\"tests/data/pdf/2206.01062.pdf\"), Path(\"tests/data/asciidoc/test_01.asciidoc\"), ] ## for defaults use: # doc_converter = DocumentConverter() ## to customize use: doc_converter = ( DocumentConverter( # all of the below is optional, has internal defaults. allowed_formats=[ InputFormat.PDF, InputFormat.IMAGE, InputFormat.DOCX, InputFormat.HTML, InputFormat.PPTX, InputFormat.ASCIIDOC, InputFormat.CSV, InputFormat.MD, ], # whitelist formats, non-matching files are ignored. 
format_options={ InputFormat.PDF: PdfFormatOption( pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend ), InputFormat.DOCX: WordFormatOption( pipeline_cls=SimplePipeline # , backend=MsWordDocumentBackend ), }, ) ) conv_results = doc_converter.convert_all(input_paths) for res in conv_results: out_path = Path(\"scratch\") out_path.mkdir(parents=True, exist_ok=True) print( f\"Document {res.input.file.name} converted.\" f\"\\nSaved markdown output to: {out_path!s}\" ) _log.debug(res.document._export_to_indented_text(max_text_len=16)) # Export Docling document format to Markdown, JSON, and YAML: with (out_path / f\"{res.input.file.stem}.md\").open(\"w\") as fp: fp.write(res.document.export_to_markdown()) with (out_path / f\"{res.input.file.stem}.json\").open(\"w\") as fp: fp.write(json.dumps(res.document.export_to_dict())) with (out_path / f\"{res.input.file.stem}.yaml\").open(\"w\") as fp: fp.write(yaml.safe_dump(res.document.export_to_dict())) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/serialization/","title":"Serialization","text":"<p>In this notebook we showcase the usage of Docling serializers.</p> In\u00a0[1]: Copied! <pre>%pip install -qU pip docling docling-core~=2.29 rich\n</pre> %pip install -qU pip docling docling-core~=2.29 rich <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> In\u00a0[2]: Copied! <pre>DOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\"\n\n# we set some start-stop cues for defining an excerpt to print\nstart_cue = \"Copyright \u00a9 2024\"\nstop_cue = \"Application of NLP to ESG\"\n</pre> DOC_SOURCE = \"https://arxiv.org/pdf/2311.18481\" # we set some start-stop cues for defining an excerpt to print start_cue = \"Copyright \u00a9 2024\" stop_cue = \"Application of NLP to ESG\" In\u00a0[3]: Copied! <pre>from rich.console import Console\nfrom rich.panel import Panel\n\nconsole = Console(width=210) # for preventing Markdown table wrapped rendering\n\n\ndef print_in_console(text):\n console.print(Panel(text))\n</pre> from rich.console import Console from rich.panel import Panel console = Console(width=210) # for preventing Markdown table wrapped rendering def print_in_console(text): console.print(Panel(text)) <p>We first convert the document:</p> In\u00a0[4]: Copied! <pre>from docling.document_converter import DocumentConverter\n\nconverter = DocumentConverter()\ndoc = converter.convert(source=DOC_SOURCE).document\n</pre> from docling.document_converter import DocumentConverter converter = DocumentConverter() doc = converter.convert(source=DOC_SOURCE).document <pre>/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n warnings.warn(warn_msg)\n</pre> <p>We can now apply any <code>BaseDocSerializer</code> on the produced document.</p> <p>\ud83d\udc49 Note that, to keep the shown output brief, we only print an excerpt.</p> <p>E.g. below we apply an <code>HTMLDocSerializer</code>:</p> In\u00a0[5]: Copied! 
<pre>from docling_core.transforms.serializer.html import HTMLDocSerializer\n\nserializer = HTMLDocSerializer(doc=doc)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\n# we here only print an excerpt to keep the output brief:\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\n</pre> from docling_core.transforms.serializer.html import HTMLDocSerializer serializer = HTMLDocSerializer(doc=doc) ser_result = serializer.serialize() ser_text = ser_result.text # we here only print an excerpt to keep the output brief: print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)]) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p> \u2502\n\u2502 <table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM \u2502\n\u2502 2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in \u2502\n\u2502 India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con- \u2502\n\u2502 sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate- \u2502\n\u2502 rials?</td><td>According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.</td></tr></tbody></table> \u2502\n\u2502 <p>Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.</p> \u2502\n\u2502 <p>ESG report in our library via our QA conversational assistant. 
Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response.</p> \u2502\n\u2502 <h2>Related Work</h2> \u2502\n\u2502 <p>The DocQA integrates multiple AI technologies, namely:</p> \u2502\n\u2502 <p>Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric \u2502\n\u2502 layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et \u2502\n\u2502 al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-</p> \u2502\n\u2502 <figure><figcaption>Figure 1: System architecture: Simplified sketch of document question-answering pipeline.</figcaption></figure> \u2502\n\u2502 <p>based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).</p> \u2502\n\u2502 <p> \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <p>In the following example, we use a <code>MarkdownDocSerializer</code>:</p> In\u00a0[6]: Copied! 
<pre>from docling_core.transforms.serializer.markdown import MarkdownDocSerializer\n\nserializer = MarkdownDocSerializer(doc=doc)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\n</pre> from docling_core.transforms.serializer.markdown import MarkdownDocSerializer serializer = MarkdownDocSerializer(doc=doc) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)]) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 | Report | Question | Answer | \u2502\n\u2502 |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| \u2502\n\u2502 | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | \u2502\n\u2502 | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | \u2502\n\u2502 | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | \u2502\n\u2502 | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | \u2502\n\u2502 | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | \u2502\n\u2502 | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. 
\u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 \u2502\n\u2502 <!-- image --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <p>Let's now assume we would like to reconfigure the Markdown serialization such that:</p> <ul> <li>it uses a different component serializer, e.g. if we'd prefer tables to be printed in a triplet format (which could potentially improve the vector representation compared to Markdown tables)</li> <li>it uses specific user-defined parameters, e.g. if we'd prefer a different image placeholder text than the default one</li> </ul> <p>Check out the following configuration and notice the serialization differences in the output further below:</p> In\u00a0[7]: Copied! 
<pre>from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer\nfrom docling_core.transforms.serializer.markdown import MarkdownParams\n\nserializer = MarkdownDocSerializer(\n doc=doc,\n table_serializer=TripletTableSerializer(),\n params=MarkdownParams(\n image_placeholder=\"<!-- demo picture placeholder -->\",\n # ...\n ),\n)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\n</pre> from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer from docling_core.transforms.serializer.markdown import MarkdownParams serializer = MarkdownDocSerializer( doc=doc, table_serializer=TripletTableSerializer(), params=MarkdownParams( image_placeholder=\"<!-- demo picture placeholder -->\", # ... ), ) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)]) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The \u2502\n\u2502 rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the \u2502\n\u2502 Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption \u2502\n\u2502 in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were \u2502\n\u2502 made from recycled or renewable materials in FY22. \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. 
Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. \u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 \u2502\n\u2502 <!-- demo picture placeholder --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). \u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre> <p>In the examples above, we were able to reuse existing implementations for our desired serialization strategy, but let's now assume we want to define a custom serialization logic, e.g. we would like picture serialization to include any available picture description (captioning) annotations.</p> <p>To that end, we first need to revisit our conversion and include all pipeline options needed for picture description enrichment.</p> In\u00a0[8]: Copied! 
<pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionVlmOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions(\n do_picture_description=True,\n picture_description_options=PictureDescriptionVlmOptions(\n repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\",\n prompt=\"Describe this picture in three to five sentences. Be precise and concise.\",\n ),\n generate_picture_images=True,\n images_scale=2,\n)\n\nconverter = DocumentConverter(\n format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\ndoc = converter.convert(source=DOC_SOURCE).document\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, PictureDescriptionVlmOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption pipeline_options = PdfPipelineOptions( do_picture_description=True, picture_description_options=PictureDescriptionVlmOptions( repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\", prompt=\"Describe this picture in three to five sentences. Be precise and concise.\", ), generate_picture_images=True, images_scale=2, ) converter = DocumentConverter( format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)} ) doc = converter.convert(source=DOC_SOURCE).document <pre>/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used.\n warnings.warn(warn_msg)\n</pre> <p>We can then define our custom picture serializer:</p> In\u00a0[9]: Copied! 
<pre>from typing import Any, Optional\n\nfrom docling_core.transforms.serializer.base import (\n BaseDocSerializer,\n SerializationResult,\n)\nfrom docling_core.transforms.serializer.common import create_ser_result\nfrom docling_core.transforms.serializer.markdown import (\n MarkdownParams,\n MarkdownPictureSerializer,\n)\nfrom docling_core.types.doc.document import (\n DoclingDocument,\n ImageRefMode,\n PictureDescriptionData,\n PictureItem,\n)\nfrom typing_extensions import override\n\n\nclass AnnotationPictureSerializer(MarkdownPictureSerializer):\n @override\n def serialize(\n self,\n *,\n item: PictureItem,\n doc_serializer: BaseDocSerializer,\n doc: DoclingDocument,\n separator: Optional[str] = None,\n **kwargs: Any,\n ) -> SerializationResult:\n text_parts: list[str] = []\n\n # reusing the existing result:\n parent_res = super().serialize(\n item=item,\n doc_serializer=doc_serializer,\n doc=doc,\n **kwargs,\n )\n text_parts.append(parent_res.text)\n\n # appending annotations:\n for annotation in item.annotations:\n if isinstance(annotation, PictureDescriptionData):\n text_parts.append(f\"<!-- Picture description: {annotation.text} -->\")\n\n text_res = (separator or \"\\n\").join(text_parts)\n return create_ser_result(text=text_res, span_source=item)\n</pre> from typing import Any, Optional from docling_core.transforms.serializer.base import ( BaseDocSerializer, SerializationResult, ) from docling_core.transforms.serializer.common import create_ser_result from docling_core.transforms.serializer.markdown import ( MarkdownParams, MarkdownPictureSerializer, ) from docling_core.types.doc.document import ( DoclingDocument, ImageRefMode, PictureDescriptionData, PictureItem, ) from typing_extensions import override class AnnotationPictureSerializer(MarkdownPictureSerializer): @override def serialize( self, *, item: PictureItem, doc_serializer: BaseDocSerializer, doc: DoclingDocument, separator: Optional[str] = None, **kwargs: Any, ) -> SerializationResult: text_parts: list[str] = [] # reusing the existing result: parent_res = super().serialize( item=item, doc_serializer=doc_serializer, doc=doc, **kwargs, ) text_parts.append(parent_res.text) # appending annotations: for annotation in item.annotations: if isinstance(annotation, PictureDescriptionData): text_parts.append(f\"<!-- Picture description: {annotation.text} -->\") text_res = (separator or \"\\n\").join(text_parts) return create_ser_result(text=text_res, span_source=item) <p>Last but not least, we define a new doc serializer which leverages our custom picture serializer.</p> <p>Notice the picture description annotations in the output below:</p> In\u00a0[10]: Copied! 
<pre>serializer = MarkdownDocSerializer(\n doc=doc,\n picture_serializer=AnnotationPictureSerializer(),\n params=MarkdownParams(\n image_mode=ImageRefMode.PLACEHOLDER,\n image_placeholder=\"\",\n ),\n)\nser_result = serializer.serialize()\nser_text = ser_result.text\n\nprint_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])\n</pre> serializer = MarkdownDocSerializer( doc=doc, picture_serializer=AnnotationPictureSerializer(), params=MarkdownParams( image_mode=ImageRefMode.PLACEHOLDER, image_placeholder=\"\", ), ) ser_result = serializer.serialize() ser_text = ser_result.text print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)]) <pre>\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Copyright \u00a9 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. \u2502\n\u2502 \u2502\n\u2502 | Report | Question | Answer | \u2502\n\u2502 |----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| \u2502\n\u2502 | IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours | \u2502\n\u2502 | IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. | \u2502\n\u2502 | IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 | \u2502\n\u2502 | Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% | \u2502\n\u2502 | Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. | \u2502\n\u2502 | Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. | \u2502\n\u2502 \u2502\n\u2502 Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system. \u2502\n\u2502 \u2502\n\u2502 ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the \u2502\n\u2502 response. 
\u2502\n\u2502 \u2502\n\u2502 ## Related Work \u2502\n\u2502 \u2502\n\u2502 The DocQA integrates multiple AI technologies, namely: \u2502\n\u2502 \u2502\n\u2502 Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout \u2502\n\u2502 analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al. \u2502\n\u2502 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based . \u2502\n\u2502 Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning- \u2502\n\u2502 \u2502\n\u2502 Figure 1: System architecture: Simplified sketch of document question-answering pipeline. \u2502\n\u2502 <!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document \u2502\n\u2502 conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response \u2502\n\u2502 generation step involves generating a response from the information retrieval step. --> \u2502\n\u2502 \u2502\n\u2502 based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018). 
\u2502\n\u2502 \u2502\n\u2502 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n</pre>"},{"location":"examples/serialization/#serialization","title":"Serialization\u00b6","text":""},{"location":"examples/serialization/#overview","title":"Overview\u00b6","text":""},{"location":"examples/serialization/#setup","title":"Setup\u00b6","text":""},{"location":"examples/serialization/#basic-usage","title":"Basic usage\u00b6","text":""},{"location":"examples/serialization/#configuring-a-serializer","title":"Configuring a serializer\u00b6","text":""},{"location":"examples/serialization/#creating-a-custom-serializer","title":"Creating a custom serializer\u00b6","text":""},{"location":"examples/tesseract_lang_detection/","title":"Automatic OCR language detection with tesseract","text":"In\u00a0[\u00a0]: Copied! <pre>from pathlib import Path\n</pre> from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n TesseractCliOcrOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( PdfPipelineOptions, TesseractCliOcrOptions, ) from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied! 
<pre>def main():\n data_folder = Path(__file__).parent / \"../../tests/data\"\n input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n\n # Set lang=[\"auto\"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions\n # ocr_options = TesseractOcrOptions(lang=[\"auto\"])\n ocr_options = TesseractCliOcrOptions(lang=[\"auto\"])\n\n pipeline_options = PdfPipelineOptions(\n do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options\n )\n\n converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=pipeline_options,\n )\n }\n )\n\n doc = converter.convert(input_doc_path).document\n md = doc.export_to_markdown()\n print(md)\n</pre> def main(): data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" # Set lang=[\"auto\"] with a tesseract OCR engine: TesseractOcrOptions, TesseractCliOcrOptions # ocr_options = TesseractOcrOptions(lang=[\"auto\"]) ocr_options = TesseractCliOcrOptions(lang=[\"auto\"]) pipeline_options = PdfPipelineOptions( do_ocr=True, force_full_page_ocr=True, ocr_options=ocr_options ) converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, ) } ) doc = converter.convert(input_doc_path).document md = doc.export_to_markdown() print(md) In\u00a0[\u00a0]: Copied! <pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/translate/","title":"Simple translation","text":"In\u00a0[\u00a0]: Copied! <pre>import logging\nfrom pathlib import Path\n</pre> import logging from pathlib import Path In\u00a0[\u00a0]: Copied! <pre>from docling_core.types.doc import ImageRefMode, TableItem, TextItem\n</pre> from docling_core.types.doc import ImageRefMode, TableItem, TextItem In\u00a0[\u00a0]: Copied! <pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption In\u00a0[\u00a0]: Copied! <pre>_log = logging.getLogger(__name__)\n</pre> _log = logging.getLogger(__name__) In\u00a0[\u00a0]: Copied! <pre>IMAGE_RESOLUTION_SCALE = 2.0\n</pre> IMAGE_RESOLUTION_SCALE = 2.0 In\u00a0[\u00a0]: Copied! <pre># FIXME: put in your favorite translation code ....\ndef translate(text: str, src: str = \"en\", dest: str = \"de\"):\n _log.warning(\"!!! IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!\")\n # from googletrans import Translator\n\n # Initialize the translator\n # translator = Translator()\n\n # Translate text from English to German\n # text = \"Hello, how are you?\"\n # translated = translator.translate(text, src=\"en\", dest=\"de\")\n\n return text\n</pre> # FIXME: put in your favorite translation code .... def translate(text: str, src: str = \"en\", dest: str = \"de\"): _log.warning(\"!!! IMPLEMENT HERE YOUR FAVORITE TRANSLATION CODE!!!\") # from googletrans import Translator # Initialize the translator # translator = Translator() # Translate text from English to German # text = \"Hello, how are you?\" # translated = translator.translate(text, src=\"en\", dest=\"de\") return text In\u00a0[\u00a0]: Copied! 
<pre>def main():\n    logging.basicConfig(level=logging.INFO)\n\n    data_folder = Path(__file__).parent / \"../../tests/data\"\n    input_doc_path = data_folder / \"pdf/2206.01062.pdf\"\n    output_dir = Path(\"scratch\")\n    output_dir.mkdir(parents=True, exist_ok=True)\n\n    # Important: For operating with page images, we must keep them, otherwise the DocumentConverter\n    # will discard them to free memory.\n    # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images.\n    # scale=1 corresponds to a standard 72 DPI image\n    # The PdfPipelineOptions.generate_* options select the document elements which will be enriched\n    # with the image field\n    pipeline_options = PdfPipelineOptions()\n    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE\n    pipeline_options.generate_page_images = True\n    pipeline_options.generate_picture_images = True\n\n    doc_converter = DocumentConverter(\n        format_options={\n            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n        }\n    )\n\n    conv_res = doc_converter.convert(input_doc_path)\n    conv_doc = conv_res.document\n    doc_filename = conv_res.input.file\n\n    # Save markdown with embedded pictures in original text\n    md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\"\n    conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n\n    for element, _level in conv_res.document.iterate_items():\n        if isinstance(element, TextItem):\n            element.orig = element.text\n            element.text = translate(text=element.text)\n\n        elif isinstance(element, TableItem):\n            for cell in element.data.table_cells:\n                cell.text = translate(text=cell.text)\n\n    # Save markdown with embedded pictures in translated text\n    md_filename = output_dir / f\"{doc_filename}-with-images-translated.md\"\n    conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)\n</pre> def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2206.01062.pdf\" output_dir = Path(\"scratch\") output_dir.mkdir(parents=True, exist_ok=True) # Important: For operating with page images, we must keep them, otherwise the DocumentConverter # will discard them to free memory. # This is done by setting PdfPipelineOptions.images_scale, which also defines the scale of images. 
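# (Without kept images, the ImageRefMode.EMBEDDED exports below would have nothing to embed.) 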
# scale=1 corresponds to a standard 72 DPI image # The PdfPipelineOptions.generate_* options select the document elements which will be enriched # with the image field pipeline_options = PdfPipelineOptions() pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE pipeline_options.generate_page_images = True pipeline_options.generate_picture_images = True doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options) } ) conv_res = doc_converter.convert(input_doc_path) conv_doc = conv_res.document doc_filename = conv_res.input.file # Save markdown with embedded pictures in original text md_filename = output_dir / f\"{doc_filename}-with-images-orig.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED) for element, _level in conv_res.document.iterate_items(): if isinstance(element, TextItem): element.orig = element.text element.text = translate(text=element.text) elif isinstance(element, TableItem): for cell in element.data.table_cells: cell.text = translate(text=cell.text) # Save markdown with embedded pictures in translated text md_filename = output_dir / f\"{doc_filename}-with-images-translated.md\" conv_doc.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)"},{"location":"examples/visual_grounding/","title":"Visual grounding","text":"Step Tech Execution Embedding Hugging Face / Sentence Transformers \ud83d\udcbb Local Vector store Milvus \ud83d\udcbb Local Gen AI Hugging Face Inference API \ud83c\udf10 Remote <p>This example showcases Docling's visual grounding capabilities, which can be combined with any agentic AI / RAG framework.</p> <p>In this instance, we illustrate these capabilities leveraging the LangChain Docling integration, along with a Milvus vector store, as well as sentence-transformers embeddings.</p> <ul> <li>\ud83d\udc49 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use GPU-enabled runtime.</li> <li>The notebook uses Hugging Face's Inference API; for increased LLM quota, a token can be provided via the env var <code>HF_TOKEN</code>.</li> <li>Requirements can be installed as shown below (<code>--no-warn-conflicts</code> meant for Colab's pre-populated Python env; feel free to remove for stricter usage):</li> </ul> In\u00a0[1]: Copied! <pre>%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain matplotlib python-dotenv\n</pre> %pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain matplotlib python-dotenv <pre>Note: you may need to restart the kernel to use updated packages.\n</pre> In\u00a0[2]: Copied! 
<pre>import os\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom dotenv import load_dotenv\nfrom langchain_core.prompts import PromptTemplate\nfrom langchain_docling.loader import ExportType\n\n\ndef _get_env_from_colab_or_os(key):\n try:\n from google.colab import userdata\n\n try:\n return userdata.get(key)\n except userdata.SecretNotFoundError:\n pass\n except ImportError:\n pass\n return os.getenv(key)\n\n\nload_dotenv()\n\n# https://github.com/huggingface/transformers/issues/5486:\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\nHF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\")\nSOURCES = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report\nEMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\nGEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\nQUESTION = \"Which are the main AI models in Docling?\"\nPROMPT = PromptTemplate.from_template(\n \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n)\nTOP_K = 3\nMILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")\n</pre> import os from pathlib import Path from tempfile import mkdtemp from dotenv import load_dotenv from langchain_core.prompts import PromptTemplate from langchain_docling.loader import ExportType def _get_env_from_colab_or_os(key): try: from google.colab import userdata try: return userdata.get(key) except userdata.SecretNotFoundError: pass except ImportError: pass return os.getenv(key) load_dotenv() # https://github.com/huggingface/transformers/issues/5486: os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" HF_TOKEN = _get_env_from_colab_or_os(\"HF_TOKEN\") SOURCES = [\"https://arxiv.org/pdf/2408.09869\"] # Docling Technical Report EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\" GEN_MODEL_ID = \"mistralai/Mixtral-8x7B-Instruct-v0.1\" QUESTION = \"Which are the main AI models in Docling?\" PROMPT = PromptTemplate.from_template( \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\", ) TOP_K = 3 MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\") In\u00a0[3]: Copied! <pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_options=PdfPipelineOptions(\n generate_page_images=True,\n images_scale=2.0,\n ),\n )\n }\n)\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import PdfPipelineOptions from docling.document_converter import DocumentConverter, PdfFormatOption converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=PdfPipelineOptions( generate_page_images=True, images_scale=2.0, ), ) } ) <p>We set up a simple doc store for keeping converted documents, as that is needed for visual grounding further below.</p> In\u00a0[4]: Copied! 
<pre>doc_store = {}\ndoc_store_root = Path(mkdtemp())\nfor source in SOURCES:\n dl_doc = converter.convert(source=source).document\n file_path = Path(doc_store_root / f\"{dl_doc.origin.binary_hash}.json\")\n dl_doc.save_as_json(file_path)\n doc_store[dl_doc.origin.binary_hash] = file_path\n</pre> doc_store = {} doc_store_root = Path(mkdtemp()) for source in SOURCES: dl_doc = converter.convert(source=source).document file_path = Path(doc_store_root / f\"{dl_doc.origin.binary_hash}.json\") dl_doc.save_as_json(file_path) doc_store[dl_doc.origin.binary_hash] = file_path <p>Now we can instantiate our loader and load documents.</p> In\u00a0[5]: Copied! <pre>from langchain_docling import DoclingLoader\n\nfrom docling.chunking import HybridChunker\n\nloader = DoclingLoader(\n file_path=SOURCES,\n converter=converter,\n export_type=ExportType.DOC_CHUNKS,\n chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),\n)\n\ndocs = loader.load()\n</pre> from langchain_docling import DoclingLoader from docling.chunking import HybridChunker loader = DoclingLoader( file_path=SOURCES, converter=converter, export_type=ExportType.DOC_CHUNKS, chunker=HybridChunker(tokenizer=EMBED_MODEL_ID), ) docs = loader.load() <pre>Token indices sequence length is longer than the specified maximum sequence length for this model (648 > 512). Running this sequence through the model will result in indexing errors\n</pre> <p>\ud83d\udc49 NOTE: As you see above, using the <code>HybridChunker</code> can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" \u2014 for details check here.</p> <p>Inspecting some sample splits:</p> In\u00a0[6]: Copied! <pre>for d in docs[:3]:\n print(f\"- {d.page_content=}\")\nprint(\"...\")\n</pre> for d in docs[:3]: print(f\"- {d.page_content=}\") print(\"...\") <pre>- d.page_content='Docling Technical Report\\nVersion 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R\u00a8 uschlikon, Switzerland'\n- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n- d.page_content='1 Introduction\\nConverting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and printing-optimized characteristic, which discards most structural features and metadata. With the advent of LLMs and popular application patterns such as retrieval-augmented generation (RAG), leveraging the rich content embedded in PDFs has become ever more relevant. In the past decade, several powerful document understanding solutions have emerged on the market, most of which are commercial software, cloud offerings [3] and most recently, multi-modal vision-language models. 
As of today, only a handful of open-source tools cover PDF conversion, leaving a significant feature and quality gap to proprietary solutions.\\nWith Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.\\nHere is what Docling delivers today:\\n\u00b7 Converts PDF documents to JSON or Markdown format, stable and lightning fast\\n\u00b7 Understands detailed page layout, reading order, locates figures and recovers table structures\\n\u00b7 Extracts metadata from the document, such as title, authors, references and language\\n\u00b7 Optionally applies OCR, e.g. for scanned PDFs\\n\u00b7 Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)\\n\u00b7 Can leverage different accelerators (GPU, MPS, etc).'\n...\n</pre> In\u00a0[7]: Copied! <pre>import json\nfrom pathlib import Path\nfrom tempfile import mkdtemp\n\nfrom langchain_huggingface.embeddings import HuggingFaceEmbeddings\nfrom langchain_milvus import Milvus\n\nembedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n\n\nmilvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed\nvectorstore = Milvus.from_documents(\n documents=docs,\n embedding=embedding,\n collection_name=\"docling_demo\",\n connection_args={\"uri\": milvus_uri},\n index_params={\"index_type\": \"FLAT\"},\n drop_old=True,\n)\n</pre> import json from pathlib import Path from tempfile import mkdtemp from langchain_huggingface.embeddings import HuggingFaceEmbeddings from langchain_milvus import Milvus embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID) milvus_uri = str(Path(mkdtemp()) / \"docling.db\") # or set as needed vectorstore = Milvus.from_documents( documents=docs, embedding=embedding, collection_name=\"docling_demo\", connection_args={\"uri\": milvus_uri}, index_params={\"index_type\": \"FLAT\"}, drop_old=True, ) In\u00a0[8]: Copied! <pre>from langchain.chains import create_retrieval_chain\nfrom langchain.chains.combine_documents import create_stuff_documents_chain\nfrom langchain_huggingface import HuggingFaceEndpoint\n\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\nllm = HuggingFaceEndpoint(\n repo_id=GEN_MODEL_ID,\n huggingfacehub_api_token=HF_TOKEN,\n)\n\n\ndef clip_text(text, threshold=100):\n return f\"{text[:threshold]}...\" if len(text) > threshold else text\n</pre> from langchain.chains import create_retrieval_chain from langchain.chains.combine_documents import create_stuff_documents_chain from langchain_huggingface import HuggingFaceEndpoint retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K}) llm = HuggingFaceEndpoint( repo_id=GEN_MODEL_ID, huggingfacehub_api_token=HF_TOKEN, ) def clip_text(text, threshold=100): return f\"{text[:threshold]}...\" if len(text) > threshold else text <pre>Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.\n</pre> In\u00a0[9]: Copied! 
<pre>from docling.chunking import DocMeta\nfrom docling.datamodel.document import DoclingDocument\n\nquestion_answer_chain = create_stuff_documents_chain(llm, PROMPT)\nrag_chain = create_retrieval_chain(retriever, question_answer_chain)\nresp_dict = rag_chain.invoke({\"input\": QUESTION})\n\nclipped_answer = clip_text(resp_dict[\"answer\"], threshold=200)\nprint(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\n</pre> from docling.chunking import DocMeta from docling.datamodel.document import DoclingDocument question_answer_chain = create_stuff_documents_chain(llm, PROMPT) rag_chain = create_retrieval_chain(retriever, question_answer_chain) resp_dict = rag_chain.invoke({\"input\": QUESTION}) clipped_answer = clip_text(resp_dict[\"answer\"], threshold=200) print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\") <pre>/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.12/site-packages/huggingface_hub/utils/_deprecation.py:131: FutureWarning: 'post' (from 'huggingface_hub.inference._client') is deprecated and will be removed from version '0.31.0'. Making direct POST requests to the inference server is not supported anymore. Please use task methods instead (e.g. `InferenceClient.chat_completion`). If your use case is not supported, please open an issue in https://github.com/huggingface/huggingface_hub.\n  warnings.warn(warning_message, FutureWarning)\n</pre> <pre>Question:\nWhich are the main AI models in Docling?\n\nAnswer:\nThe main AI models in Docling are:\n1. A layout analysis model, an accurate object-detector for page elements.\n2. TableFormer, a state-of-the-art table structure recognition model.\n</pre> In\u00a0[10]: Copied! <pre>import matplotlib.pyplot as plt\nfrom PIL import ImageDraw\n\nfor i, doc in enumerate(resp_dict[\"context\"][:]):\n    image_by_page = {}\n    print(f\"Source {i + 1}:\")\n    print(f\"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n    meta = DocMeta.model_validate(doc.metadata[\"dl_meta\"])\n\n    # loading the full DoclingDocument from the document store:\n    dl_doc = DoclingDocument.load_from_json(doc_store.get(meta.origin.binary_hash))\n\n    for doc_item in meta.doc_items:\n        if doc_item.prov:\n            prov = doc_item.prov[0]  # here we only consider the first provenance item\n            page_no = prov.page_no\n            if img := image_by_page.get(page_no):\n                pass\n            else:\n                page = dl_doc.pages[prov.page_no]\n                print(f\"  page: {prov.page_no}\")\n                img = page.image.pil_image\n                image_by_page[page_no] = img\n            bbox = prov.bbox.to_top_left_origin(page_height=page.size.height)\n            bbox = bbox.normalized(page.size)\n            thickness = 2\n            padding = thickness + 2\n            bbox.l = round(bbox.l * img.width - padding)\n            bbox.r = round(bbox.r * img.width + padding)\n            bbox.t = round(bbox.t * img.height - padding)\n            bbox.b = round(bbox.b * img.height + padding)\n            draw = ImageDraw.Draw(img)\n            draw.rectangle(\n                xy=bbox.as_tuple(),\n                outline=\"blue\",\n                width=thickness,\n            )\n    for p in image_by_page:\n        img = image_by_page[p]\n        plt.figure(figsize=[15, 15])\n        plt.imshow(img)\n        plt.axis(\"off\")\n        plt.show()\n</pre> import matplotlib.pyplot as plt from PIL import ImageDraw for i, doc in enumerate(resp_dict[\"context\"][:]): image_by_page = {} print(f\"Source {i + 1}:\") print(f\"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}\") meta = DocMeta.model_validate(doc.metadata[\"dl_meta\"]) # loading the full DoclingDocument from the document store: dl_doc = DoclingDocument.load_from_json(doc_store.get(meta.origin.binary_hash)) for doc_item in meta.doc_items: 
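# the chunk metadata's doc_items point back to the items (and their page provenance) in the original DoclingDocument 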
if doc_item.prov: prov = doc_item.prov[0] # here we only consider the first provenance item page_no = prov.page_no if img := image_by_page.get(page_no): pass else: page = dl_doc.pages[prov.page_no] print(f\"  page: {prov.page_no}\") img = page.image.pil_image image_by_page[page_no] = img bbox = prov.bbox.to_top_left_origin(page_height=page.size.height) bbox = bbox.normalized(page.size) thickness = 2 padding = thickness + 2 bbox.l = round(bbox.l * img.width - padding) bbox.r = round(bbox.r * img.width + padding) bbox.t = round(bbox.t * img.height - padding) bbox.b = round(bbox.b * img.height + padding) draw = ImageDraw.Draw(img) draw.rectangle( xy=bbox.as_tuple(), outline=\"blue\", width=thickness, ) for p in image_by_page: img = image_by_page[p] plt.figure(figsize=[15, 15]) plt.imshow(img) plt.axis(\"off\") plt.show() <pre>Source 1:\n  text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n  page: 3\n</pre> <pre>Source 2:\n  text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n  page: 2\n</pre> <pre>Source 3:\n  text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n  page: 5\n</pre> In\u00a0[\u00a0]: Copied! <pre>\n</pre>"},{"location":"examples/visual_grounding/#visual-grounding","title":"Visual grounding\u00b6","text":""},{"location":"examples/visual_grounding/#setup","title":"Setup\u00b6","text":""},{"location":"examples/visual_grounding/#document-store-setup","title":"Document store setup\u00b6","text":""},{"location":"examples/visual_grounding/#document-loading","title":"Document loading\u00b6","text":"<p>We first define our converter, in this case including options for keeping page images (for visual grounding).</p>"},{"location":"examples/visual_grounding/#ingestion","title":"Ingestion\u00b6","text":""},{"location":"examples/visual_grounding/#rag","title":"RAG\u00b6","text":""},{"location":"examples/visual_grounding/#visual-grounding","title":"Visual grounding\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/","title":"VLM pipeline with remote model","text":"In\u00a0[\u00a0]: Copied! <pre>import logging\nimport os\nfrom pathlib import Path\nfrom typing import Optional\n</pre> import logging import os from pathlib import Path from typing import Optional In\u00a0[\u00a0]: Copied! <pre>import requests\nfrom docling_core.types.doc.page import SegmentedPage\nfrom dotenv import load_dotenv\n</pre> import requests from docling_core.types.doc.page import SegmentedPage from dotenv import load_dotenv In\u00a0[\u00a0]: Copied! 
<pre>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n</pre> from docling.datamodel.base_models import InputFormat from docling.datamodel.pipeline_options import ( VlmPipelineOptions, ) from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat from docling.document_converter import DocumentConverter, PdfFormatOption from docling.pipeline.vlm_pipeline import VlmPipeline In\u00a0[\u00a0]: Copied! <pre>def lms_vlm_options(model: str, prompt: str, format: ResponseFormat):\n options = ApiVlmOptions(\n url=\"http://localhost:1234/v1/chat/completions\", # the default LM Studio\n params=dict(\n model=model,\n ),\n prompt=prompt,\n timeout=90,\n scale=1.0,\n response_format=format,\n )\n return options\n</pre> def lms_vlm_options(model: str, prompt: str, format: ResponseFormat): options = ApiVlmOptions( url=\"http://localhost:1234/v1/chat/completions\", # the default LM Studio params=dict( model=model, ), prompt=prompt, timeout=90, scale=1.0, response_format=format, ) return options In\u00a0[\u00a0]: Copied! <pre>def lms_olmocr_vlm_options(model: str):\n def _dynamic_olmocr_prompt(page: Optional[SegmentedPage]):\n if page is None:\n return (\n \"Below is the image of one page of a document. Just return the plain text\"\n \" representation of this document as if you were reading it naturally.\\n\"\n \"Do not hallucinate.\\n\"\n )\n\n anchor = [\n f\"Page dimensions: {int(page.dimension.width)}x{int(page.dimension.height)}\"\n ]\n\n for text_cell in page.textline_cells:\n if not text_cell.text.strip():\n continue\n bbox = text_cell.rect.to_bounding_box().to_bottom_left_origin(\n page.dimension.height\n )\n anchor.append(f\"[{int(bbox.l)}x{int(bbox.b)}] {text_cell.text}\")\n\n for image_cell in page.bitmap_resources:\n bbox = image_cell.rect.to_bounding_box().to_bottom_left_origin(\n page.dimension.height\n )\n anchor.append(\n f\"[Image {int(bbox.l)}x{int(bbox.b)} to {int(bbox.r)}x{int(bbox.t)}]\"\n )\n\n if len(anchor) == 1:\n anchor.append(\n f\"[Image 0x0 to {int(page.dimension.width)}x{int(page.dimension.height)}]\"\n )\n\n # Original prompt uses cells sorting. We are skipping it in this demo.\n\n base_text = \"\\n\".join(anchor)\n\n return (\n f\"Below is the image of one page of a document, as well as some raw textual\"\n f\" content that was previously extracted for it. Just return the plain text\"\n f\" representation of this document as if you were reading it naturally.\\n\"\n f\"Do not hallucinate.\\n\"\n f\"RAW_TEXT_START\\n{base_text}\\nRAW_TEXT_END\"\n )\n\n options = ApiVlmOptions(\n url=\"http://localhost:1234/v1/chat/completions\",\n params=dict(\n model=model,\n ),\n prompt=_dynamic_olmocr_prompt,\n timeout=90,\n scale=1.0,\n max_size=1024, # from OlmOcr pipeline\n response_format=ResponseFormat.MARKDOWN,\n )\n return options\n</pre> def lms_olmocr_vlm_options(model: str): def _dynamic_olmocr_prompt(page: Optional[SegmentedPage]): if page is None: return ( \"Below is the image of one page of a document. 
Just return the plain text\" \" representation of this document as if you were reading it naturally.\\n\" \"Do not hallucinate.\\n\" ) anchor = [ f\"Page dimensions: {int(page.dimension.width)}x{int(page.dimension.height)}\" ] for text_cell in page.textline_cells: if not text_cell.text.strip(): continue bbox = text_cell.rect.to_bounding_box().to_bottom_left_origin( page.dimension.height ) anchor.append(f\"[{int(bbox.l)}x{int(bbox.b)}] {text_cell.text}\") for image_cell in page.bitmap_resources: bbox = image_cell.rect.to_bounding_box().to_bottom_left_origin( page.dimension.height ) anchor.append( f\"[Image {int(bbox.l)}x{int(bbox.b)} to {int(bbox.r)}x{int(bbox.t)}]\" ) if len(anchor) == 1: anchor.append( f\"[Image 0x0 to {int(page.dimension.width)}x{int(page.dimension.height)}]\" ) # Original prompt uses cells sorting. We are skipping it in this demo. base_text = \"\\n\".join(anchor) return ( f\"Below is the image of one page of a document, as well as some raw textual\" f\" content that was previously extracted for it. Just return the plain text\" f\" representation of this document as if you were reading it naturally.\\n\" f\"Do not hallucinate.\\n\" f\"RAW_TEXT_START\\n{base_text}\\nRAW_TEXT_END\" ) options = ApiVlmOptions( url=\"http://localhost:1234/v1/chat/completions\", params=dict( model=model, ), prompt=_dynamic_olmocr_prompt, timeout=90, scale=1.0, max_size=1024, # from OlmOcr pipeline response_format=ResponseFormat.MARKDOWN, ) return options In\u00a0[\u00a0]: Copied! <pre>def ollama_vlm_options(model: str, prompt: str):\n options = ApiVlmOptions(\n url=\"http://localhost:11434/v1/chat/completions\", # the default Ollama endpoint\n params=dict(\n model=model,\n ),\n prompt=prompt,\n timeout=90,\n scale=1.0,\n response_format=ResponseFormat.MARKDOWN,\n )\n return options\n</pre> def ollama_vlm_options(model: str, prompt: str): options = ApiVlmOptions( url=\"http://localhost:11434/v1/chat/completions\", # the default Ollama endpoint params=dict( model=model, ), prompt=prompt, timeout=90, scale=1.0, response_format=ResponseFormat.MARKDOWN, ) return options In\u00a0[\u00a0]: Copied! 
<pre>def watsonx_vlm_options(model: str, prompt: str):\n    load_dotenv()\n    api_key = os.environ.get(\"WX_API_KEY\")\n    project_id = os.environ.get(\"WX_PROJECT_ID\")\n\n    def _get_iam_access_token(api_key: str) -> str:\n        res = requests.post(\n            url=\"https://iam.cloud.ibm.com/identity/token\",\n            headers={\n                \"Content-Type\": \"application/x-www-form-urlencoded\",\n            },\n            data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\",\n        )\n        res.raise_for_status()\n        api_out = res.json()\n        print(f\"{api_out=}\")\n        return api_out[\"access_token\"]\n\n    options = ApiVlmOptions(\n        url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\",\n        params=dict(\n            model_id=model,\n            project_id=project_id,\n            parameters=dict(\n                max_new_tokens=400,\n            ),\n        ),\n        headers={\n            \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key),\n        },\n        prompt=prompt,\n        timeout=60,\n        response_format=ResponseFormat.MARKDOWN,\n    )\n    return options\n</pre> def watsonx_vlm_options(model: str, prompt: str): load_dotenv() api_key = os.environ.get(\"WX_API_KEY\") project_id = os.environ.get(\"WX_PROJECT_ID\") def _get_iam_access_token(api_key: str) -> str: res = requests.post( url=\"https://iam.cloud.ibm.com/identity/token\", headers={ \"Content-Type\": \"application/x-www-form-urlencoded\", }, data=f\"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}\", ) res.raise_for_status() api_out = res.json() print(f\"{api_out=}\") return api_out[\"access_token\"] options = ApiVlmOptions( url=\"https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29\", params=dict( model_id=model, project_id=project_id, parameters=dict( max_new_tokens=400, ), ), headers={ \"Authorization\": \"Bearer \" + _get_iam_access_token(api_key=api_key), }, prompt=prompt, timeout=60, response_format=ResponseFormat.MARKDOWN, ) return options 
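<p>Note: the watsonx.ai helper above reads <code>WX_API_KEY</code> and <code>WX_PROJECT_ID</code> from the environment (or a <code>.env</code> file), so they must be set before running; the values below are placeholders:</p> <pre><code>export WX_API_KEY=your-api-key\nexport WX_PROJECT_ID=your-project-id\n</code></pre> 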
In\u00a0[\u00a0]: Copied! <pre>def main():\n    logging.basicConfig(level=logging.INFO)\n\n    data_folder = Path(__file__).parent / \"../../tests/data\"\n    input_doc_path = data_folder / \"pdf/2305.03393v1-pg9.pdf\"\n\n    pipeline_options = VlmPipelineOptions(\n        enable_remote_services=True  # <-- this is required!\n    )\n\n    # ApiVlmOptions() allows interfacing with APIs that support\n    # the multi-modal chat interface. Below are a few examples of how to configure them.\n\n    # One possibility is self-hosting a model, e.g. via LM Studio, Ollama or others.\n\n    # Example using the SmolDocling model with LM Studio (active by default):\n    pipeline_options.vlm_options = lms_vlm_options(\n        model=\"smoldocling-256m-preview-mlx-docling-snap\",\n        prompt=\"Convert this page to docling.\",\n        format=ResponseFormat.DOCTAGS,\n    )\n\n    # Example using the Granite Vision model with LM Studio:\n    # (uncomment the following lines)\n    # pipeline_options.vlm_options = lms_vlm_options(\n    #     model=\"granite-vision-3.2-2b\",\n    #     prompt=\"OCR the full page to markdown.\",\n    #     format=ResponseFormat.MARKDOWN,\n    # )\n\n    # Example using the OlmOcr (dynamic prompt) model with LM Studio:\n    # (uncomment the following lines)\n    # pipeline_options.vlm_options = lms_olmocr_vlm_options(\n    #     model=\"hf.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF\",\n    # )\n\n    # Example using the Granite Vision model with Ollama:\n    # (uncomment the following lines)\n    # pipeline_options.vlm_options = ollama_vlm_options(\n    #     model=\"granite3.2-vision:2b\",\n    #     prompt=\"OCR the full page to markdown.\",\n    # )\n\n    # Another possibility is using online services, e.g. watsonx.ai.\n    # Using it requires setting the env variables WX_API_KEY and WX_PROJECT_ID.\n    # (uncomment the following lines)\n    # pipeline_options.vlm_options = watsonx_vlm_options(\n    #     model=\"ibm/granite-vision-3-2-2b\", prompt=\"OCR the full page to markdown.\"\n    # )\n\n    # Create the DocumentConverter and launch the conversion.\n    doc_converter = DocumentConverter(\n        format_options={\n            InputFormat.PDF: PdfFormatOption(\n                pipeline_options=pipeline_options,\n                pipeline_cls=VlmPipeline,\n            )\n        }\n    )\n    result = doc_converter.convert(input_doc_path)\n    print(result.document.export_to_markdown())\n</pre> def main(): logging.basicConfig(level=logging.INFO) data_folder = Path(__file__).parent / \"../../tests/data\" input_doc_path = data_folder / \"pdf/2305.03393v1-pg9.pdf\" pipeline_options = VlmPipelineOptions( enable_remote_services=True # <-- this is required! ) # ApiVlmOptions() allows interfacing with APIs that support # the multi-modal chat interface. Below are a few examples of how to configure them. # One possibility is self-hosting a model, e.g. via LM Studio, Ollama or others. # Example using the SmolDocling model with LM Studio (active by default): pipeline_options.vlm_options = lms_vlm_options( model=\"smoldocling-256m-preview-mlx-docling-snap\", prompt=\"Convert this page to docling.\", format=ResponseFormat.DOCTAGS, ) # Example using the Granite Vision model with LM Studio: # (uncomment the following lines) # pipeline_options.vlm_options = lms_vlm_options( # model=\"granite-vision-3.2-2b\", # prompt=\"OCR the full page to markdown.\", # format=ResponseFormat.MARKDOWN, # ) # Example using the OlmOcr (dynamic prompt) model with LM Studio: # (uncomment the following lines) # pipeline_options.vlm_options = lms_olmocr_vlm_options( # model=\"hf.co/lmstudio-community/olmOCR-7B-0225-preview-GGUF\", # ) # Example using the Granite Vision model with Ollama: # (uncomment the following lines) # pipeline_options.vlm_options = ollama_vlm_options( # model=\"granite3.2-vision:2b\", # prompt=\"OCR the full page to markdown.\", # ) # Another possibility is using online services, e.g. watsonx.ai. # Using it requires setting the env variables WX_API_KEY and WX_PROJECT_ID. # (uncomment the following lines) # pipeline_options.vlm_options = watsonx_vlm_options( # model=\"ibm/granite-vision-3-2-2b\", prompt=\"OCR the full page to markdown.\" # ) # Create the DocumentConverter and launch the conversion. doc_converter = DocumentConverter( format_options={ InputFormat.PDF: PdfFormatOption( pipeline_options=pipeline_options, pipeline_cls=VlmPipeline, ) } ) result = doc_converter.convert(input_doc_path) print(result.document.export_to_markdown()) In\u00a0[\u00a0]: Copied! 
<pre>if __name__ == \"__main__\":\n main()\n</pre> if __name__ == \"__main__\": main()"},{"location":"examples/vlm_pipeline_api_model/#example-of-apivlmoptions-definitions","title":"Example of ApiVlmOptions definitions\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#using-lm-studio","title":"Using LM Studio\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#using-lm-studio-with-olmocr-model","title":"Using LM Studio with OlmOcr model\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#using-ollama","title":"Using Ollama\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#using-a-cloud-service-like-ibm-watsonxai","title":"Using a cloud service like IBM watsonx.ai\u00b6","text":""},{"location":"examples/vlm_pipeline_api_model/#usage-and-conversion","title":"Usage and conversion\u00b6","text":""},{"location":"faq/","title":"FAQ","text":"<p>This is a collection of FAQ collected from the user questions on https://github.com/docling-project/docling/discussions.</p> Is Python 3.13 supported? Install conflicts with numpy (python 3.13) Is macOS x86_64 supported? Are text styles (bold, underline, etc) supported? How do I run completely offline? Which model weights are needed to run Docling? SSL error downloading model weights Which OCR languages are supported? Some images are missing from MS Word and Powerpoint <code>HybridChunker</code> triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model' How to use flash attention?"},{"location":"faq/#is-python-313-supported","title":"Is Python 3.13 supported?","text":"<p>Python 3.13 is supported from Docling 2.18.0.</p>"},{"location":"faq/#install-conflicts-with-numpy-python-313","title":"Install conflicts with numpy (python 3.13)","text":"<p>When using <code>docling-ibm-models>=2.0.7</code> and <code>deepsearch-glm>=0.26.2</code> these issues should not show up anymore. Docling supports numpy versions <code>>=1.24.4,<3.0.0</code> which should match all usages.</p> <p>For older versions</p> <p>This has been observed installing docling and langchain via poetry.</p> <pre><code>...\nThus, docling (>=2.7.0,<3.0.0) requires numpy (>=1.26.4,<2.0.0).\nSo, because ... depends on both numpy (>=2.0.2,<3.0.0) and docling (^2.7.0), version solving failed.\n</code></pre> <p>Numpy is only adding Python 3.13 support starting in some 2.x.y version. In order to prepare for 3.13, Docling depends on a 2.x.y for 3.13, otherwise depending an 1.x.y version. If you are allowing 3.13 in your pyproject.toml, Poetry will try to find some way to reconcile Docling's numpy version for 3.13 (some 2.x.y) with LangChain's version for that (some 1.x.y) \u2014 leading to the error above.</p> <p>Check if Python 3.13 is among the Python versions allowed by your pyproject.toml and if so, remove it and try again. 
E.g., if you have python = \"^3.10\", use python = \">=3.10,<3.13\" instead.</p> <p>If you want to retain compatibility with python 3.9-3.13, you can also use a selector in pyproject.toml similar to the following</p> <pre><code>numpy = [\n { version = \"^2.1.0\", markers = 'python_version >= \"3.13\"' },\n { version = \"^1.24.4\", markers = 'python_version < \"3.13\"' },\n]\n</code></pre> <p>Source: Issue #283</p>"},{"location":"faq/#is-macos-x86_64-supported","title":"Is macOS x86_64 supported?","text":"<p>Yes, Docling (still) supports running the standard pipeline on macOS x86_64.</p> <p>However, users might get into a combination of incompatible dependencies on a fresh install. Because Docling depends on PyTorch which dropped support for macOS x86_64 after the 2.2.2 release, and this old version of PyTorch works only with NumPy 1.x, users must ensure the correct NumPy version is running.</p> <pre><code>pip install docling \"numpy<2.0.0\"\n</code></pre> <p>Source: Issue #1694.</p>"},{"location":"faq/#are-text-styles-bold-underline-etc-supported","title":"Are text styles (bold, underline, etc) supported?","text":"<p>Currently text styles are not supported in the <code>DoclingDocument</code> format. If you are interest in contributing this feature, please open a discussion topic to brainstorm on the design.</p> <p>Note: this is not a simple topic</p>"},{"location":"faq/#how-do-i-run-completely-offline","title":"How do I run completely offline?","text":"<p>Docling is not using any remote service, hence it can run in completely isolated air-gapped environments.</p> <p>The only requirement is pointing the Docling runtime to the location where the model artifacts have been stored.</p> <p>For example</p> <pre><code>pipeline_options = PdfPipelineOptions(artifacts_path=\"your location\")\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n</code></pre> <p>Source: Issue #326</p>"},{"location":"faq/#which-model-weights-are-needed-to-run-docling","title":"Which model weights are needed to run Docling?","text":"<p>Model weights are needed for the AI models used in the PDF pipeline. Other document types (docx, pptx, etc) do not have any such requirement.</p> <p>For processing PDF documents, Docling requires the model weights from https://huggingface.co/ds4sd/docling-models.</p> <p>When OCR is enabled, some engines also require model artifacts. For example EasyOCR, for which Docling has special pipeline options to control the runtime behavior.</p>"},{"location":"faq/#ssl-error-downloading-model-weights","title":"SSL error downloading model weights","text":"<pre><code>URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)>\n</code></pre> <p>Similar SSL download errors have been observed by some users. This happens when model weights are fetched from Hugging Face. The error could happen when the python environment doesn't have an up-to-date list of trusted certificates.</p> <p>Possible solutions were</p> <ul> <li>Update to the latest version of certifi, i.e. 
"},{"location":"faq/#ssl-error-downloading-model-weights","title":"SSL error downloading model weights","text":"<pre><code>URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)>\n</code></pre> <p>Similar SSL download errors have been observed by some users. They happen when model weights are fetched from Hugging Face and the Python environment doesn't have an up-to-date list of trusted certificates.</p> <p>Possible solutions are:</p> <ul> <li>Update to the latest version of certifi, i.e. <code>pip install --upgrade certifi</code></li> <li>Use pip-system-certs to use the latest trusted certificates on your system.</li> <li>Set environment variables <code>SSL_CERT_FILE</code> and <code>REQUESTS_CA_BUNDLE</code> to the value of <code>python -m certifi</code>: <pre><code>CERT_PATH=$(python -m certifi)\nexport SSL_CERT_FILE=${CERT_PATH}\nexport REQUESTS_CA_BUNDLE=${CERT_PATH}\n</code></pre></li> </ul>"},{"location":"faq/#which-ocr-languages-are-supported","title":"Which OCR languages are supported?","text":"<p>Docling supports multiple OCR engines, each of which has its own list of supported languages. Here is a collection of links to the original OCR engines' documentation listing the supported languages.</p> <ul> <li>EasyOCR</li> <li>Tesseract</li> <li>RapidOCR</li> <li>Mac OCR</li> </ul> <p>Setting the OCR language in Docling is done via the OCR pipeline options:</p> <pre><code>from docling.datamodel.pipeline_options import PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.ocr_options.lang = [\"fr\", \"de\", \"es\", \"en\"]  # example of languages for EasyOCR\n</code></pre>"},{"location":"faq/#some-images-are-missing-from-ms-word-and-powerpoint","title":"Some images are missing from MS Word and Powerpoint","text":"<p>The image processing library used by Docling can handle embedded WMF images only on the Windows platform. If you are on another operating system, these images will be ignored.</p>"},{"location":"faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model","title":"<code>HybridChunker</code> triggers warning: 'Token indices sequence length is longer than the specified maximum sequence length for this model'","text":"<p>TLDR: In the context of the <code>HybridChunker</code>, this is a known & anticipated \"false alarm\".</p> <p>Details:</p> <p>Using the <code>HybridChunker</code> often triggers a warning like this:</p> <p>Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors</p> <p>This warning is emitted by transformers and means that indexing errors would occur if this sequence were actually run through the model, i.e. the problematic case arises only if one indeed passes the particular sequence through the (embedding) model.</p> <p>In our case though, this occurs as a \"false alarm\", since what happens is the following:</p> <ul> <li>the chunker invokes the tokenizer on a potentially long sequence (e.g. 531 tokens as mentioned in the warning) in order to count its tokens, i.e. to assess if it is short enough. At this point transformers already emits the warning above!</li> <li>whenever the sequence at hand is oversized, the chunker proceeds to split it (but the transformers warning has already been shown nonetheless)</li> </ul> <p>What is important is the actual token length of the produced chunks. 
The snippet below can be used for getting the actual maximum chunk size (for users wanting to confirm that this does not exceed the model limit):</p> <pre><code>chunk_max_len = 0\nfor i, chunk in enumerate(chunks):\n    ser_txt = chunker.serialize(chunk=chunk)\n    ser_tokens = len(tokenizer.tokenize(ser_txt))\n    if ser_tokens > chunk_max_len:\n        chunk_max_len = ser_tokens\n        print(f\"{i}\\t{ser_tokens}\\t{repr(ser_txt[:100])}...\")\nprint(f\"Longest chunk yielded: {chunk_max_len} tokens\")\nprint(f\"Model max length: {tokenizer.model_max_length}\")\n</code></pre> <p>Also see docling#725.</p> <p>Source: Issue docling-core#119</p>"},{"location":"faq/#how-to-use-flash-attention","title":"How to use flash attention?","text":"<p>When running models in Docling on CUDA devices, you can enable the usage of the Flash Attention 2 library.</p> <p>Using environment variables:</p> <pre><code>DOCLING_CUDA_USE_FLASH_ATTENTION2=1\n</code></pre> <p>Using code:</p> <pre><code>from docling.datamodel.accelerator_options import (\n    AcceleratorOptions,\n)\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\n\npipeline_options = VlmPipelineOptions(\n    accelerator_options=AcceleratorOptions(cuda_use_flash_attention2=True)\n)\n</code></pre> <p>This requires having the flash-attn package installed. Below are two alternative ways of installing it:</p> <pre><code># Building from sources (requires the CUDA dev environment)\npip install flash-attn\n\n# Using pre-built wheels (not available in all possible setups)\nFLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn\n</code></pre>"},{"location":"installation/","title":"Installation","text":"<p>To use Docling, simply install <code>docling</code> from your Python package manager, e.g. pip: <pre><code>pip install docling\n</code></pre></p> <p>Works on macOS, Linux, and Windows, with support for both x86_64 and arm64 architectures.</p> Alternative PyTorch distributions <p>The Docling models depend on the PyTorch library. Depending on your architecture, you might want to use a different distribution of <code>torch</code>. For example, you might want support for a different accelerator or for a CPU-only version. All the different ways of installing <code>torch</code> are listed on their website https://pytorch.org/.</p> <p>One common situation is installation on Linux systems with CPU-only support. In this case, we suggest installing Docling with the following options</p> <pre><code># Example for installing on Linux with CPU-only support\npip install docling --extra-index-url https://download.pytorch.org/whl/cpu\n</code></pre> Alternative OCR engines <p>Docling supports multiple OCR engines for processing scanned documents. The current version provides the following engines.</p> Engine Installation Usage EasyOCR Default in Docling or via <code>pip install easyocr</code>. <code>EasyOcrOptions</code> Tesseract System dependency. See description for Tesseract and Tesserocr below. <code>TesseractOcrOptions</code> Tesseract CLI System dependency. See description below. <code>TesseractCliOcrOptions</code> OcrMac System dependency. See description below. <code>OcrMacOptions</code> RapidOCR Extra feature, not included in the default Docling installation; can be installed via <code>pip install rapidocr_onnxruntime</code>. <code>RapidOcrOptions</code> OnnxTR Can be installed via the plugin system <code>pip install \"docling-ocr-onnxtr[cpu]\"</code>. Please take a look at docling-OCR-OnnxTR. 
<code>OnnxtrOcrOptions</code> <p>The Docling <code>DocumentConverter</code> lets you choose the OCR engine via the <code>ocr_options</code> settings. For example</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions, TesseractOcrOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_ocr = True\npipeline_options.ocr_options = TesseractOcrOptions()  # Use Tesseract\n\ndoc_converter = DocumentConverter(\n    format_options={\n        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n    }\n)\n</code></pre> <p>Tesseract installation</p> <p>Tesseract is a popular OCR engine available on most operating systems. For using this engine with Docling, Tesseract must be installed on your system, using the packaging tool of your choice. Below we provide example commands. After installing Tesseract, you are expected to provide the path to its language files using the <code>TESSDATA_PREFIX</code> environment variable (note that it must terminate with a slash <code>/</code>).</p> macOS (via Homebrew)Debian-basedRHEL <pre><code>brew install tesseract leptonica pkg-config\nTESSDATA_PREFIX=/opt/homebrew/share/tessdata/\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n</code></pre> <pre><code>apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config\nTESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n</code></pre> <pre><code>dnf install tesseract tesseract-devel tesseract-langpack-eng tesseract-osd leptonica-devel\nTESSDATA_PREFIX=/usr/share/tesseract/tessdata/\necho \"Set TESSDATA_PREFIX=${TESSDATA_PREFIX}\"\n</code></pre> <p>Linking to Tesseract The most efficient usage of the Tesseract library is via linking. Docling uses the Tesserocr package for this.</p> <p>If you run into installation issues with Tesserocr, we suggest using the following installation options:</p> <pre><code>pip uninstall tesserocr\npip install --no-binary :all: tesserocr\n</code></pre> <p>ocrmac installation</p> <p>ocrmac uses Apple's Vision (or LiveText) framework as its OCR backend. For using this engine with Docling, ocrmac must be installed on your system. This only works on macOS 10.15 or newer.</p> <pre><code>pip install ocrmac\n</code></pre> Installation on macOS Intel (x86_64) <p>When installing Docling on macOS with Intel processors, you might encounter errors with PyTorch compatibility. This happens because newer PyTorch versions (2.6.0+) no longer provide wheels for Intel-based Macs.</p> <p>If you're using an Intel Mac, install Docling with a compatible PyTorch version. Note: PyTorch 2.2.2 requires Python 3.12 or lower. 
Make sure you're not using Python 3.13+.</p> <pre><code># For uv users\nuv add torch==2.2.2 torchvision==0.17.2 docling\n\n# For pip users\npip install \"docling[mac_intel]\"\n\n# For Poetry users\npoetry add docling\n</code></pre>"},{"location":"installation/#development-setup","title":"Development setup","text":"<p>To develop Docling features, bugfixes etc., install as follows from your local clone's root dir:</p> <pre><code>uv sync --all-extras\n</code></pre>"},{"location":"integrations/","title":"Integrations","text":"<p>Use the navigation on the left to browse through Docling integrations with popular frameworks and tools.</p> <p> </p>"},{"location":"integrations/apify/","title":"Apify","text":"<p>You can run Docling in the cloud without installation using the Docling Actor on Apify platform. Simply provide a document URL and get the processed result:</p> <p></p> <pre><code>apify call vancura/docling -i '{\n \"options\": {\n \"to_formats\": [\"md\", \"json\", \"html\", \"text\", \"doctags\"]\n },\n \"http_sources\": [\n {\"url\": \"https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf\"},\n {\"url\": \"https://arxiv.org/pdf/2408.09869\"}\n ]\n}'\n</code></pre> <p>The Actor stores results in:</p> <ul> <li>Processed document in key-value store (<code>OUTPUT_RESULT</code>)</li> <li>Processing logs (<code>DOCLING_LOG</code>)</li> <li>Dataset record with result URL and status</li> </ul> <p>Read more about the Docling Actor, including how to use it via the Apify API and CLI.</p> <ul> <li>\ud83d\udcbb GitHub</li> <li>\ud83d\udcd6 Docs</li> <li>\ud83d\udce6 Docling Actor</li> </ul>"},{"location":"integrations/bee/","title":"Bee Agent Framework","text":"<p>Docling is available as an extraction backend in the Bee framework.</p> <ul> <li>\ud83d\udcbb Bee GitHub</li> <li>\ud83d\udcd6 Bee docs</li> <li>\ud83d\udce6 Bee NPM</li> </ul>"},{"location":"integrations/cloudera/","title":"Cloudera","text":"<p>Docling is available in Cloudera through the RAG Studio Accelerator for Machine Learning Projects (AMP).</p> <ul> <li>\ud83d\udcbb RAG Studio AMP GitHub</li> </ul>"},{"location":"integrations/crewai/","title":"Crew AI","text":"<p>Docling is available in CrewAI as the <code>CrewDoclingSource</code> knowledge source.</p> <ul> <li>\ud83d\udcbb Crew AI GitHub</li> <li>\ud83d\udcd6 Crew AI knowledge docs</li> <li>\ud83d\udce6 Crew AI PyPI</li> </ul>"},{"location":"integrations/data_prep_kit/","title":"Data Prep Kit","text":"<p>Docling is used by the Data Prep Kit open-source toolkit for preparing unstructured data for LLM application development ranging from laptop scale to datacenter scale.</p>"},{"location":"integrations/data_prep_kit/#components","title":"Components","text":""},{"location":"integrations/data_prep_kit/#pdf-ingestion-to-parquet","title":"PDF ingestion to Parquet","text":"<ul> <li>\ud83d\udcbb Docling2Parquet source</li> <li>\ud83d\udcd6 Docling2Parquet docs</li> </ul>"},{"location":"integrations/data_prep_kit/#document-chunking","title":"Document chunking","text":"<ul> <li>\ud83d\udcbb Doc Chunking source</li> <li>\ud83d\udcd6 Doc Chunking docs</li> </ul>"},{"location":"integrations/docetl/","title":"DocETL","text":"<p>Docling is available as a file conversion method in DocETL:</p> <ul> <li>\ud83d\udcbb DocETL GitHub</li> <li>\ud83d\udcd6 DocETL docs</li> <li>\ud83d\udce6 DocETL PyPI</li> </ul>"},{"location":"integrations/haystack/","title":"Haystack","text":"<p>Docling is available as a converter in Haystack:</p> <ul> <li>\ud83d\udcd6 Docling 
Haystack integration docs</li> <li>\ud83d\udcbb Docling Haystack integration GitHub</li> <li>\ud83e\uddd1\ud83c\udffd\u200d\ud83c\udf73 Docling Haystack integration example</li> <li>\ud83d\udce6 Docling Haystack integration PyPI</li> </ul>"},{"location":"integrations/instructlab/","title":"InstructLab","text":"<p>Docling is powering document processing in InstructLab, enabling users to unlock the knowledge hidden in documents and present it to InstructLab's fine-tuning for aligning AI models to the user's specific data.</p> <p>More details can be found in this blog post.</p> <ul> <li>\ud83c\udfe0 InstructLab home</li> <li>\ud83d\udcbb InstructLab GitHub</li> <li>\ud83e\uddd1\ud83c\udffb\u200d\ud83d\udcbb InstructLab UI</li> <li>\ud83d\udcd6 InstructLab docs</li> </ul>"},{"location":"integrations/kotaemon/","title":"Kotaemon","text":"<p>Docling is available in Kotaemon as the <code>DoclingReader</code> loader:</p> <ul> <li>\ud83d\udcbb Kotaemon GitHub</li> <li>\ud83d\udcd6 DoclingReader docs</li> <li>\u2699\ufe0f Docling setup in Kotaemon</li> </ul>"},{"location":"integrations/langchain/","title":"LangChain","text":"<p>Docling is available as an official LangChain extension.</p> <p>To get started, check out the step-by-step guide in LangChain.</p> <ul> <li>\ud83d\udcd6 LangChain Docling integration docs</li> <li>\ud83d\udcbb LangChain Docling integration GitHub</li> <li>\ud83e\uddd1\ud83c\udffd\u200d\ud83c\udf73 LangChain Docling integration example</li> <li>\ud83d\udce6 LangChain Docling integration PyPI</li> </ul>"},{"location":"integrations/llamaindex/","title":"LlamaIndex","text":"<p>Docling is available as an official LlamaIndex extension.</p> <p>To get started, check out the step-by-step guide in LlamaIndex.</p>"},{"location":"integrations/llamaindex/#components","title":"Components","text":""},{"location":"integrations/llamaindex/#docling-reader","title":"Docling Reader","text":"<p>Reads document files and uses Docling to populate LlamaIndex <code>Document</code> objects \u2014 either serializing Docling's data model (losslessly, e.g. as JSON) or exporting to a simplified format (lossily, e.g. as Markdown).</p> <ul> <li>\ud83d\udcbb Docling Reader GitHub</li> <li>\ud83d\udcd6 Docling Reader docs</li> <li>\ud83d\udce6 Docling Reader PyPI</li> </ul>"},{"location":"integrations/llamaindex/#docling-node-parser","title":"Docling Node Parser","text":"<p>Reads LlamaIndex <code>Document</code> objects populated in Docling's format by Docling Reader and, using its knowledge of the Docling format, parses them to LlamaIndex <code>Node</code> objects for downstream usage in LlamaIndex applications, e.g. as chunks for embedding.</p> <ul> <li>\ud83d\udcbb Docling Node Parser GitHub</li> <li>\ud83d\udcd6 Docling Node Parser docs</li> <li>\ud83d\udce6 Docling Node Parser PyPI</li> </ul>"},{"location":"integrations/nvidia/","title":"NVIDIA","text":"<p>Docling is powering the NVIDIA PDF to Podcast agentic AI blueprint:</p> <ul> <li>\ud83c\udfe0 PDF to Podcast home</li> <li>\ud83d\udcbb PDF to Podcast GitHub</li> <li>\ud83d\udce3 PDF to Podcast announcement</li> <li>\u270d\ufe0f PDF to Podcast blog post</li> </ul>"},{"location":"integrations/opencontracts/","title":"OpenContracts","text":"<p>Docling is available as an ingestion engine for OpenContracts, allowing you to use Docling's OCR engine(s), chunker(s), labels, etc. 
and load them into a platform supporting bulk data extraction, text annotation, and question-answering:</p> <ul> <li>\ud83d\udcbb OpenContracts GitHub</li> <li>\ud83d\udcd6 OpenContracts Docs</li> <li>\u25b6\ufe0f OpenContracts x Docling PDF annotation screen capture</li> </ul>"},{"location":"integrations/openwebui/","title":"Open WebUI","text":"<p>Docling is available as a plugin for Open WebUI.</p> <ul> <li>\ud83d\udcd6 Docs</li> <li>\ud83d\udcbb GitHub</li> </ul>"},{"location":"integrations/prodigy/","title":"Prodigy","text":"<p>Docling is available in Prodigy as a Prodigy-PDF plugin recipe.</p> <p>More details can be found in this blog post.</p> <ul> <li>\ud83c\udf10 Prodigy home</li> <li>\ud83d\udd0c Prodigy-PDF plugin</li> <li>\ud83e\uddd1\ud83c\udffd\u200d\ud83c\udf73 pdf-spans.manual recipe</li> </ul>"},{"location":"integrations/rhel_ai/","title":"RHEL AI","text":"<p>Docling is powering document processing in Red Hat Enterprise Linux AI (RHEL AI), enabling users to unlock the knowledge hidden in documents and present it to InstructLab's fine-tuning for aligning AI models to the user's specific data.</p> <ul> <li>\ud83d\udce3 RHEL AI 1.3 announcement</li> <li>\u270d\ufe0f RHEL blog posts:<ul> <li>RHEL AI 1.3 Docling context aware chunking: What you need to know</li> <li>Docling: The missing document processing companion for generative AI</li> </ul> </li> </ul>"},{"location":"integrations/spacy/","title":"spaCy","text":"<p>Docling is available in spaCy as the spaCy Layout plugin.</p> <p>More details can be found in this blog post.</p> <ul> <li>\ud83d\udcbb SpacyLayout GitHub</li> <li>\ud83d\udcd6 SpacyLayout docs</li> <li>\ud83d\udce6 SpacyLayout PyPI</li> </ul>"},{"location":"integrations/txtai/","title":"txtai","text":"<p>Docling is available as a text extraction backend for txtai.</p> <ul> <li>\ud83d\udcbb txtai GitHub</li> <li>\ud83d\udcd6 txtai docs</li> <li>\ud83d\udcd6 txtai Docling backend</li> </ul>"},{"location":"integrations/vectara/","title":"Vectara","text":"<p>Docling is available as a document parser in Vectara.</p> <ul> <li>\ud83d\udcbb Vectara GitHub org<ul> <li>vectara-ingest GitHub repo</li> </ul> </li> <li>\ud83d\udcd6 Vectara docs</li> </ul>"},{"location":"reference/cli/","title":"CLI reference","text":"<p>This page provides documentation for our command line tools.</p>"},{"location":"reference/cli/#docling","title":"docling","text":"<p>Usage:</p> <pre><code>docling [OPTIONS] source\n</code></pre> <p>Options:</p> Name Type Description Default <code>--from</code> choice (<code>docx</code> | <code>pptx</code> | <code>html</code> | <code>image</code> | <code>pdf</code> | <code>asciidoc</code> | <code>md</code> | <code>csv</code> | <code>xlsx</code> | <code>xml_uspto</code> | <code>xml_jats</code> | <code>json_docling</code> | <code>audio</code>) Specify input formats to convert from. Defaults to all formats. None <code>--to</code> choice (<code>md</code> | <code>json</code> | <code>html</code> | <code>html_split_page</code> | <code>text</code> | <code>doctags</code>) Specify output formats. Defaults to Markdown. None <code>--show-layout</code> / <code>--no-show-layout</code> boolean If enabled, the page images will show the bounding boxes of the items. 
<code>False</code> <code>--headers</code> text Specify HTTP request headers used when fetching URL input sources, in the form of a JSON string. None <code>--image-export-mode</code> choice (<code>placeholder</code> | <code>embedded</code> | <code>referenced</code>) Image export mode for the document (only in case of JSON, Markdown or HTML). With <code>placeholder</code>, only the position of the image is marked in the output. In <code>embedded</code> mode, the image is embedded as a base64-encoded string. In <code>referenced</code> mode, the image is exported in PNG format and referenced from the main exported document. <code>ImageRefMode.EMBEDDED</code> <code>--pipeline</code> choice (<code>standard</code> | <code>vlm</code> | <code>asr</code>) Choose the pipeline to process PDF or image files. <code>ProcessingPipeline.STANDARD</code> <code>--vlm-model</code> choice (<code>smoldocling</code> | <code>granite_vision</code> | <code>granite_vision_ollama</code>) Choose the VLM model to use with PDF or image files. <code>VlmModelType.SMOLDOCLING</code> <code>--asr-model</code> choice (<code>whisper_tiny</code> | <code>whisper_small</code> | <code>whisper_medium</code> | <code>whisper_base</code> | <code>whisper_large</code> | <code>whisper_turbo</code>) Choose the ASR model to use with audio/video files. <code>AsrModelType.WHISPER_TINY</code> <code>--ocr</code> / <code>--no-ocr</code> boolean If enabled, the bitmap content will be processed using OCR. <code>True</code> <code>--force-ocr</code> / <code>--no-force-ocr</code> boolean Replace any existing text with OCR-generated text over the full content. <code>False</code> <code>--ocr-engine</code> text The OCR engine to use. When --allow-external-plugins is not set, the available values are: easyocr, ocrmac, rapidocr, tesserocr, tesseract. Use the option --show-external-plugins to see the options allowed with external plugins. <code>easyocr</code> <code>--ocr-lang</code> text Provide a comma-separated list of languages used by the OCR engine. Note that each OCR engine has different values for the language names. None <code>--pdf-backend</code> choice (<code>pypdfium2</code> | <code>dlparse_v1</code> | <code>dlparse_v2</code> | <code>dlparse_v4</code>) The PDF backend to use. <code>PdfBackend.DLPARSE_V2</code> <code>--table-mode</code> choice (<code>fast</code> | <code>accurate</code>) The mode to use in the table structure model. <code>TableFormerMode.ACCURATE</code> <code>--enrich-code</code> / <code>--no-enrich-code</code> boolean Enable the code enrichment model in the pipeline. <code>False</code> <code>--enrich-formula</code> / <code>--no-enrich-formula</code> boolean Enable the formula enrichment model in the pipeline. <code>False</code> <code>--enrich-picture-classes</code> / <code>--no-enrich-picture-classes</code> boolean Enable the picture classification enrichment model in the pipeline. <code>False</code> <code>--enrich-picture-description</code> / <code>--no-enrich-picture-description</code> boolean Enable the picture description model in the pipeline. <code>False</code> <code>--artifacts-path</code> path If provided, the location of the model artifacts. None <code>--enable-remote-services</code> / <code>--no-enable-remote-services</code> boolean Must be enabled when using models connecting to remote services. <code>False</code> <code>--allow-external-plugins</code> / <code>--no-allow-external-plugins</code> boolean Must be enabled for loading modules from third-party plugins. 
<code>False</code> <code>--show-external-plugins</code> / <code>--no-show-external-plugins</code> boolean List the third-party plugins which are available when the option --allow-external-plugins is set. <code>False</code> <code>--abort-on-error</code> / <code>--no-abort-on-error</code> boolean If enabled, the processing will be aborted when the first error is encountered. <code>False</code> <code>--output</code> path Output directory where results are saved. <code>.</code> <code>--verbose</code>, <code>-v</code> integer Set the verbosity level. -v for info logging, -vv for debug logging. <code>0</code> <code>--debug-visualize-cells</code> / <code>--no-debug-visualize-cells</code> boolean Enable debug output which visualizes the PDF cells <code>False</code> <code>--debug-visualize-ocr</code> / <code>--no-debug-visualize-ocr</code> boolean Enable debug output which visualizes the OCR cells <code>False</code> <code>--debug-visualize-layout</code> / <code>--no-debug-visualize-layout</code> boolean Enable debug output which visualizes the layout clusters <code>False</code> <code>--debug-visualize-tables</code> / <code>--no-debug-visualize-tables</code> boolean Enable debug output which visualizes the table cells <code>False</code> <code>--version</code> boolean Show version information. None <code>--document-timeout</code> float The timeout for processing each document, in seconds. None <code>--num-threads</code> integer Number of threads <code>4</code> <code>--device</code> choice (<code>auto</code> | <code>cpu</code> | <code>cuda</code> | <code>mps</code>) Accelerator device <code>AcceleratorDevice.AUTO</code> <code>--logo</code> boolean Docling logo None <code>--help</code> boolean Show this message and exit. <code>False</code>"},{"location":"reference/docling_document/","title":"Docling Document","text":"<p>This is an automatically generated API reference of the DoclingDocument type.</p>"},{"location":"reference/docling_document/#docling_core.types.doc","title":"doc","text":"<p>Package for models defined by the Document type.</p> <p>Classes:</p> <ul> <li> <code>DoclingDocument</code> \u2013 <p>DoclingDocument.</p> </li> <li> <code>DocumentOrigin</code> \u2013 <p>FileSource.</p> </li> <li> <code>DocItem</code> \u2013 <p>DocItem.</p> </li> <li> <code>DocItemLabel</code> \u2013 <p>DocItemLabel.</p> </li> <li> <code>ProvenanceItem</code> \u2013 <p>ProvenanceItem.</p> </li> <li> <code>GroupItem</code> \u2013 <p>GroupItem.</p> </li> <li> <code>GroupLabel</code> \u2013 <p>GroupLabel.</p> </li> <li> <code>NodeItem</code> \u2013 <p>NodeItem.</p> </li> <li> <code>PageItem</code> \u2013 <p>PageItem.</p> </li> <li> <code>FloatingItem</code> \u2013 <p>FloatingItem.</p> </li> <li> <code>TextItem</code> \u2013 <p>TextItem.</p> </li> <li> <code>TableItem</code> \u2013 <p>TableItem.</p> </li> <li> <code>TableCell</code> \u2013 <p>TableCell.</p> </li> <li> <code>TableData</code> \u2013 <p>BaseTableData.</p> </li> <li> <code>TableCellLabel</code> \u2013 <p>TableCellLabel.</p> </li> <li> <code>KeyValueItem</code> \u2013 <p>KeyValueItem.</p> </li> <li> <code>SectionHeaderItem</code> \u2013 <p>SectionItem.</p> </li> <li> <code>PictureItem</code> \u2013 <p>PictureItem.</p> </li> <li> <code>ImageRef</code> \u2013 <p>ImageRef.</p> </li> <li> <code>PictureClassificationClass</code> \u2013 <p>PictureClassificationData.</p> </li> <li> <code>PictureClassificationData</code> \u2013 <p>PictureClassificationData.</p> </li> <li> <code>RefItem</code> \u2013 <p>RefItem.</p> </li> <li> <code>BoundingBox</code> \u2013 
<p>BoundingBox.</p> </li> <li> <code>CoordOrigin</code> \u2013 <p>CoordOrigin.</p> </li> <li> <code>ImageRefMode</code> \u2013 <p>ImageRefMode.</p> </li> <li> <code>Size</code> \u2013 <p>Size.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument","title":"DoclingDocument","text":"<p> Bases: <code>BaseModel</code></p> <p>DoclingDocument.</p> <p>Methods:</p> <ul> <li> <code>add_code</code> \u2013 <p>add_code.</p> </li> <li> <code>add_document</code> \u2013 <p>Adds the content from the body of a DoclingDocument to this document under a specific parent.</p> </li> <li> <code>add_form</code> \u2013 <p>add_form.</p> </li> <li> <code>add_formula</code> \u2013 <p>add_formula.</p> </li> <li> <code>add_group</code> \u2013 <p>add_group.</p> </li> <li> <code>add_heading</code> \u2013 <p>add_heading.</p> </li> <li> <code>add_inline_group</code> \u2013 <p>add_inline_group.</p> </li> <li> <code>add_key_values</code> \u2013 <p>add_key_values.</p> </li> <li> <code>add_list_group</code> \u2013 <p>add_list_group.</p> </li> <li> <code>add_list_item</code> \u2013 <p>add_list_item.</p> </li> <li> <code>add_node_items</code> \u2013 <p>Adds multiple NodeItems and their children under a parent in this document.</p> </li> <li> <code>add_ordered_list</code> \u2013 <p>add_ordered_list.</p> </li> <li> <code>add_page</code> \u2013 <p>add_page.</p> </li> <li> <code>add_picture</code> \u2013 <p>add_picture.</p> </li> <li> <code>add_table</code> \u2013 <p>add_table.</p> </li> <li> <code>add_text</code> \u2013 <p>add_text.</p> </li> <li> <code>add_title</code> \u2013 <p>add_title.</p> </li> <li> <code>add_unordered_list</code> \u2013 <p>add_unordered_list.</p> </li> <li> <code>append_child_item</code> \u2013 <p>Adds an item.</p> </li> <li> <code>check_version_is_compatible</code> \u2013 <p>Check if this document version is compatible with SDK schema version.</p> </li> <li> <code>delete_items</code> \u2013 <p>Deletes an item, given its instance or ref, and any children it has.</p> </li> <li> <code>delete_items_range</code> \u2013 <p>Deletes all NodeItems and their children in the range from the start NodeItem to the end NodeItem.</p> </li> <li> <code>export_to_dict</code> \u2013 <p>Export to dict.</p> </li> <li> <code>export_to_doctags</code> \u2013 <p>Exports the document content to a DocumentToken format.</p> </li> <li> <code>export_to_document_tokens</code> \u2013 <p>Export to DocTags format.</p> </li> <li> <code>export_to_element_tree</code> \u2013 <p>Export_to_element_tree.</p> </li> <li> <code>export_to_html</code> \u2013 <p>Serialize to HTML.</p> </li> <li> <code>export_to_markdown</code> \u2013 <p>Serialize to Markdown.</p> </li> <li> <code>export_to_text</code> \u2013 <p>export_to_text.</p> </li> <li> <code>extract_items_range</code> \u2013 <p>Extracts NodeItems and children in the range from the start NodeItem to the end as a new DoclingDocument.</p> </li> <li> <code>get_visualization</code> \u2013 <p>Get visualization of the document as images by page.</p> </li> <li> <code>insert_code</code> \u2013 <p>Creates a new CodeItem item and inserts it into the document.</p> </li> <li> <code>insert_document</code> \u2013 <p>Inserts the content from the body of a DoclingDocument into this document at a specific position.</p> </li> <li> <code>insert_form</code> \u2013 <p>Creates a new FormItem item and inserts it into the document.</p> </li> <li> <code>insert_formula</code> \u2013 <p>Creates a new FormulaItem item and inserts it into the document.</p> </li> <li> 
<code>insert_group</code> \u2013 <p>Creates a new GroupItem item and inserts it into the document.</p> </li> <li> <code>insert_heading</code> \u2013 <p>Creates a new SectionHeaderItem item and inserts it into the document.</p> </li> <li> <code>insert_inline_group</code> \u2013 <p>Creates a new InlineGroup item and inserts it into the document.</p> </li> <li> <code>insert_item_after_sibling</code> \u2013 <p>Inserts an item, given its node_item instance, after other as a sibling.</p> </li> <li> <code>insert_item_before_sibling</code> \u2013 <p>Inserts an item, given its node_item instance, before other as a sibling.</p> </li> <li> <code>insert_key_values</code> \u2013 <p>Creates a new KeyValueItem item and inserts it into the document.</p> </li> <li> <code>insert_list_group</code> \u2013 <p>Creates a new ListGroup item and inserts it into the document.</p> </li> <li> <code>insert_list_item</code> \u2013 <p>Creates a new ListItem item and inserts it into the document.</p> </li> <li> <code>insert_node_items</code> \u2013 <p>Insert multiple NodeItems and their children at a specific position in the document.</p> </li> <li> <code>insert_picture</code> \u2013 <p>Creates a new PictureItem item and inserts it into the document.</p> </li> <li> <code>insert_table</code> \u2013 <p>Creates a new TableItem item and inserts it into the document.</p> </li> <li> <code>insert_text</code> \u2013 <p>Creates a new TextItem item and inserts it into the document.</p> </li> <li> <code>insert_title</code> \u2013 <p>Creates a new TitleItem item and inserts it into the document.</p> </li> <li> <code>iterate_items</code> \u2013 <p>Iterate elements with level.</p> </li> <li> <code>load_from_doctags</code> \u2013 <p>Load Docling document from lists of DocTags and Images.</p> </li> <li> <code>load_from_json</code> \u2013 <p>load_from_json.</p> </li> <li> <code>load_from_yaml</code> \u2013 <p>load_from_yaml.</p> </li> <li> <code>num_pages</code> \u2013 <p>num_pages.</p> </li> <li> <code>print_element_tree</code> \u2013 <p>Print_element_tree.</p> </li> <li> <code>replace_item</code> \u2013 <p>Replace item with new item.</p> </li> <li> <code>save_as_doctags</code> \u2013 <p>Save the document content to DocTags format.</p> </li> <li> <code>save_as_document_tokens</code> \u2013 <p>Save the document content to a DocumentToken format.</p> </li> <li> <code>save_as_html</code> \u2013 <p>Save to HTML.</p> </li> <li> <code>save_as_json</code> \u2013 <p>Save as json.</p> </li> <li> <code>save_as_markdown</code> \u2013 <p>Save to markdown.</p> </li> <li> <code>save_as_yaml</code> \u2013 <p>Save as yaml.</p> </li> <li> <code>transform_to_content_layer</code> \u2013 <p>transform_to_content_layer.</p> </li> <li> <code>validate_document</code> \u2013 <p>validate_document.</p> </li> <li> <code>validate_misplaced_list_items</code> \u2013 <p>validate_misplaced_list_items.</p> </li> <li> <code>validate_tree</code> \u2013 <p>validate_tree.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>body</code> (<code>GroupItem</code>) \u2013 </li> <li> <code>form_items</code> (<code>List[FormItem]</code>) \u2013 </li> <li> <code>furniture</code> (<code>Annotated[GroupItem, Field(deprecated=True)]</code>) \u2013 </li> <li> <code>groups</code> (<code>List[Union[ListGroup, InlineGroup, GroupItem]]</code>) \u2013 </li> <li> <code>key_value_items</code> (<code>List[KeyValueItem]</code>) \u2013 </li> <li> <code>name</code> (<code>str</code>) \u2013 </li> <li> <code>origin</code> (<code>Optional[DocumentOrigin]</code>) \u2013 </li> <li> <code>pages</code> 
(<code>Dict[int, PageItem]</code>) \u2013 </li> <li> <code>pictures</code> (<code>List[PictureItem]</code>) \u2013 </li> <li> <code>schema_name</code> (<code>Literal['DoclingDocument']</code>) \u2013 </li> <li> <code>tables</code> (<code>List[TableItem]</code>) \u2013 </li> <li> <code>texts</code> (<code>List[Union[TitleItem, SectionHeaderItem, ListItem, CodeItem, FormulaItem, TextItem]]</code>) \u2013 </li> <li> <code>version</code> (<code>Annotated[str, StringConstraints(pattern=VERSION_PATTERN, strict=True)]</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.body","title":"body","text":"<pre><code>body: GroupItem = GroupItem(name='_root_', self_ref='#/body')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.form_items","title":"form_items","text":"<pre><code>form_items: List[FormItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.furniture","title":"furniture","text":"<pre><code>furniture: Annotated[GroupItem, Field(deprecated=True)] = GroupItem(name='_root_', self_ref='#/furniture', content_layer=FURNITURE)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.groups","title":"groups","text":"<pre><code>groups: List[Union[ListGroup, InlineGroup, GroupItem]] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.key_value_items","title":"key_value_items","text":"<pre><code>key_value_items: List[KeyValueItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.name","title":"name","text":"<pre><code>name: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.origin","title":"origin","text":"<pre><code>origin: Optional[DocumentOrigin] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.pages","title":"pages","text":"<pre><code>pages: Dict[int, PageItem] = {}\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.pictures","title":"pictures","text":"<pre><code>pictures: List[PictureItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.schema_name","title":"schema_name","text":"<pre><code>schema_name: Literal['DoclingDocument'] = 'DoclingDocument'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.tables","title":"tables","text":"<pre><code>tables: List[TableItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.texts","title":"texts","text":"<pre><code>texts: List[Union[TitleItem, SectionHeaderItem, ListItem, CodeItem, FormulaItem, TextItem]] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.version","title":"version","text":"<pre><code>version: Annotated[str, StringConstraints(pattern=VERSION_PATTERN, strict=True)] = CURRENT_VERSION\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_code","title":"add_code","text":"<pre><code>add_code(text: str, code_language: Optional[CodeLanguageLabel] = None, orig: Optional[str] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: 
Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n</code></pre> <p>add_code.</p> <p>Parameters:</p> <ul> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>code_language</code> (<code>Optional[CodeLanguageLabel]</code>, default: <code>None</code> ) \u2013 <p>Optional[CodeLanguageLabel]: (Default value = None)</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>caption</code> (<code>Optional[Union[TextItem, RefItem]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[TextItem, RefItem]]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_document","title":"add_document","text":"<pre><code>add_document(doc: DoclingDocument, parent: Optional[NodeItem] = None) -> None\n</code></pre> <p>Adds the content from the body of a DoclingDocument to this document under a specific parent.</p> <p>Parameters:</p> <ul> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>DoclingDocument: The document whose content will be added</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: The parent NodeItem under which new items are added (Default value = None)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_form","title":"add_form","text":"<pre><code>add_form(graph: GraphData, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None)\n</code></pre> <p>add_form.</p> <p>Parameters:</p> <ul> <li> <code>graph</code> (<code>GraphData</code>) \u2013 <p>GraphData:</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_formula","title":"add_formula","text":"<pre><code>add_formula(text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n</code></pre> <p>add_formula.</p> <p>Parameters:</p> <ul> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> 
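 <p>A short usage sketch for <code>add_formula</code> (assuming <code>doc</code> is an existing <code>DoclingDocument</code>):</p> <pre><code># append a formula item to the document body\ndoc.add_formula(text=\"E = m c^2\")\n</code></pre>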
</ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_group","title":"add_group","text":"<pre><code>add_group(label: Optional[GroupLabel] = None, name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n</code></pre> <p>add_group.</p> <p>Parameters:</p> <ul> <li> <code>label</code> (<code>Optional[GroupLabel]</code>, default: <code>None</code> ) \u2013 <p>Optional[GroupLabel]: (Default value = None)</p> </li> <li> <code>name</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_heading","title":"add_heading","text":"<pre><code>add_heading(text: str, orig: Optional[str] = None, level: LevelNumber = 1, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n</code></pre> <p>add_heading.</p> <p>Parameters:</p> <ul> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>level</code> (<code>LevelNumber</code>, default: <code>1</code> ) \u2013 <p>LevelNumber: (Default value = 1)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_inline_group","title":"add_inline_group","text":"<pre><code>add_inline_group(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> InlineGroup\n</code></pre> <p>add_inline_group.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_key_values","title":"add_key_values","text":"<pre><code>add_key_values(graph: GraphData, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None)\n</code></pre> <p>add_key_values.</p> <p>Parameters:</p> <ul> <li> <code>graph</code> (<code>GraphData</code>) \u2013 <p>GraphData:</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_list_group","title":"add_list_group","text":"<pre><code>add_list_group(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> ListGroup\n</code></pre> <p>add_list_group.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_list_item","title":"add_list_item","text":"<pre><code>add_list_item(text: str, enumerated: bool = False, 
marker: Optional[str] = None, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n</code></pre> <p>add_list_item.</p> <p>Parameters:</p> <ul> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_node_items","title":"add_node_items","text":"<pre><code>add_node_items(node_items: List[NodeItem], doc: DoclingDocument, parent: Optional[NodeItem] = None) -> None\n</code></pre> <p>Adds multiple NodeItems and their children under a parent in this document.</p> <p>Parameters:</p> <ul> <li> <code>node_items</code> (<code>List[NodeItem]</code>) \u2013 <p>list[NodeItem]: The NodeItems to be added</p> </li> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>DoclingDocument: The document to which the NodeItems and their children belong</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: The parent NodeItem under which new items are added (Default value = None)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_ordered_list","title":"add_ordered_list","text":"<pre><code>add_ordered_list(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n</code></pre> <p>add_ordered_list.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_page","title":"add_page","text":"<pre><code>add_page(page_no: int, size: Size, image: Optional[ImageRef] = None) -> PageItem\n</code></pre> <p>add_page.</p> <p>Parameters:</p> <ul> <li> <code>page_no</code> (<code>int</code>) \u2013 <p>int:</p> </li> <li> <code>size</code> (<code>Size</code>) \u2013 <p>Size:</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_picture","title":"add_picture","text":"<pre><code>add_picture(annotations: Optional[List[PictureDataType]] = None, image: Optional[ImageRef] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None)\n</code></pre> <p>add_picture.</p> <p>Parameters:</p> <ul> <li> <code>annotations</code> (<code>Optional[List[PictureDataType]]</code>, default: <code>None</code> ) \u2013 <p>Optional[List[PictureDataType]]: (Default value = None)</p> </li> <li> <code>caption</code> (<code>Optional[Union[TextItem, RefItem]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[TextItem, RefItem]]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> 
(<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_table","title":"add_table","text":"<pre><code>add_table(data: TableData, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, label: DocItemLabel = TABLE, content_layer: Optional[ContentLayer] = None, annotations: Optional[list[TableAnnotationType]] = None)\n</code></pre> <p>add_table.</p> <p>Parameters:</p> <ul> <li> <code>data</code> (<code>TableData</code>) \u2013 <p>TableData:</p> </li> <li> <code>caption</code> (<code>Optional[Union[TextItem, RefItem]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[TextItem, RefItem]]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> <li> <code>label</code> (<code>DocItemLabel</code>, default: <code>TABLE</code> ) \u2013 <p>DocItemLabel: (Default value = DocItemLabel.TABLE)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_text","title":"add_text","text":"<pre><code>add_text(label: DocItemLabel, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n</code></pre> <p>add_text.</p> <p>Parameters:</p> <ul> <li> <code>label</code> (<code>DocItemLabel</code>) \u2013 <p>DocItemLabel:</p> </li> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_title","title":"add_title","text":"<pre><code>add_title(text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None)\n</code></pre> <p>add_title.</p> <p>Parameters:</p> <ul> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>parent</code> (<code>Optional[NodeItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[NodeItem]: (Default value = None)</p> </li> 
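 <p>Taken together, the <code>add_*</code> methods above allow building a document programmatically. A minimal sketch (assuming <code>docling-core</code> is installed; the content is illustrative):</p> <pre><code>from docling_core.types.doc import DocItemLabel, DoclingDocument\n\n# build a small document from scratch\ndoc = DoclingDocument(name=\"demo\")\ndoc.add_title(text=\"My document\")\ndoc.add_heading(text=\"Introduction\", level=1)\ndoc.add_text(label=DocItemLabel.TEXT, text=\"Hello from Docling.\")\n\n# serialize it to Markdown\nprint(doc.export_to_markdown())\n</code></pre>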
</ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.add_unordered_list","title":"add_unordered_list","text":"<pre><code>add_unordered_list(name: Optional[str] = None, parent: Optional[NodeItem] = None, content_layer: Optional[ContentLayer] = None) -> GroupItem\n</code></pre> <p>add_unordered_list.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.append_child_item","title":"append_child_item","text":"<pre><code>append_child_item(*, child: NodeItem, parent: Optional[NodeItem] = None) -> None\n</code></pre> <p>Adds an item.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.check_version_is_compatible","title":"check_version_is_compatible","text":"<pre><code>check_version_is_compatible(v: str) -> str\n</code></pre> <p>Check if this document version is compatible with SDK schema version.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.delete_items","title":"delete_items","text":"<pre><code>delete_items(*, node_items: List[NodeItem]) -> None\n</code></pre> <p>Deletes an item, given its instance or ref, and any children it has.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.delete_items_range","title":"delete_items_range","text":"<pre><code>delete_items_range(*, start: NodeItem, end: NodeItem, start_inclusive: bool = True, end_inclusive: bool = True) -> None\n</code></pre> <p>Deletes all NodeItems and their children in the range from the start NodeItem to the end NodeItem.</p> <p>Parameters:</p> <ul> <li> <code>start</code> (<code>NodeItem</code>) \u2013 <p>NodeItem: The starting NodeItem of the range</p> </li> <li> <code>end</code> (<code>NodeItem</code>) \u2013 <p>NodeItem: The ending NodeItem of the range</p> </li> <li> <code>start_inclusive</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True): If True, the start NodeItem will also be deleted</p> </li> <li> <code>end_inclusive</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True): If True, the end NodeItem will also be deleted</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_dict","title":"export_to_dict","text":"<pre><code>export_to_dict(mode: str = 'json', by_alias: bool = True, exclude_none: bool = True, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None) -> Dict[str, Any]\n</code></pre> <p>Export to dict.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_doctags","title":"export_to_doctags","text":"<pre><code>export_to_doctags(delim: str = '', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True, add_page_index: bool = True, add_table_cell_location: bool = False, add_table_cell_text: bool = True, minified: bool = False, pages: Optional[set[int]] = None) -> str\n</code></pre> <p>Exports the document content to a DocumentToken format.</p> <p>Operates on a slice of the document's body as defined through arguments from_element and to_element; defaulting to the whole main_text.</p> <p>Parameters:</p> <ul> <li> <code>delim</code> (<code>str</code>, default: <code>''</code> ) \u2013 <p>str: (Default value = \"\") Deprecated</p> </li> <li> 
<code>from_element</code> (<code>int</code>, default: <code>0</code> ) \u2013 <p>int: (Default value = 0)</p> </li> <li> <code>to_element</code> (<code>int</code>, default: <code>maxsize</code> ) \u2013 <p>int: (Default value = maxsize)</p> </li> <li> <code>labels</code> (<code>Optional[set[DocItemLabel]]</code>, default: <code>None</code> ) \u2013 <p>set[DocItemLabel]</p> </li> <li> <code>xsize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>ysize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>add_location</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_content</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_page_index</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_table_cell_text</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>minified</code> (<code>bool</code>, default: <code>False</code> ) \u2013 <p>bool: (Default value = False)</p> </li> <li> <code>pages</code> (<code>Optional[set[int]]</code>, default: <code>None</code> ) \u2013 <p>set[int]: (Default value = None)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>str</code> \u2013 <p>The content of the document formatted as a DocTags string.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_document_tokens","title":"export_to_document_tokens","text":"<pre><code>export_to_document_tokens(*args, **kwargs)\n</code></pre> <p>Export to DocTags format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_element_tree","title":"export_to_element_tree","text":"<pre><code>export_to_element_tree() -> str\n</code></pre> <p>Export_to_element_tree.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_html","title":"export_to_html","text":"<pre><code>export_to_html(from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, enable_chart_tables: bool = True, image_mode: ImageRefMode = PLACEHOLDER, formula_to_mathml: bool = True, page_no: Optional[int] = None, html_lang: str = 'en', html_head: str = 'null', included_content_layers: Optional[set[ContentLayer]] = None, split_page_view: bool = False, include_annotations: bool = True) -> str\n</code></pre> <p>Serialize to HTML.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_markdown","title":"export_to_markdown","text":"<pre><code>export_to_markdown(delim: str = '\\n\\n', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, strict_text: bool = False, escape_underscores: bool = True, image_placeholder: str = '<!-- image -->', enable_chart_tables: bool = True, image_mode: ImageRefMode = PLACEHOLDER, indent: int = 4, text_width: int = -1, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, page_break_placeholder: Optional[str] = None, include_annotations: bool = True, mark_annotations: bool = False) -> str\n</code></pre> <p>Serialize to Markdown.</p> <p>Operates on a slice of the document's body as defined through arguments from_element and to_element; defaulting to the whole document.</p> 
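 <p>For instance, a short usage sketch (assuming <code>doc</code> is an existing <code>DoclingDocument</code>):</p> <pre><code># serialize the whole document to Markdown\nmd = doc.export_to_markdown()\n\n# or only the first ten body elements, wrapped at 72 characters\nhead = doc.export_to_markdown(to_element=10, text_width=72)\n</code></pre>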
<p>Parameters:</p> <ul> <li> <code>delim</code> (<code>str</code>, default: <code>'\\n\\n'</code> ) \u2013 <p>Deprecated.</p> </li> <li> <code>from_element</code> (<code>int</code>, default: <code>0</code> ) \u2013 <p>Body slicing start index (inclusive). (Default value = 0).</p> </li> <li> <code>to_element</code> (<code>int</code>, default: <code>maxsize</code> ) \u2013 <p>Body slicing stop index (exclusive). (Default value = maxsize).</p> </li> <li> <code>labels</code> (<code>Optional[set[DocItemLabel]]</code>, default: <code>None</code> ) \u2013 <p>The set of document labels to include in the export. None falls back to the system-defined default.</p> </li> <li> <code>strict_text</code> (<code>bool</code>, default: <code>False</code> ) \u2013 <p>Deprecated.</p> </li> <li> <code>escape_underscores</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: Whether to escape underscores in the text content of the document. (Default value = True).</p> </li> <li> <code>image_placeholder</code> (<code>str</code>, default: <code>'<!-- image -->'</code> ) \u2013 <p>The placeholder to include to position images in the markdown. (Default value = \"\\<!-- image -->\").</p> </li> <li> <code>image_mode</code> (<code>ImageRefMode</code>, default: <code>PLACEHOLDER</code> ) \u2013 <p>The mode to use for including images in the markdown. (Default value = ImageRefMode.PLACEHOLDER).</p> </li> <li> <code>indent</code> (<code>int</code>, default: <code>4</code> ) \u2013 <p>The indent in spaces of the nested lists. (Default value = 4).</p> </li> <li> <code>included_content_layers</code> (<code>Optional[set[ContentLayer]]</code>, default: <code>None</code> ) \u2013 <p>The set of layers to include in the export. None falls back to the system-defined default.</p> </li> <li> <code>page_break_placeholder</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>The placeholder to include for marking page breaks. None means no page break placeholder will be used.</p> </li> <li> <code>include_annotations</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: Whether to include annotations in the export. (Default value = True).</p> </li> <li> <code>mark_annotations</code> (<code>bool</code>, default: <code>False</code> ) \u2013 <p>bool: Whether to mark annotations in the export; only relevant if include_annotations is True. 
(Default value = False).</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>str</code> \u2013 <p>The exported Markdown representation.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_text","title":"export_to_text","text":"<pre><code>export_to_text(delim: str = '\\n\\n', from_element: int = 0, to_element: int = 1000000, labels: Optional[set[DocItemLabel]] = None) -> str\n</code></pre> <p>export_to_text.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.extract_items_range","title":"extract_items_range","text":"<pre><code>extract_items_range(*, start: NodeItem, end: NodeItem, start_inclusive: bool = True, end_inclusive: bool = True, delete: bool = False) -> DoclingDocument\n</code></pre> <p>Extracts NodeItems and children in the range from the start NodeItem to the end as a new DoclingDocument.</p> <p>Parameters:</p> <ul> <li> <code>start</code> (<code>NodeItem</code>) \u2013 <p>NodeItem: The starting NodeItem of the range (must be a direct child of the document body)</p> </li> <li> <code>end</code> (<code>NodeItem</code>) \u2013 <p>NodeItem: The ending NodeItem of the range (must be a direct child of the document body)</p> </li> <li> <code>start_inclusive</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True): If True, the start NodeItem will also be extracted</p> </li> <li> <code>end_inclusive</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True): If True, the end NodeItem will also be extracted</p> </li> <li> <code>delete</code> (<code>bool</code>, default: <code>False</code> ) \u2013 <p>bool: (Default value = False): If True, extracted items are deleted in the original document</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>DoclingDocument</code> \u2013 <p>DoclingDocument: A new document containing the extracted NodeItems and their children</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.get_visualization","title":"get_visualization","text":"<pre><code>get_visualization(show_label: bool = True, show_branch_numbering: bool = False) -> dict[Optional[int], Image]\n</code></pre> <p>Get visualization of the document as images by page.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_code","title":"insert_code","text":"<pre><code>insert_code(sibling: NodeItem, text: str, code_language: Optional[CodeLanguageLabel] = None, orig: Optional[str] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> CodeItem\n</code></pre> <p>Creates a new CodeItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>code_language</code> (<code>Optional[CodeLanguageLabel]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>caption</code> (<code>Optional[Union[TextItem, RefItem]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[TextItem, RefItem]]: (Default value = None)</p> </li> 
<li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>formatting</code> (<code>Optional[Formatting]</code>, default: <code>None</code> ) \u2013 <p>Optional[Formatting]: (Default value = None)</p> </li> <li> <code>hyperlink</code> (<code>Optional[Union[AnyUrl, Path]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[AnyUrl, Path]]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>CodeItem</code> \u2013 <p>CodeItem: The newly created CodeItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_document","title":"insert_document","text":"<pre><code>insert_document(doc: DoclingDocument, sibling: NodeItem, after: bool = True) -> None\n</code></pre> <p>Inserts the content from the body of a DoclingDocument into this document at a specific position.</p> <p>Parameters:</p> <ul> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>DoclingDocument: The document whose content will be inserted</p> </li> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem: The NodeItem after/before which the new items will be inserted</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: If True, insert after the sibling; if False, insert before (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_form","title":"insert_form","text":"<pre><code>insert_form(sibling: NodeItem, graph: GraphData, prov: Optional[ProvenanceItem] = None, after: bool = True) -> FormItem\n</code></pre> <p>Creates a new FormItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>graph</code> (<code>GraphData</code>) \u2013 <p>GraphData:</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>FormItem</code> \u2013 <p>FormItem: The newly created FormItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_formula","title":"insert_formula","text":"<pre><code>insert_formula(sibling: NodeItem, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> FormulaItem\n</code></pre> <p>Creates a new FormulaItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = 
None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>formatting</code> (<code>Optional[Formatting]</code>, default: <code>None</code> ) \u2013 <p>Optional[Formatting]: (Default value = None)</p> </li> <li> <code>hyperlink</code> (<code>Optional[Union[AnyUrl, Path]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[AnyUrl, Path]]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>FormulaItem</code> \u2013 <p>FormulaItem: The newly created FormulaItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_group","title":"insert_group","text":"<pre><code>insert_group(sibling: NodeItem, label: Optional[GroupLabel] = None, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> GroupItem\n</code></pre> <p>Creates a new GroupItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>label</code> (<code>Optional[GroupLabel]</code>, default: <code>None</code> ) \u2013 <p>Optional[GroupLabel]: (Default value = None)</p> </li> <li> <code>name</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>GroupItem</code> \u2013 <p>GroupItem: The newly created GroupItem.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_heading","title":"insert_heading","text":"<pre><code>insert_heading(sibling: NodeItem, text: str, orig: Optional[str] = None, level: LevelNumber = 1, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> SectionHeaderItem\n</code></pre> <p>Creates a new SectionHeaderItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>level</code> (<code>LevelNumber</code>, default: <code>1</code> ) \u2013 <p>LevelNumber: (Default value = 1)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>formatting</code> (<code>Optional[Formatting]</code>, default: <code>None</code> ) \u2013 
<p>Optional[Formatting]: (Default value = None)</p> </li> <li> <code>hyperlink</code> (<code>Optional[Union[AnyUrl, Path]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[AnyUrl, Path]]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>SectionHeaderItem</code> \u2013 <p>SectionHeaderItem: The newly created SectionHeaderItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_inline_group","title":"insert_inline_group","text":"<pre><code>insert_inline_group(sibling: NodeItem, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> InlineGroup\n</code></pre> <p>Creates a new InlineGroup item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>name</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>InlineGroup</code> \u2013 <p>InlineGroup: The newly created InlineGroup item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_item_after_sibling","title":"insert_item_after_sibling","text":"<pre><code>insert_item_after_sibling(*, new_item: NodeItem, sibling: NodeItem) -> None\n</code></pre> <p>Inserts an item, given its node_item instance, as a sibling immediately after <code>sibling</code>.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_item_before_sibling","title":"insert_item_before_sibling","text":"<pre><code>insert_item_before_sibling(*, new_item: NodeItem, sibling: NodeItem) -> None\n</code></pre> <p>Inserts an item, given its node_item instance, as a sibling immediately before <code>sibling</code>.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_key_values","title":"insert_key_values","text":"<pre><code>insert_key_values(sibling: NodeItem, graph: GraphData, prov: Optional[ProvenanceItem] = None, after: bool = True) -> KeyValueItem\n</code></pre> <p>Creates a new KeyValueItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>graph</code> (<code>GraphData</code>) \u2013 <p>GraphData:</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>KeyValueItem</code> \u2013 <p>KeyValueItem: The newly created KeyValueItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_list_group","title":"insert_list_group","text":"<pre><code>insert_list_group(sibling: NodeItem, name: Optional[str] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> ListGroup\n</code></pre> <p>Creates a new ListGroup item and inserts it 
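into the document.</p> <p>A minimal building sketch, assuming <code>doc</code> is an existing DoclingDocument and <code>anchor</code> is one of its items (both names are illustrative):</p> <pre><code>group = doc.insert_list_group(sibling=anchor, name='steps', after=True)\n# children are attached to the group, not to the anchor\ndoc.add_list_item(text='First step', parent=group)\ndoc.add_list_item(text='Second step', parent=group)\n</code></pre> <p>The new group is linked 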
into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>name</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>ListGroup</code> \u2013 <p>ListGroup: The newly created ListGroup item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_list_item","title":"insert_list_item","text":"<pre><code>insert_list_item(sibling: NodeItem, text: str, enumerated: bool = False, marker: Optional[str] = None, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> ListItem\n</code></pre> <p>Creates a new ListItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>enumerated</code> (<code>bool</code>, default: <code>False</code> ) \u2013 <p>bool: (Default value = False)</p> </li> <li> <code>marker</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>formatting</code> (<code>Optional[Formatting]</code>, default: <code>None</code> ) \u2013 <p>Optional[Formatting]: (Default value = None)</p> </li> <li> <code>hyperlink</code> (<code>Optional[Union[AnyUrl, Path]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[AnyUrl, Path]]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>ListItem</code> \u2013 <p>ListItem: The newly created ListItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_node_items","title":"insert_node_items","text":"<pre><code>insert_node_items(sibling: NodeItem, node_items: List[NodeItem], doc: DoclingDocument, after: bool = True) -> None\n</code></pre> <p>Insert multiple NodeItems and their children at a specific position in the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem: The NodeItem after/before which the new items will be inserted</p> </li> <li> <code>node_items</code> (<code>List[NodeItem]</code>) \u2013 <p>list[NodeItem]: The NodeItems to be inserted</p> </li> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>DoclingDocument: The document to which the NodeItems and their children belong</p> </li> <li> 
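</li> </ul> <p>A minimal sketch of moving two items from another document, assuming <code>anchor</code> belongs to <code>doc</code> while <code>item1</code> and <code>item2</code> belong to <code>src_doc</code> (all names illustrative):</p> <pre><code># inserts the items, with their children, right after anchor\ndoc.insert_node_items(sibling=anchor, node_items=[item1, item2], doc=src_doc)\n</code></pre> <p>Remaining parameter:</p> <ul> <li> 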
<code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: If True, insert after the sibling; if False, insert before (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_picture","title":"insert_picture","text":"<pre><code>insert_picture(sibling: NodeItem, annotations: Optional[List[PictureDataType]] = None, image: Optional[ImageRef] = None, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, after: bool = True) -> PictureItem\n</code></pre> <p>Creates a new PictureItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>annotations</code> (<code>Optional[List[PictureDataType]]</code>, default: <code>None</code> ) \u2013 <p>Optional[List[PictureDataType]]: (Default value = None)</p> </li> <li> <code>image</code> (<code>Optional[ImageRef]</code>, default: <code>None</code> ) \u2013 <p>Optional[ImageRef]: (Default value = None)</p> </li> <li> <code>caption</code> (<code>Optional[Union[TextItem, RefItem]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[TextItem, RefItem]]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>PictureItem</code> \u2013 <p>PictureItem: The newly created PictureItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_table","title":"insert_table","text":"<pre><code>insert_table(sibling: NodeItem, data: TableData, caption: Optional[Union[TextItem, RefItem]] = None, prov: Optional[ProvenanceItem] = None, label: DocItemLabel = TABLE, content_layer: Optional[ContentLayer] = None, annotations: Optional[list[TableAnnotationType]] = None, after: bool = True) -> TableItem\n</code></pre> <p>Creates a new TableItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>data</code> (<code>TableData</code>) \u2013 <p>TableData:</p> </li> <li> <code>caption</code> (<code>Optional[Union[TextItem, RefItem]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[TextItem, RefItem]]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>label</code> (<code>DocItemLabel</code>, default: <code>TABLE</code> ) \u2013 <p>DocItemLabel: (Default value = DocItemLabel.TABLE)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>annotations</code> (<code>Optional[list[TableAnnotationType]]</code>, default: <code>None</code> ) \u2013 <p>Optional[List[TableAnnotationType]]: (Default value = None)</p> </li> <li> 
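</li> </ul> <p>A minimal sketch that builds a two-column table and inserts it after <code>anchor</code> (names illustrative; it assumes <code>add_rows</code> grows the cell grid as documented for TableData further below):</p> <pre><code>from docling_core.types.doc import TableData\n\ndata = TableData(num_cols=2)\ndata.add_rows([['Metric', 'Value'], ['Pages', '12']])\ntable = doc.insert_table(sibling=anchor, data=data)\n</code></pre> <p>Remaining parameter:</p> <ul> <li> 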
<code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>TableItem</code> \u2013 <p>TableItem: The newly created TableItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_text","title":"insert_text","text":"<pre><code>insert_text(sibling: NodeItem, label: DocItemLabel, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> TextItem\n</code></pre> <p>Creates a new TextItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>label</code> (<code>DocItemLabel</code>) \u2013 <p>DocItemLabel:</p> </li> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>formatting</code> (<code>Optional[Formatting]</code>, default: <code>None</code> ) \u2013 <p>Optional[Formatting]: (Default value = None)</p> </li> <li> <code>hyperlink</code> (<code>Optional[Union[AnyUrl, Path]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[AnyUrl, Path]]: (Default value = None)</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>TextItem</code> \u2013 <p>TextItem: The newly created TextItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.insert_title","title":"insert_title","text":"<pre><code>insert_title(sibling: NodeItem, text: str, orig: Optional[str] = None, prov: Optional[ProvenanceItem] = None, content_layer: Optional[ContentLayer] = None, formatting: Optional[Formatting] = None, hyperlink: Optional[Union[AnyUrl, Path]] = None, after: bool = True) -> TitleItem\n</code></pre> <p>Creates a new TitleItem item and inserts it into the document.</p> <p>Parameters:</p> <ul> <li> <code>sibling</code> (<code>NodeItem</code>) \u2013 <p>NodeItem:</p> </li> <li> <code>text</code> (<code>str</code>) \u2013 <p>str:</p> </li> <li> <code>orig</code> (<code>Optional[str]</code>, default: <code>None</code> ) \u2013 <p>Optional[str]: (Default value = None)</p> </li> <li> <code>prov</code> (<code>Optional[ProvenanceItem]</code>, default: <code>None</code> ) \u2013 <p>Optional[ProvenanceItem]: (Default value = None)</p> </li> <li> <code>content_layer</code> (<code>Optional[ContentLayer]</code>, default: <code>None</code> ) \u2013 <p>Optional[ContentLayer]: (Default value = None)</p> </li> <li> <code>formatting</code> (<code>Optional[Formatting]</code>, default: <code>None</code> ) \u2013 <p>Optional[Formatting]: (Default value = None)</p> </li> <li> <code>hyperlink</code> (<code>Optional[Union[AnyUrl, Path]]</code>, default: <code>None</code> ) \u2013 <p>Optional[Union[AnyUrl, Path]]: (Default value = None)</p> </li> <li> 
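</li> </ul> <p>A one-line sketch, assuming <code>anchor</code> is the current first item of <code>doc</code> (names illustrative):</p> <pre><code>title = doc.insert_title(sibling=anchor, text='Quarterly Report', after=False)\n</code></pre> <p>Passing <code>after=False</code> places the title before the anchor. Remaining parameter:</p> <ul> <li> 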
<code>after</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>TitleItem</code> \u2013 <p>TitleItem: The newly created TitleItem item.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.iterate_items","title":"iterate_items","text":"<pre><code>iterate_items(root: Optional[NodeItem] = None, with_groups: bool = False, traverse_pictures: bool = False, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, _level: int = 0) -> Iterable[Tuple[NodeItem, int]]\n</code></pre> <p>Iterate elements with level.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_doctags","title":"load_from_doctags","text":"<pre><code>load_from_doctags(doctag_document: DocTagsDocument, document_name: str = 'Document') -> DoclingDocument\n</code></pre> <p>Load Docling document from lists of DocTags and Images.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_json","title":"load_from_json","text":"<pre><code>load_from_json(filename: Union[str, Path]) -> DoclingDocument\n</code></pre> <p>load_from_json.</p> <p>Parameters:</p> <ul> <li> <code>filename</code> (<code>Union[str, Path]</code>) \u2013 <p>The filename to load a saved DoclingDocument from a .json.</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>DoclingDocument</code> \u2013 <p>The loaded DoclingDocument.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.load_from_yaml","title":"load_from_yaml","text":"<pre><code>load_from_yaml(filename: Union[str, Path]) -> DoclingDocument\n</code></pre> <p>load_from_yaml.</p> <p>Args: filename: The filename to load a YAML-serialized DoclingDocument from.</p> <p>Returns: DoclingDocument: the loaded DoclingDocument</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.num_pages","title":"num_pages","text":"<pre><code>num_pages()\n</code></pre> <p>num_pages.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.print_element_tree","title":"print_element_tree","text":"<pre><code>print_element_tree()\n</code></pre> <p>Print_element_tree.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.replace_item","title":"replace_item","text":"<pre><code>replace_item(*, new_item: NodeItem, old_item: NodeItem) -> None\n</code></pre> <p>Replace item with new item.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_doctags","title":"save_as_doctags","text":"<pre><code>save_as_doctags(filename: Union[str, Path], delim: str = '', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True, add_page_index: bool = True, add_table_cell_location: bool = False, add_table_cell_text: bool = True, minified: bool = False)\n</code></pre> <p>Save the document content to DocTags format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_document_tokens","title":"save_as_document_tokens","text":"<pre><code>save_as_document_tokens(*args, **kwargs)\n</code></pre> <p>Save the document content to a DocumentToken 
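format.</p> <p>A related sketch using <code>save_as_doctags</code>, which targets the same token format (the output path is illustrative):</p> <pre><code>doc.save_as_doctags('report.doctags.txt')\n</code></pre> <p>Both methods write the DocumentToken 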
format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_html","title":"save_as_html","text":"<pre><code>save_as_html(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, image_mode: ImageRefMode = PLACEHOLDER, formula_to_mathml: bool = True, page_no: Optional[int] = None, html_lang: str = 'en', html_head: str = 'null', included_content_layers: Optional[set[ContentLayer]] = None, split_page_view: bool = False, include_annotations: bool = True)\n</code></pre> <p>Save to HTML.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_json","title":"save_as_json","text":"<pre><code>save_as_json(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, image_mode: ImageRefMode = EMBEDDED, indent: int = 2, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None)\n</code></pre> <p>Save as json.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_markdown","title":"save_as_markdown","text":"<pre><code>save_as_markdown(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, delim: str = '\\n\\n', from_element: int = 0, to_element: int = maxsize, labels: Optional[set[DocItemLabel]] = None, strict_text: bool = False, escaping_underscores: bool = True, image_placeholder: str = '<!-- image -->', image_mode: ImageRefMode = PLACEHOLDER, indent: int = 4, text_width: int = -1, page_no: Optional[int] = None, included_content_layers: Optional[set[ContentLayer]] = None, page_break_placeholder: Optional[str] = None, include_annotations: bool = True)\n</code></pre> <p>Save to markdown.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.save_as_yaml","title":"save_as_yaml","text":"<pre><code>save_as_yaml(filename: Union[str, Path], artifacts_dir: Optional[Path] = None, image_mode: ImageRefMode = EMBEDDED, default_flow_style: bool = False, coord_precision: Optional[int] = None, confid_precision: Optional[int] = None)\n</code></pre> <p>Save as yaml.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.transform_to_content_layer","title":"transform_to_content_layer","text":"<pre><code>transform_to_content_layer(data: dict) -> dict\n</code></pre> <p>transform_to_content_layer.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_document","title":"validate_document","text":"<pre><code>validate_document(d: DoclingDocument)\n</code></pre> <p>validate_document.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_misplaced_list_items","title":"validate_misplaced_list_items","text":"<pre><code>validate_misplaced_list_items()\n</code></pre> <p>validate_misplaced_list_items.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DoclingDocument.validate_tree","title":"validate_tree","text":"<pre><code>validate_tree(root) -> bool\n</code></pre> <p>validate_tree.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin","title":"DocumentOrigin","text":"<p> Bases: <code>BaseModel</code></p> <p>FileSource.</p> <p>Methods:</p> <ul> <li> <code>parse_hex_string</code> \u2013 <p>parse_hex_string.</p> </li> <li> <code>validate_mimetype</code> \u2013 <p>validate_mimetype.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>binary_hash</code> (<code>Uint64</code>) \u2013 </li> 
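</ul> <p>A minimal construction sketch (the field values are illustrative):</p> <pre><code>from docling_core.types.doc import DocumentOrigin\n\norigin = DocumentOrigin(\n    filename='report.pdf',\n    mimetype='application/pdf',\n    binary_hash=0,  # illustrative; normally the 64-bit hash of the source bytes\n)\n</code></pre> <p>Remaining attributes:</p> <ul> 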
<li> <code>filename</code> (<code>str</code>) \u2013 </li> <li> <code>mimetype</code> (<code>str</code>) \u2013 </li> <li> <code>uri</code> (<code>Optional[AnyUrl]</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.binary_hash","title":"binary_hash","text":"<pre><code>binary_hash: Uint64\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.filename","title":"filename","text":"<pre><code>filename: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.mimetype","title":"mimetype","text":"<pre><code>mimetype: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.uri","title":"uri","text":"<pre><code>uri: Optional[AnyUrl] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.parse_hex_string","title":"parse_hex_string","text":"<pre><code>parse_hex_string(value)\n</code></pre> <p>parse_hex_string.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DocumentOrigin.validate_mimetype","title":"validate_mimetype","text":"<pre><code>validate_mimetype(v)\n</code></pre> <p>validate_mimetype.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem","title":"DocItem","text":"<p> Bases: <code>NodeItem</code></p> <p>DocItem.</p> <p>Methods:</p> <ul> <li> <code>get_annotations</code> \u2013 <p>Get the annotations of this DocItem.</p> </li> <li> <code>get_image</code> \u2013 <p>Returns the image of this DocItem.</p> </li> <li> <code>get_location_tokens</code> \u2013 <p>Get the location string for the BaseCell.</p> </li> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>label</code> (<code>DocItemLabel</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>prov</code> (<code>List[ProvenanceItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.label","title":"label","text":"<pre><code>label: DocItemLabel\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.prov","title":"prov","text":"<pre><code>prov: List[ProvenanceItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = 
Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_annotations","title":"get_annotations","text":"<pre><code>get_annotations() -> Sequence[BaseAnnotation]\n</code></pre> <p>Get the annotations of this DocItem.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_image","title":"get_image","text":"<pre><code>get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n</code></pre> <p>Returns the image of this DocItem.</p> <p>The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_location_tokens","title":"get_location_tokens","text":"<pre><code>get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n</code></pre> <p>Get the location string for the BaseCell.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel","title":"DocItemLabel","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>DocItemLabel.</p> <p>Methods:</p> <ul> <li> <code>get_color</code> \u2013 <p>Return the RGB color associated with a given label.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>CAPTION</code> \u2013 </li> <li> <code>CHART</code> \u2013 </li> <li> <code>CHECKBOX_SELECTED</code> \u2013 </li> <li> <code>CHECKBOX_UNSELECTED</code> \u2013 </li> <li> <code>CODE</code> \u2013 </li> <li> <code>DOCUMENT_INDEX</code> \u2013 </li> <li> <code>EMPTY_VALUE</code> \u2013 </li> <li> <code>FOOTNOTE</code> \u2013 </li> <li> <code>FORM</code> \u2013 </li> <li> <code>FORMULA</code> \u2013 </li> <li> <code>GRADING_SCALE</code> \u2013 </li> <li> <code>HANDWRITTEN_TEXT</code> \u2013 </li> <li> <code>KEY_VALUE_REGION</code> \u2013 </li> <li> <code>LIST_ITEM</code> \u2013 </li> <li> <code>PAGE_FOOTER</code> \u2013 </li> <li> <code>PAGE_HEADER</code> \u2013 </li> <li> <code>PARAGRAPH</code> \u2013 </li> <li> <code>PICTURE</code> \u2013 </li> <li> <code>REFERENCE</code> \u2013 </li> <li> <code>SECTION_HEADER</code> \u2013 </li> <li> <code>TABLE</code> \u2013 </li> <li> <code>TEXT</code> \u2013 </li> <li> <code>TITLE</code> \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CAPTION","title":"CAPTION","text":"<pre><code>CAPTION = 'caption'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHART","title":"CHART","text":"<pre><code>CHART = 'chart'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHECKBOX_SELECTED","title":"CHECKBOX_SELECTED","text":"<pre><code>CHECKBOX_SELECTED = 'checkbox_selected'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CHECKBOX_UNSELECTED","title":"CHECKBOX_UNSELECTED","text":"<pre><code>CHECKBOX_UNSELECTED = 'checkbox_unselected'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.CODE","title":"CODE","text":"<pre><code>CODE = 'code'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.DOCUMENT_INDEX","title":"DOCUMENT_INDEX","text":"<pre><code>DOCUMENT_INDEX = 
'document_index'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.EMPTY_VALUE","title":"EMPTY_VALUE","text":"<pre><code>EMPTY_VALUE = 'empty_value'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FOOTNOTE","title":"FOOTNOTE","text":"<pre><code>FOOTNOTE = 'footnote'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FORM","title":"FORM","text":"<pre><code>FORM = 'form'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.FORMULA","title":"FORMULA","text":"<pre><code>FORMULA = 'formula'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.GRADING_SCALE","title":"GRADING_SCALE","text":"<pre><code>GRADING_SCALE = 'grading_scale'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.HANDWRITTEN_TEXT","title":"HANDWRITTEN_TEXT","text":"<pre><code>HANDWRITTEN_TEXT = 'handwritten_text'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.KEY_VALUE_REGION","title":"KEY_VALUE_REGION","text":"<pre><code>KEY_VALUE_REGION = 'key_value_region'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.LIST_ITEM","title":"LIST_ITEM","text":"<pre><code>LIST_ITEM = 'list_item'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PAGE_FOOTER","title":"PAGE_FOOTER","text":"<pre><code>PAGE_FOOTER = 'page_footer'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PAGE_HEADER","title":"PAGE_HEADER","text":"<pre><code>PAGE_HEADER = 'page_header'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PARAGRAPH","title":"PARAGRAPH","text":"<pre><code>PARAGRAPH = 'paragraph'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.PICTURE","title":"PICTURE","text":"<pre><code>PICTURE = 'picture'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.REFERENCE","title":"REFERENCE","text":"<pre><code>REFERENCE = 'reference'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.SECTION_HEADER","title":"SECTION_HEADER","text":"<pre><code>SECTION_HEADER = 'section_header'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TABLE","title":"TABLE","text":"<pre><code>TABLE = 'table'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TEXT","title":"TEXT","text":"<pre><code>TEXT = 'text'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.TITLE","title":"TITLE","text":"<pre><code>TITLE = 'title'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.DocItemLabel.get_color","title":"get_color","text":"<pre><code>get_color(label: DocItemLabel) -> Tuple[int, int, int]\n</code></pre> <p>Return the RGB color associated with a given label.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem","title":"ProvenanceItem","text":"<p> Bases: <code>BaseModel</code></p> <p>ProvenanceItem.</p> <p>Attributes:</p> <ul> <li> <code>bbox</code> (<code>BoundingBox</code>) \u2013 </li> <li> <code>charspan</code> (<code>Tuple[int, int]</code>) \u2013 </li> <li> <code>page_no</code> (<code>int</code>) 
\u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.bbox","title":"bbox","text":"<pre><code>bbox: BoundingBox\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.charspan","title":"charspan","text":"<pre><code>charspan: Tuple[int, int]\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ProvenanceItem.page_no","title":"page_no","text":"<pre><code>page_no: int\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem","title":"GroupItem","text":"<p> Bases: <code>NodeItem</code></p> <p>GroupItem.</p> <p>Methods:</p> <ul> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>label</code> (<code>GroupLabel</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>name</code> (<code>str</code>) \u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.label","title":"label","text":"<pre><code>label: GroupLabel = UNSPECIFIED\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.name","title":"name","text":"<pre><code>name: str = 'group'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel","title":"GroupLabel","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>GroupLabel.</p> <p>Attributes:</p> <ul> <li> <code>CHAPTER</code> \u2013 </li> <li> <code>COMMENT_SECTION</code> \u2013 </li> <li> <code>FORM_AREA</code> \u2013 </li> <li> <code>INLINE</code> \u2013 </li> <li> <code>KEY_VALUE_AREA</code> \u2013 </li> <li> <code>LIST</code> \u2013 </li> <li> <code>ORDERED_LIST</code> \u2013 </li> <li> <code>PICTURE_AREA</code> \u2013 </li> <li> <code>SECTION</code> \u2013 </li> <li> <code>SHEET</code> \u2013 </li> <li> <code>SLIDE</code> \u2013 </li> <li> <code>UNSPECIFIED</code> \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.CHAPTER","title":"CHAPTER","text":"<pre><code>CHAPTER = 
'chapter'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.COMMENT_SECTION","title":"COMMENT_SECTION","text":"<pre><code>COMMENT_SECTION = 'comment_section'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.FORM_AREA","title":"FORM_AREA","text":"<pre><code>FORM_AREA = 'form_area'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.INLINE","title":"INLINE","text":"<pre><code>INLINE = 'inline'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.KEY_VALUE_AREA","title":"KEY_VALUE_AREA","text":"<pre><code>KEY_VALUE_AREA = 'key_value_area'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.LIST","title":"LIST","text":"<pre><code>LIST = 'list'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.ORDERED_LIST","title":"ORDERED_LIST","text":"<pre><code>ORDERED_LIST = 'ordered_list'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.PICTURE_AREA","title":"PICTURE_AREA","text":"<pre><code>PICTURE_AREA = 'picture_area'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SECTION","title":"SECTION","text":"<pre><code>SECTION = 'section'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SHEET","title":"SHEET","text":"<pre><code>SHEET = 'sheet'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.SLIDE","title":"SLIDE","text":"<pre><code>SLIDE = 'slide'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.GroupLabel.UNSPECIFIED","title":"UNSPECIFIED","text":"<pre><code>UNSPECIFIED = 'unspecified'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem","title":"NodeItem","text":"<p> Bases: <code>BaseModel</code></p> <p>NodeItem.</p> <p>Methods:</p> <ul> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.NodeItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> 
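<p>A minimal usage sketch, assuming <code>item</code> is any node attached to a document (the name is illustrative):</p> <pre><code>ref = item.get_ref()\nprint(ref.cref)  # a JSON pointer such as '#/texts/0'\n</code></pre> 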
<p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem","title":"PageItem","text":"<p> Bases: <code>BaseModel</code></p> <p>PageItem.</p> <p>Attributes:</p> <ul> <li> <code>image</code> (<code>Optional[ImageRef]</code>) \u2013 </li> <li> <code>page_no</code> (<code>int</code>) \u2013 </li> <li> <code>size</code> (<code>Size</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.image","title":"image","text":"<pre><code>image: Optional[ImageRef] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.page_no","title":"page_no","text":"<pre><code>page_no: int\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PageItem.size","title":"size","text":"<pre><code>size: Size\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem","title":"FloatingItem","text":"<p> Bases: <code>DocItem</code></p> <p>FloatingItem.</p> <p>Methods:</p> <ul> <li> <code>caption_text</code> \u2013 <p>Computes the caption as a single text.</p> </li> <li> <code>get_annotations</code> \u2013 <p>Get the annotations of this DocItem.</p> </li> <li> <code>get_image</code> \u2013 <p>Returns the image corresponding to this FloatingItem.</p> </li> <li> <code>get_location_tokens</code> \u2013 <p>Get the location string for the BaseCell.</p> </li> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>captions</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>footnotes</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>image</code> (<code>Optional[ImageRef]</code>) \u2013 </li> <li> <code>label</code> (<code>DocItemLabel</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>prov</code> (<code>List[ProvenanceItem]</code>) \u2013 </li> <li> <code>references</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.captions","title":"captions","text":"<pre><code>captions: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.footnotes","title":"footnotes","text":"<pre><code>footnotes: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.image","title":"image","text":"<pre><code>image: Optional[ImageRef] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.label","title":"label","text":"<pre><code>label: DocItemLabel\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.model_config","title":"model_config","text":"<pre><code>model_config = 
ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.prov","title":"prov","text":"<pre><code>prov: List[ProvenanceItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.references","title":"references","text":"<pre><code>references: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.caption_text","title":"caption_text","text":"<pre><code>caption_text(doc: DoclingDocument) -> str\n</code></pre> <p>Computes the caption as a single text.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_annotations","title":"get_annotations","text":"<pre><code>get_annotations() -> Sequence[BaseAnnotation]\n</code></pre> <p>Get the annotations of this DocItem.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_image","title":"get_image","text":"<pre><code>get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n</code></pre> <p>Returns the image corresponding to this FloatingItem.</p> <p>This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.</p> <p>In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_location_tokens","title":"get_location_tokens","text":"<pre><code>get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n</code></pre> <p>Get the location string for the BaseCell.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.FloatingItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem","title":"TextItem","text":"<p> Bases: <code>DocItem</code></p> <p>TextItem.</p> <p>Methods:</p> <ul> <li> <code>export_to_doctags</code> \u2013 <p>Export text element to document tokens format.</p> </li> <li> <code>export_to_document_tokens</code> \u2013 <p>Export to DocTags format.</p> </li> <li> <code>get_annotations</code> \u2013 <p>Get the annotations of this DocItem.</p> </li> <li> <code>get_image</code> \u2013 <p>Returns the image of this DocItem.</p> </li> <li> <code>get_location_tokens</code> \u2013 <p>Get the location string for the BaseCell.</p> </li> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>formatting</code> (<code>Optional[Formatting]</code>) \u2013 </li> <li> <code>hyperlink</code> (<code>Optional[Union[AnyUrl, Path]]</code>) \u2013 </li> <li> <code>label</code> (<code>Literal[CAPTION, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FOOTNOTE, PAGE_FOOTER, PAGE_HEADER, PARAGRAPH, REFERENCE, TEXT, EMPTY_VALUE]</code>) 
\u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>orig</code> (<code>str</code>) \u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>prov</code> (<code>List[ProvenanceItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 </li> <li> <code>text</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.formatting","title":"formatting","text":"<pre><code>formatting: Optional[Formatting] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.hyperlink","title":"hyperlink","text":"<pre><code>hyperlink: Optional[Union[AnyUrl, Path]] = Field(union_mode='left_to_right', default=None)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.label","title":"label","text":"<pre><code>label: Literal[CAPTION, CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FOOTNOTE, PAGE_FOOTER, PAGE_HEADER, PARAGRAPH, REFERENCE, TEXT, EMPTY_VALUE]\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.orig","title":"orig","text":"<pre><code>orig: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.prov","title":"prov","text":"<pre><code>prov: List[ProvenanceItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.text","title":"text","text":"<pre><code>text: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.export_to_doctags","title":"export_to_doctags","text":"<pre><code>export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n</code></pre> <p>Export text element to document tokens format.</p> <p>Parameters:</p> <ul> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>\"DoclingDocument\":</p> </li> <li> <code>new_line</code> (<code>str</code>, default: <code>''</code> ) \u2013 <p>str (Default value = \"\") Deprecated</p> </li> <li> <code>xsize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>ysize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>add_location</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_content</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> 
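</ul> <p>A minimal sketch, assuming <code>text_item</code> is a TextItem of <code>doc</code> and that the call returns the serialized DocTags for this element:</p> <pre><code># drop the location tokens, keep the content\ntokens = text_item.export_to_doctags(doc, add_location=False)\n</code></pre> <ul> 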
</ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.export_to_document_tokens","title":"export_to_document_tokens","text":"<pre><code>export_to_document_tokens(*args, **kwargs)\n</code></pre> <p>Export to DocTags format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_annotations","title":"get_annotations","text":"<pre><code>get_annotations() -> Sequence[BaseAnnotation]\n</code></pre> <p>Get the annotations of this DocItem.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_image","title":"get_image","text":"<pre><code>get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n</code></pre> <p>Returns the image of this DocItem.</p> <p>The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_location_tokens","title":"get_location_tokens","text":"<pre><code>get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n</code></pre> <p>Get the location string for the BaseCell.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TextItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem","title":"TableItem","text":"<p> Bases: <code>FloatingItem</code></p> <p>TableItem.</p> <p>Methods:</p> <ul> <li> <code>add_annotation</code> \u2013 <p>Add an annotation to the table.</p> </li> <li> <code>caption_text</code> \u2013 <p>Computes the caption as a single text.</p> </li> <li> <code>export_to_dataframe</code> \u2013 <p>Export the table as a Pandas DataFrame.</p> </li> <li> <code>export_to_doctags</code> \u2013 <p>Export table to document tokens format.</p> </li> <li> <code>export_to_document_tokens</code> \u2013 <p>Export to DocTags format.</p> </li> <li> <code>export_to_html</code> \u2013 <p>Export the table as html.</p> </li> <li> <code>export_to_markdown</code> \u2013 <p>Export the table as markdown.</p> </li> <li> <code>export_to_otsl</code> \u2013 <p>Export the table as OTSL.</p> </li> <li> <code>get_annotations</code> \u2013 <p>Get the annotations of this TableItem.</p> </li> <li> <code>get_image</code> \u2013 <p>Returns the image corresponding to this FloatingItem.</p> </li> <li> <code>get_location_tokens</code> \u2013 <p>Get the location string for the BaseCell.</p> </li> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>annotations</code> (<code>List[TableAnnotationType]</code>) \u2013 </li> <li> <code>captions</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>data</code> (<code>TableData</code>) \u2013 </li> <li> <code>footnotes</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>image</code> (<code>Optional[ImageRef]</code>) \u2013 </li> <li> <code>label</code> (<code>Literal[DOCUMENT_INDEX, TABLE]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>prov</code> (<code>List[ProvenanceItem]</code>) \u2013 </li> <li> <code>references</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 
</li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.annotations","title":"annotations","text":"<pre><code>annotations: List[TableAnnotationType] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.captions","title":"captions","text":"<pre><code>captions: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.data","title":"data","text":"<pre><code>data: TableData\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.footnotes","title":"footnotes","text":"<pre><code>footnotes: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.image","title":"image","text":"<pre><code>image: Optional[ImageRef] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.label","title":"label","text":"<pre><code>label: Literal[DOCUMENT_INDEX, TABLE] = TABLE\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.prov","title":"prov","text":"<pre><code>prov: List[ProvenanceItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.references","title":"references","text":"<pre><code>references: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.add_annotation","title":"add_annotation","text":"<pre><code>add_annotation(annotation: TableAnnotationType) -> None\n</code></pre> <p>Add an annotation to the table.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.caption_text","title":"caption_text","text":"<pre><code>caption_text(doc: DoclingDocument) -> str\n</code></pre> <p>Computes the caption as a single text.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_dataframe","title":"export_to_dataframe","text":"<pre><code>export_to_dataframe() -> DataFrame\n</code></pre> <p>Export the table as a Pandas DataFrame.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_doctags","title":"export_to_doctags","text":"<pre><code>export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_cell_location: bool = True, add_cell_text: bool = True, add_caption: bool = True)\n</code></pre> <p>Export table to document tokens format.</p> <p>Parameters:</p> <ul> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>\"DoclingDocument\":</p> </li> <li> <code>new_line</code> 
(<code>str</code>, default: <code>''</code> ) \u2013 <p>str (Default value = \"\") Deprecated</p> </li> <li> <code>xsize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>ysize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>add_location</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_cell_location</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_cell_text</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_caption</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_document_tokens","title":"export_to_document_tokens","text":"<pre><code>export_to_document_tokens(*args, **kwargs)\n</code></pre> <p>Export to DocTags format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_html","title":"export_to_html","text":"<pre><code>export_to_html(doc: Optional[DoclingDocument] = None, add_caption: bool = True) -> str\n</code></pre> <p>Export the table as html.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_markdown","title":"export_to_markdown","text":"<pre><code>export_to_markdown(doc: Optional[DoclingDocument] = None) -> str\n</code></pre> <p>Export the table as markdown.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.export_to_otsl","title":"export_to_otsl","text":"<pre><code>export_to_otsl(doc: DoclingDocument, add_cell_location: bool = True, add_cell_text: bool = True, xsize: int = 500, ysize: int = 500) -> str\n</code></pre> <p>Export the table as OTSL.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_annotations","title":"get_annotations","text":"<pre><code>get_annotations() -> Sequence[BaseAnnotation]\n</code></pre> <p>Get the annotations of this TableItem.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_image","title":"get_image","text":"<pre><code>get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n</code></pre> <p>Returns the image corresponding to this FloatingItem.</p> <p>This function returns the PIL image from self.image if one is available. 
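</p> <p>A minimal sketch, assuming <code>table</code> comes from a conversion that kept page images (the output filename is illustrative):</p> <pre><code>img = table.get_image(doc)  # PIL Image or None\nif img is not None:\n    img.save('table.png')\n</code></pre> <p>self.image is checked first. 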
Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.</p> <p>In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_location_tokens","title":"get_location_tokens","text":"<pre><code>get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n</code></pre> <p>Get the location string for the BaseCell.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell","title":"TableCell","text":"<p> Bases: <code>BaseModel</code></p> <p>TableCell.</p> <p>Methods:</p> <ul> <li> <code>from_dict_format</code> \u2013 <p>from_dict_format.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>bbox</code> (<code>Optional[BoundingBox]</code>) \u2013 </li> <li> <code>col_span</code> (<code>int</code>) \u2013 </li> <li> <code>column_header</code> (<code>bool</code>) \u2013 </li> <li> <code>end_col_offset_idx</code> (<code>int</code>) \u2013 </li> <li> <code>end_row_offset_idx</code> (<code>int</code>) \u2013 </li> <li> <code>row_header</code> (<code>bool</code>) \u2013 </li> <li> <code>row_section</code> (<code>bool</code>) \u2013 </li> <li> <code>row_span</code> (<code>int</code>) \u2013 </li> <li> <code>start_col_offset_idx</code> (<code>int</code>) \u2013 </li> <li> <code>start_row_offset_idx</code> (<code>int</code>) \u2013 </li> <li> <code>text</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.bbox","title":"bbox","text":"<pre><code>bbox: Optional[BoundingBox] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.col_span","title":"col_span","text":"<pre><code>col_span: int = 1\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.column_header","title":"column_header","text":"<pre><code>column_header: bool = False\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.end_col_offset_idx","title":"end_col_offset_idx","text":"<pre><code>end_col_offset_idx: int\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.end_row_offset_idx","title":"end_row_offset_idx","text":"<pre><code>end_row_offset_idx: int\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_header","title":"row_header","text":"<pre><code>row_header: bool = False\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_section","title":"row_section","text":"<pre><code>row_section: bool = False\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.row_span","title":"row_span","text":"<pre><code>row_span: int = 1\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.start_col_offset_idx","title":"start_col_offset_idx","text":"<pre><code>start_col_offset_idx: int\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.start_row_offset_idx","title":"start_row_offset_idx","text":"<pre><code>start_row_offset_idx: 
int\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.text","title":"text","text":"<pre><code>text: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCell.from_dict_format","title":"from_dict_format","text":"<pre><code>from_dict_format(data: Any) -> Any\n</code></pre> <p>from_dict_format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData","title":"TableData","text":"<p> Bases: <code>BaseModel</code></p> <p>BaseTableData.</p> <p>Methods:</p> <ul> <li> <code>add_row</code> \u2013 <p>Add a new row to the table from a list of strings.</p> </li> <li> <code>add_rows</code> \u2013 <p>Add multiple new rows to the table from a list of lists of strings.</p> </li> <li> <code>get_column_bounding_boxes</code> \u2013 <p>Get the minimal bounding box for each column in the table.</p> </li> <li> <code>get_row_bounding_boxes</code> \u2013 <p>Get the minimal bounding box for each row in the table.</p> </li> <li> <code>insert_row</code> \u2013 <p>Insert a new row from a list of strings before/after a specific index in the table.</p> </li> <li> <code>insert_rows</code> \u2013 <p>Insert multiple new rows from a list of lists of strings before/after a specific index in the table.</p> </li> <li> <code>pop_row</code> \u2013 <p>Remove and return the last row from the table.</p> </li> <li> <code>remove_row</code> \u2013 <p>Remove a row from the table by its index.</p> </li> <li> <code>remove_rows</code> \u2013 <p>Remove rows from the table by their indices.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>grid</code> (<code>List[List[TableCell]]</code>) \u2013 <p>grid.</p> </li> <li> <code>num_cols</code> (<code>int</code>) \u2013 </li> <li> <code>num_rows</code> (<code>int</code>) \u2013 </li> <li> <code>table_cells</code> (<code>List[TableCell]</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.grid","title":"grid","text":"<pre><code>grid: List[List[TableCell]]\n</code></pre> <p>grid.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.num_cols","title":"num_cols","text":"<pre><code>num_cols: int = 0\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.num_rows","title":"num_rows","text":"<pre><code>num_rows: int = 0\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.table_cells","title":"table_cells","text":"<pre><code>table_cells: List[TableCell] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.add_row","title":"add_row","text":"<pre><code>add_row(row: List[str]) -> None\n</code></pre> <p>Add a new row to the table from a list of strings.</p> <p>Parameters:</p> <ul> <li> <code>row</code> (<code>List[str]</code>) \u2013 <p>List[str]: A list of strings representing the content of the new row.</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.add_rows","title":"add_rows","text":"<pre><code>add_rows(rows: List[List[str]]) -> None\n</code></pre> <p>Add multiple new rows to the table from a list of lists of strings.</p> <p>Parameters:</p> <ul> <li> <code>rows</code> (<code>List[List[str]]</code>) \u2013 <p>List[List[str]]: A list of lists, where each inner list represents the content of a new row.</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> 
</ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.get_column_bounding_boxes","title":"get_column_bounding_boxes","text":"<pre><code>get_column_bounding_boxes() -> dict[int, BoundingBox]\n</code></pre> <p>Get the minimal bounding box for each column in the table.</p> <p>Returns: dict[int, BoundingBox]: A dictionary mapping each column index to the minimal bounding box that encompasses all cells in that column; columns without cell bounding boxes are omitted.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.get_row_bounding_boxes","title":"get_row_bounding_boxes","text":"<pre><code>get_row_bounding_boxes() -> dict[int, BoundingBox]\n</code></pre> <p>Get the minimal bounding box for each row in the table.</p> <p>Returns: dict[int, BoundingBox]: A dictionary mapping each row index to the minimal bounding box that encompasses all cells in that row; rows without cell bounding boxes are omitted.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.insert_row","title":"insert_row","text":"<pre><code>insert_row(row_index: int, row: List[str], after: bool = False) -> None\n</code></pre> <p>Insert a new row from a list of strings before/after a specific index in the table.</p> <p>Parameters:</p> <ul> <li> <code>row_index</code> (<code>int</code>) \u2013 <p>int: The index at which to insert the new row. (Starting from 0)</p> </li> <li> <code>row</code> (<code>List[str]</code>) \u2013 <p>List[str]: A list of strings representing the content of the new row.</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>False</code> ) \u2013 <p>bool: If True, insert the row after the specified index, otherwise before it. (Default is False)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.insert_rows","title":"insert_rows","text":"<pre><code>insert_rows(row_index: int, rows: List[List[str]], after: bool = False) -> None\n</code></pre> <p>Insert multiple new rows from a list of lists of strings before/after a specific index in the table.</p> <p>Parameters:</p> <ul> <li> <code>row_index</code> (<code>int</code>) \u2013 <p>int: The index at which to insert the new rows. (Starting from 0)</p> </li> <li> <code>rows</code> (<code>List[List[str]]</code>) \u2013 <p>List[List[str]]: A list of lists, where each inner list represents the content of a new row.</p> </li> <li> <code>after</code> (<code>bool</code>, default: <code>False</code> ) \u2013 <p>bool: If True, insert the rows after the specified index, otherwise before it. (Default is False)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>None</code> \u2013 <p>None</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.pop_row","title":"pop_row","text":"<pre><code>pop_row() -> List[TableCell]\n</code></pre> <p>Remove and return the last row from the table.</p> <p>Returns:</p> <ul> <li> <code>List[TableCell]</code> \u2013 <p>List[TableCell]: A list of TableCell objects representing the popped row.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.remove_row","title":"remove_row","text":"<pre><code>remove_row(row_index: int) -> List[TableCell]\n</code></pre> <p>Remove a row from the table by its index.</p> <p>Parameters:</p> <ul> <li> <code>row_index</code> (<code>int</code>) \u2013 <p>int: The index of the row to remove. 
(Starting from 0)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>List[TableCell]</code> \u2013 <p>List[TableCell]: A list of TableCell objects representing the removed row.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableData.remove_rows","title":"remove_rows","text":"<pre><code>remove_rows(indices: List[int]) -> List[List[TableCell]]\n</code></pre> <p>Remove rows from the table by their indices.</p> <p>Parameters:</p> <ul> <li> <code>indices</code> (<code>List[int]</code>) \u2013 <p>List[int]: A list of indices of the rows to remove. (Starting from 0)</p> </li> </ul> <p>Returns:</p> <ul> <li> <code>List[List[TableCell]]</code> \u2013 <p>List[List[TableCell]]: A list representation of the removed rows as lists of TableCell objects.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel","title":"TableCellLabel","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>TableCellLabel.</p> <p>Methods:</p> <ul> <li> <code>get_color</code> \u2013 <p>Return the RGB color associated with a given label.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>BODY</code> \u2013 </li> <li> <code>COLUMN_HEADER</code> \u2013 </li> <li> <code>ROW_HEADER</code> \u2013 </li> <li> <code>ROW_SECTION</code> \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.BODY","title":"BODY","text":"<pre><code>BODY = 'body'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.COLUMN_HEADER","title":"COLUMN_HEADER","text":"<pre><code>COLUMN_HEADER = 'col_header'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.ROW_HEADER","title":"ROW_HEADER","text":"<pre><code>ROW_HEADER = 'row_header'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.ROW_SECTION","title":"ROW_SECTION","text":"<pre><code>ROW_SECTION = 'row_section'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.TableCellLabel.get_color","title":"get_color","text":"<pre><code>get_color(label: TableCellLabel) -> Tuple[int, int, int]\n</code></pre> <p>Return the RGB color associated with a given label.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem","title":"KeyValueItem","text":"<p> Bases: <code>FloatingItem</code></p> <p>KeyValueItem.</p> <p>Methods:</p> <ul> <li> <code>caption_text</code> \u2013 <p>Computes the caption as a single text.</p> </li> <li> <code>export_to_document_tokens</code> \u2013 <p>Export key value item to document tokens format.</p> </li> <li> <code>get_annotations</code> \u2013 <p>Get the annotations of this DocItem.</p> </li> <li> <code>get_image</code> \u2013 <p>Returns the image corresponding to this FloatingItem.</p> </li> <li> <code>get_location_tokens</code> \u2013 <p>Get the location string for the BaseCell.</p> </li> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>captions</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>footnotes</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>graph</code> (<code>GraphData</code>) \u2013 </li> <li> <code>image</code> (<code>Optional[ImageRef]</code>) \u2013 </li> <li> <code>label</code> (<code>Literal[KEY_VALUE_REGION]</code>) \u2013 </li> <li> <code>model_config</code> 
\u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>prov</code> (<code>List[ProvenanceItem]</code>) \u2013 </li> <li> <code>references</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.captions","title":"captions","text":"<pre><code>captions: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.footnotes","title":"footnotes","text":"<pre><code>footnotes: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.graph","title":"graph","text":"<pre><code>graph: GraphData\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.image","title":"image","text":"<pre><code>image: Optional[ImageRef] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.label","title":"label","text":"<pre><code>label: Literal[KEY_VALUE_REGION] = KEY_VALUE_REGION\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.prov","title":"prov","text":"<pre><code>prov: List[ProvenanceItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.references","title":"references","text":"<pre><code>references: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.caption_text","title":"caption_text","text":"<pre><code>caption_text(doc: DoclingDocument) -> str\n</code></pre> <p>Computes the caption as a single text.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.export_to_document_tokens","title":"export_to_document_tokens","text":"<pre><code>export_to_document_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n</code></pre> <p>Export key value item to document tokens format.</p> <p>Parameters:</p> <ul> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>\"DoclingDocument\":</p> </li> <li> <code>new_line</code> (<code>str</code>, default: <code>''</code> ) \u2013 <p>str (Default value = \"\") Deprecated</p> </li> <li> <code>xsize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>ysize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>add_location</code> 
(<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_content</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_annotations","title":"get_annotations","text":"<pre><code>get_annotations() -> Sequence[BaseAnnotation]\n</code></pre> <p>Get the annotations of this DocItem.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_image","title":"get_image","text":"<pre><code>get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n</code></pre> <p>Returns the image corresponding to this FloatingItem.</p> <p>This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.</p> <p>In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_location_tokens","title":"get_location_tokens","text":"<pre><code>get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n</code></pre> <p>Get the location string for the BaseCell.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.KeyValueItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem","title":"SectionHeaderItem","text":"<p> Bases: <code>TextItem</code></p> <p>SectionItem.</p> <p>Methods:</p> <ul> <li> <code>export_to_doctags</code> \u2013 <p>Export text element to document tokens format.</p> </li> <li> <code>export_to_document_tokens</code> \u2013 <p>Export to DocTags format.</p> </li> <li> <code>get_annotations</code> \u2013 <p>Get the annotations of this DocItem.</p> </li> <li> <code>get_image</code> \u2013 <p>Returns the image of this DocItem.</p> </li> <li> <code>get_location_tokens</code> \u2013 <p>Get the location string for the BaseCell.</p> </li> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>formatting</code> (<code>Optional[Formatting]</code>) \u2013 </li> <li> <code>hyperlink</code> (<code>Optional[Union[AnyUrl, Path]]</code>) \u2013 </li> <li> <code>label</code> (<code>Literal[SECTION_HEADER]</code>) \u2013 </li> <li> <code>level</code> (<code>LevelNumber</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>orig</code> (<code>str</code>) \u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>prov</code> (<code>List[ProvenanceItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 </li> <li> <code>text</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = 
BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.formatting","title":"formatting","text":"<pre><code>formatting: Optional[Formatting] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.hyperlink","title":"hyperlink","text":"<pre><code>hyperlink: Optional[Union[AnyUrl, Path]] = Field(union_mode='left_to_right', default=None)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.label","title":"label","text":"<pre><code>label: Literal[SECTION_HEADER] = SECTION_HEADER\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.level","title":"level","text":"<pre><code>level: LevelNumber = 1\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.orig","title":"orig","text":"<pre><code>orig: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.prov","title":"prov","text":"<pre><code>prov: List[ProvenanceItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.text","title":"text","text":"<pre><code>text: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.export_to_doctags","title":"export_to_doctags","text":"<pre><code>export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_content: bool = True)\n</code></pre> <p>Export text element to document tokens format.</p> <p>Parameters:</p> <ul> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>\"DoclingDocument\":</p> </li> <li> <code>new_line</code> (<code>str</code>, default: <code>''</code> ) \u2013 <p>str (Default value = \"\") Deprecated</p> </li> <li> <code>xsize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>ysize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>add_location</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_content</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.export_to_document_tokens","title":"export_to_document_tokens","text":"<pre><code>export_to_document_tokens(*args, **kwargs)\n</code></pre> <p>Export to DocTags format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_annotations","title":"get_annotations","text":"<pre><code>get_annotations() -> Sequence[BaseAnnotation]\n</code></pre> <p>Get the annotations of this 
DocItem.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_image","title":"get_image","text":"<pre><code>get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n</code></pre> <p>Returns the image of this DocItem.</p> <p>The function returns None if this DocItem has no valid provenance or if a valid image of the page containing this DocItem is not available in doc.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_location_tokens","title":"get_location_tokens","text":"<pre><code>get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n</code></pre> <p>Get the location string for the BaseCell.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.SectionHeaderItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem","title":"PictureItem","text":"<p> Bases: <code>FloatingItem</code></p> <p>PictureItem.</p> <p>Methods:</p> <ul> <li> <code>caption_text</code> \u2013 <p>Computes the caption as a single text.</p> </li> <li> <code>export_to_doctags</code> \u2013 <p>Export picture to document tokens format.</p> </li> <li> <code>export_to_document_tokens</code> \u2013 <p>Export to DocTags format.</p> </li> <li> <code>export_to_html</code> \u2013 <p>Export picture to HTML format.</p> </li> <li> <code>export_to_markdown</code> \u2013 <p>Export picture to Markdown format.</p> </li> <li> <code>get_annotations</code> \u2013 <p>Get the annotations of this PictureItem.</p> </li> <li> <code>get_image</code> \u2013 <p>Returns the image corresponding to this FloatingItem.</p> </li> <li> <code>get_location_tokens</code> \u2013 <p>Get the location string for the BaseCell.</p> </li> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>annotations</code> (<code>List[PictureDataType]</code>) \u2013 </li> <li> <code>captions</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>children</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>content_layer</code> (<code>ContentLayer</code>) \u2013 </li> <li> <code>footnotes</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>image</code> (<code>Optional[ImageRef]</code>) \u2013 </li> <li> <code>label</code> (<code>Literal[PICTURE, CHART]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>parent</code> (<code>Optional[RefItem]</code>) \u2013 </li> <li> <code>prov</code> (<code>List[ProvenanceItem]</code>) \u2013 </li> <li> <code>references</code> (<code>List[RefItem]</code>) \u2013 </li> <li> <code>self_ref</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.annotations","title":"annotations","text":"<pre><code>annotations: List[PictureDataType] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.captions","title":"captions","text":"<pre><code>captions: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.children","title":"children","text":"<pre><code>children: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.content_layer","title":"content_layer","text":"<pre><code>content_layer: ContentLayer = 
BODY\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.footnotes","title":"footnotes","text":"<pre><code>footnotes: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.image","title":"image","text":"<pre><code>image: Optional[ImageRef] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.label","title":"label","text":"<pre><code>label: Literal[PICTURE, CHART] = PICTURE\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.parent","title":"parent","text":"<pre><code>parent: Optional[RefItem] = None\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.prov","title":"prov","text":"<pre><code>prov: List[ProvenanceItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.references","title":"references","text":"<pre><code>references: List[RefItem] = []\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.self_ref","title":"self_ref","text":"<pre><code>self_ref: str = Field(pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.caption_text","title":"caption_text","text":"<pre><code>caption_text(doc: DoclingDocument) -> str\n</code></pre> <p>Computes the caption as a single text.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_doctags","title":"export_to_doctags","text":"<pre><code>export_to_doctags(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500, add_location: bool = True, add_caption: bool = True, add_content: bool = True)\n</code></pre> <p>Export picture to document tokens format.</p> <p>Parameters:</p> <ul> <li> <code>doc</code> (<code>DoclingDocument</code>) \u2013 <p>\"DoclingDocument\":</p> </li> <li> <code>new_line</code> (<code>str</code>, default: <code>''</code> ) \u2013 <p>str (Default value = \"\") Deprecated</p> </li> <li> <code>xsize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>ysize</code> (<code>int</code>, default: <code>500</code> ) \u2013 <p>int: (Default value = 500)</p> </li> <li> <code>add_location</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_caption</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> <li> <code>add_content</code> (<code>bool</code>, default: <code>True</code> ) \u2013 <p>bool: (Default value = True)</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_document_tokens","title":"export_to_document_tokens","text":"<pre><code>export_to_document_tokens(*args, **kwargs)\n</code></pre> <p>Export to DocTags format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_html","title":"export_to_html","text":"<pre><code>export_to_html(doc: DoclingDocument, add_caption: bool = True, image_mode: ImageRefMode = PLACEHOLDER) -> str\n</code></pre> <p>Export picture to HTML 
format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.export_to_markdown","title":"export_to_markdown","text":"<pre><code>export_to_markdown(doc: DoclingDocument, add_caption: bool = True, image_mode: ImageRefMode = EMBEDDED, image_placeholder: str = '<!-- image -->') -> str\n</code></pre> <p>Export picture to Markdown format.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_annotations","title":"get_annotations","text":"<pre><code>get_annotations() -> Sequence[BaseAnnotation]\n</code></pre> <p>Get the annotations of this PictureItem.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_image","title":"get_image","text":"<pre><code>get_image(doc: DoclingDocument, prov_index: int = 0) -> Optional[Image]\n</code></pre> <p>Returns the image corresponding to this FloatingItem.</p> <p>This function returns the PIL image from self.image if one is available. Otherwise, it uses DocItem.get_image to get an image of this FloatingItem.</p> <p>In particular, when self.image is None, the function returns None if this FloatingItem has no valid provenance or the doc does not contain a valid image for the required page.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_location_tokens","title":"get_location_tokens","text":"<pre><code>get_location_tokens(doc: DoclingDocument, new_line: str = '', xsize: int = 500, ysize: int = 500) -> str\n</code></pre> <p>Get the location string for the BaseCell.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureItem.get_ref","title":"get_ref","text":"<pre><code>get_ref() -> RefItem\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef","title":"ImageRef","text":"<p> Bases: <code>BaseModel</code></p> <p>ImageRef.</p> <p>Methods:</p> <ul> <li> <code>from_pil</code> \u2013 <p>Construct ImageRef from a PIL Image.</p> </li> <li> <code>validate_mimetype</code> \u2013 <p>validate_mimetype.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>dpi</code> (<code>int</code>) \u2013 </li> <li> <code>mimetype</code> (<code>str</code>) \u2013 </li> <li> <code>pil_image</code> (<code>Optional[Image]</code>) \u2013 <p>Return the PIL Image.</p> </li> <li> <code>size</code> (<code>Size</code>) \u2013 </li> <li> <code>uri</code> (<code>Union[AnyUrl, Path]</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.dpi","title":"dpi","text":"<pre><code>dpi: int\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.mimetype","title":"mimetype","text":"<pre><code>mimetype: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.pil_image","title":"pil_image","text":"<pre><code>pil_image: Optional[Image]\n</code></pre> <p>Return the PIL Image.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.size","title":"size","text":"<pre><code>size: Size\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.uri","title":"uri","text":"<pre><code>uri: Union[AnyUrl, Path] = Field(union_mode='left_to_right')\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.from_pil","title":"from_pil","text":"<pre><code>from_pil(image: Image, dpi: int) -> Self\n</code></pre> <p>Construct ImageRef from a PIL 
Image.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRef.validate_mimetype","title":"validate_mimetype","text":"<pre><code>validate_mimetype(v)\n</code></pre> <p>validate_mimetype.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass","title":"PictureClassificationClass","text":"<p> Bases: <code>BaseModel</code></p> <p>PictureClassificationData.</p> <p>Attributes:</p> <ul> <li> <code>class_name</code> (<code>str</code>) \u2013 </li> <li> <code>confidence</code> (<code>float</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass.class_name","title":"class_name","text":"<pre><code>class_name: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationClass.confidence","title":"confidence","text":"<pre><code>confidence: float\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData","title":"PictureClassificationData","text":"<p> Bases: <code>BaseAnnotation</code></p> <p>PictureClassificationData.</p> <p>Attributes:</p> <ul> <li> <code>kind</code> (<code>Literal['classification']</code>) \u2013 </li> <li> <code>predicted_classes</code> (<code>List[PictureClassificationClass]</code>) \u2013 </li> <li> <code>provenance</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.kind","title":"kind","text":"<pre><code>kind: Literal['classification'] = 'classification'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.predicted_classes","title":"predicted_classes","text":"<pre><code>predicted_classes: List[PictureClassificationClass]\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.PictureClassificationData.provenance","title":"provenance","text":"<pre><code>provenance: str\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem","title":"RefItem","text":"<p> Bases: <code>BaseModel</code></p> <p>RefItem.</p> <p>Methods:</p> <ul> <li> <code>get_ref</code> \u2013 <p>get_ref.</p> </li> <li> <code>resolve</code> \u2013 <p>Resolve the path in the document.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>cref</code> (<code>str</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.cref","title":"cref","text":"<pre><code>cref: str = Field(alias='$ref', pattern=_JSON_POINTER_REGEX)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.model_config","title":"model_config","text":"<pre><code>model_config = ConfigDict(populate_by_name=True)\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.get_ref","title":"get_ref","text":"<pre><code>get_ref()\n</code></pre> <p>get_ref.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.RefItem.resolve","title":"resolve","text":"<pre><code>resolve(doc: DoclingDocument)\n</code></pre> <p>Resolve the path in the document.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox","title":"BoundingBox","text":"<p> Bases: <code>BaseModel</code></p> <p>BoundingBox.</p> <p>Methods:</p> <ul> <li> <code>area</code> \u2013 <p>area.</p> </li> <li> <code>as_tuple</code> \u2013 <p>as_tuple.</p> </li> <li> <code>enclosing_bbox</code> \u2013 <p>Create a 
bounding box that covers all of the given boxes.</p> </li> <li> <code>expand_by_scale</code> \u2013 <p>expand_to_size.</p> </li> <li> <code>from_tuple</code> \u2013 <p>from_tuple.</p> </li> <li> <code>intersection_area_with</code> \u2013 <p>Calculate the intersection area with another bounding box.</p> </li> <li> <code>intersection_over_self</code> \u2013 <p>intersection_over_self.</p> </li> <li> <code>intersection_over_union</code> \u2013 <p>intersection_over_union.</p> </li> <li> <code>is_above</code> \u2013 <p>is_above.</p> </li> <li> <code>is_horizontally_connected</code> \u2013 <p>is_horizontally_connected.</p> </li> <li> <code>is_left_of</code> \u2013 <p>is_left_of.</p> </li> <li> <code>is_strictly_above</code> \u2013 <p>is_strictly_above.</p> </li> <li> <code>is_strictly_left_of</code> \u2013 <p>is_strictly_left_of.</p> </li> <li> <code>normalized</code> \u2013 <p>normalized.</p> </li> <li> <code>overlaps</code> \u2013 <p>overlaps.</p> </li> <li> <code>overlaps_horizontally</code> \u2013 <p>Check if two bounding boxes overlap horizontally.</p> </li> <li> <code>overlaps_vertically</code> \u2013 <p>Check if two bounding boxes overlap vertically.</p> </li> <li> <code>overlaps_vertically_with_iou</code> \u2013 <p>overlaps_y_with_iou.</p> </li> <li> <code>resize_by_scale</code> \u2013 <p>resize_by_scale.</p> </li> <li> <code>scale_to_size</code> \u2013 <p>scale_to_size.</p> </li> <li> <code>scaled</code> \u2013 <p>scaled.</p> </li> <li> <code>to_bottom_left_origin</code> \u2013 <p>to_bottom_left_origin.</p> </li> <li> <code>to_top_left_origin</code> \u2013 <p>to_top_left_origin.</p> </li> <li> <code>union_area_with</code> \u2013 <p>Calculates the union area with another bounding box.</p> </li> <li> <code>x_overlap_with</code> \u2013 <p>Calculates the horizontal overlap with another bounding box.</p> </li> <li> <code>x_union_with</code> \u2013 <p>Calculates the horizontal union dimension with another bounding box.</p> </li> <li> <code>y_overlap_with</code> \u2013 <p>Calculates the vertical overlap with another bounding box, respecting coordinate origin.</p> </li> <li> <code>y_union_with</code> \u2013 <p>Calculates the vertical union dimension with another bounding box, respecting coordinate origin.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>b</code> (<code>float</code>) \u2013 </li> <li> <code>coord_origin</code> (<code>CoordOrigin</code>) \u2013 </li> <li> <code>height</code> \u2013 <p>height.</p> </li> <li> <code>l</code> (<code>float</code>) \u2013 </li> <li> <code>r</code> (<code>float</code>) \u2013 </li> <li> <code>t</code> (<code>float</code>) \u2013 </li> <li> <code>width</code> \u2013 <p>width.</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.b","title":"b","text":"<pre><code>b: float\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.coord_origin","title":"coord_origin","text":"<pre><code>coord_origin: CoordOrigin = TOPLEFT\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.height","title":"height","text":"<pre><code>height\n</code></pre> <p>height.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.l","title":"l","text":"<pre><code>l: float\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.r","title":"r","text":"<pre><code>r: float\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.t","title":"t","text":"<pre><code>t: 
float\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.width","title":"width","text":"<pre><code>width\n</code></pre> <p>width.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.area","title":"area","text":"<pre><code>area() -> float\n</code></pre> <p>area.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.as_tuple","title":"as_tuple","text":"<pre><code>as_tuple() -> Tuple[float, float, float, float]\n</code></pre> <p>as_tuple.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.enclosing_bbox","title":"enclosing_bbox","text":"<pre><code>enclosing_bbox(boxes: List[BoundingBox]) -> BoundingBox\n</code></pre> <p>Create a bounding box that covers all of the given boxes.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.expand_by_scale","title":"expand_by_scale","text":"<pre><code>expand_by_scale(x_scale: float, y_scale: float) -> BoundingBox\n</code></pre> <p>expand_to_size.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.from_tuple","title":"from_tuple","text":"<pre><code>from_tuple(coord: Tuple[float, ...], origin: CoordOrigin)\n</code></pre> <p>from_tuple.</p> <p>Parameters:</p> <ul> <li> <code>coord</code> (<code>Tuple[float, ...]</code>) \u2013 <p>Tuple[float:</p> </li> <li> <code>...]</code> \u2013 </li> <li> <code>origin</code> (<code>CoordOrigin</code>) \u2013 <p>CoordOrigin:</p> </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_area_with","title":"intersection_area_with","text":"<pre><code>intersection_area_with(other: BoundingBox) -> float\n</code></pre> <p>Calculate the intersection area with another bounding box.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_over_self","title":"intersection_over_self","text":"<pre><code>intersection_over_self(other: BoundingBox, eps: float = 1e-06) -> float\n</code></pre> <p>intersection_over_self.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.intersection_over_union","title":"intersection_over_union","text":"<pre><code>intersection_over_union(other: BoundingBox, eps: float = 1e-06) -> float\n</code></pre> <p>intersection_over_union.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_above","title":"is_above","text":"<pre><code>is_above(other: BoundingBox) -> bool\n</code></pre> <p>is_above.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_horizontally_connected","title":"is_horizontally_connected","text":"<pre><code>is_horizontally_connected(elem_i: BoundingBox, elem_j: BoundingBox) -> bool\n</code></pre> <p>is_horizontally_connected.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_left_of","title":"is_left_of","text":"<pre><code>is_left_of(other: BoundingBox) -> bool\n</code></pre> <p>is_left_of.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_strictly_above","title":"is_strictly_above","text":"<pre><code>is_strictly_above(other: BoundingBox, eps: float = 0.001) -> bool\n</code></pre> <p>is_strictly_above.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.is_strictly_left_of","title":"is_strictly_left_of","text":"<pre><code>is_strictly_left_of(other: BoundingBox, eps: float = 0.001) -> bool\n</code></pre> 
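<p>As a usage illustration (a minimal sketch, not taken from the library docs; the coordinate values are made up), the <code>from_tuple</code> constructor and the predicates above compose as follows:</p> <pre><code>from docling_core.types.doc import BoundingBox, CoordOrigin\n\n# Two boxes in a TOPLEFT coordinate system, given as (l, t, r, b).\na = BoundingBox.from_tuple((10.0, 10.0, 60.0, 40.0), origin=CoordOrigin.TOPLEFT)\nb = BoundingBox.from_tuple((30.0, 20.0, 90.0, 70.0), origin=CoordOrigin.TOPLEFT)\n\nassert a.overlaps(b)                 # the boxes intersect\niou = a.intersection_over_union(b)   # a float in [0, 1]\nassert not a.is_strictly_left_of(b)  # a.r = 60 reaches past b.l = 30\n</code></pre>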
<p>is_strictly_left_of.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.normalized","title":"normalized","text":"<pre><code>normalized(page_size: Size)\n</code></pre> <p>normalized.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps","title":"overlaps","text":"<pre><code>overlaps(other: BoundingBox) -> bool\n</code></pre> <p>overlaps.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_horizontally","title":"overlaps_horizontally","text":"<pre><code>overlaps_horizontally(other: BoundingBox) -> bool\n</code></pre> <p>Check if two bounding boxes overlap horizontally.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_vertically","title":"overlaps_vertically","text":"<pre><code>overlaps_vertically(other: BoundingBox) -> bool\n</code></pre> <p>Check if two bounding boxes overlap vertically.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.overlaps_vertically_with_iou","title":"overlaps_vertically_with_iou","text":"<pre><code>overlaps_vertically_with_iou(other: BoundingBox, iou: float) -> bool\n</code></pre> <p>overlaps_y_with_iou.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.resize_by_scale","title":"resize_by_scale","text":"<pre><code>resize_by_scale(x_scale: float, y_scale: float)\n</code></pre> <p>resize_by_scale.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.scale_to_size","title":"scale_to_size","text":"<pre><code>scale_to_size(old_size: Size, new_size: Size)\n</code></pre> <p>scale_to_size.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.scaled","title":"scaled","text":"<pre><code>scaled(scale: float)\n</code></pre> <p>scaled.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.to_bottom_left_origin","title":"to_bottom_left_origin","text":"<pre><code>to_bottom_left_origin(page_height: float) -> BoundingBox\n</code></pre> <p>to_bottom_left_origin.</p> <p>Parameters:</p> <ul> <li> <code>page_height</code> (<code>float</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.to_top_left_origin","title":"to_top_left_origin","text":"<pre><code>to_top_left_origin(page_height: float) -> BoundingBox\n</code></pre> <p>to_top_left_origin.</p> <p>Parameters:</p> <ul> <li> <code>page_height</code> (<code>float</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.union_area_with","title":"union_area_with","text":"<pre><code>union_area_with(other: BoundingBox) -> float\n</code></pre> <p>Calculates the union area with another bounding box.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.x_overlap_with","title":"x_overlap_with","text":"<pre><code>x_overlap_with(other: BoundingBox) -> float\n</code></pre> <p>Calculates the horizontal overlap with another bounding box.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.x_union_with","title":"x_union_with","text":"<pre><code>x_union_with(other: BoundingBox) -> float\n</code></pre> <p>Calculates the horizontal union dimension with another bounding box.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.y_overlap_with","title":"y_overlap_with","text":"<pre><code>y_overlap_with(other: BoundingBox) -> float\n</code></pre> <p>Calculates the vertical 
overlap with another bounding box, respecting coordinate origin.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.BoundingBox.y_union_with","title":"y_union_with","text":"<pre><code>y_union_with(other: BoundingBox) -> float\n</code></pre> <p>Calculates the vertical union dimension with another bounding box, respecting coordinate origin.</p>"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin","title":"CoordOrigin","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>CoordOrigin.</p> <p>Attributes:</p> <ul> <li> <code>BOTTOMLEFT</code> \u2013 </li> <li> <code>TOPLEFT</code> \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin.BOTTOMLEFT","title":"BOTTOMLEFT","text":"<pre><code>BOTTOMLEFT = 'BOTTOMLEFT'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.CoordOrigin.TOPLEFT","title":"TOPLEFT","text":"<pre><code>TOPLEFT = 'TOPLEFT'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode","title":"ImageRefMode","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>ImageRefMode.</p> <p>Attributes:</p> <ul> <li> <code>EMBEDDED</code> \u2013 </li> <li> <code>PLACEHOLDER</code> \u2013 </li> <li> <code>REFERENCED</code> \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.EMBEDDED","title":"EMBEDDED","text":"<pre><code>EMBEDDED = 'embedded'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.PLACEHOLDER","title":"PLACEHOLDER","text":"<pre><code>PLACEHOLDER = 'placeholder'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.ImageRefMode.REFERENCED","title":"REFERENCED","text":"<pre><code>REFERENCED = 'referenced'\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.Size","title":"Size","text":"<p> Bases: <code>BaseModel</code></p> <p>Size.</p> <p>Methods:</p> <ul> <li> <code>as_tuple</code> \u2013 <p>as_tuple.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>height</code> (<code>float</code>) \u2013 </li> <li> <code>width</code> (<code>float</code>) \u2013 </li> </ul>"},{"location":"reference/docling_document/#docling_core.types.doc.Size.height","title":"height","text":"<pre><code>height: float = 0.0\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.Size.width","title":"width","text":"<pre><code>width: float = 0.0\n</code></pre>"},{"location":"reference/docling_document/#docling_core.types.doc.Size.as_tuple","title":"as_tuple","text":"<pre><code>as_tuple()\n</code></pre> <p>as_tuple.</p>"},{"location":"reference/document_converter/","title":"Document converter","text":"<p>This is an automatic generated API reference of the main components of Docling.</p>"},{"location":"reference/document_converter/#docling.document_converter","title":"document_converter","text":"<p>Classes:</p> <ul> <li> <code>DocumentConverter</code> \u2013 </li> <li> <code>ConversionResult</code> \u2013 </li> <li> <code>ConversionStatus</code> \u2013 </li> <li> <code>FormatOption</code> \u2013 </li> <li> <code>InputFormat</code> \u2013 <p>A document format supported by document backend parsers.</p> </li> <li> <code>PdfFormatOption</code> \u2013 </li> <li> <code>ImageFormatOption</code> \u2013 </li> <li> <code>StandardPdfPipeline</code> \u2013 </li> <li> <code>WordFormatOption</code> \u2013 </li> <li> <code>PowerpointFormatOption</code> \u2013 </li> <li> 
<code>MarkdownFormatOption</code> \u2013 </li> <li> <code>AsciiDocFormatOption</code> \u2013 </li> <li> <code>HTMLFormatOption</code> \u2013 </li> <li> <code>SimplePipeline</code> \u2013 <p>SimpleModelPipeline.</p> </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter","title":"DocumentConverter","text":"<pre><code>DocumentConverter(allowed_formats: Optional[List[InputFormat]] = None, format_options: Optional[Dict[InputFormat, FormatOption]] = None)\n</code></pre> <p>Methods:</p> <ul> <li> <code>convert</code> \u2013 </li> <li> <code>convert_all</code> \u2013 </li> <li> <code>initialize_pipeline</code> \u2013 <p>Initialize the conversion pipeline for the selected format.</p> </li> </ul> <p>Attributes:</p> <ul> <li> <code>allowed_formats</code> \u2013 </li> <li> <code>format_to_options</code> \u2013 </li> <li> <code>initialized_pipelines</code> (<code>Dict[Tuple[Type[BasePipeline], str], BasePipeline]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.allowed_formats","title":"allowed_formats <code>instance-attribute</code>","text":"<pre><code>allowed_formats = allowed_formats if allowed_formats is not None else list(InputFormat)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.format_to_options","title":"format_to_options <code>instance-attribute</code>","text":"<pre><code>format_to_options = {format: _get_default_option(format=format) if (custom_option := get(format)) is None else custom_option for format in allowed_formats}\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.initialized_pipelines","title":"initialized_pipelines <code>instance-attribute</code>","text":"<pre><code>initialized_pipelines: Dict[Tuple[Type[BasePipeline], str], BasePipeline] = {}\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert","title":"convert","text":"<pre><code>convert(source: Union[Path, str, DocumentStream], headers: Optional[Dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> ConversionResult\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.convert_all","title":"convert_all","text":"<pre><code>convert_all(source: Iterable[Union[Path, str, DocumentStream]], headers: Optional[Dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> Iterator[ConversionResult]\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.DocumentConverter.initialize_pipeline","title":"initialize_pipeline","text":"<pre><code>initialize_pipeline(format: InputFormat)\n</code></pre> <p>Initialize the conversion pipeline for the selected format.</p>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult","title":"ConversionResult","text":"<p> Bases: <code>BaseModel</code></p> <p>Attributes:</p> <ul> <li> <code>assembled</code> (<code>AssembledUnit</code>) \u2013 </li> <li> <code>confidence</code> (<code>ConfidenceReport</code>) \u2013 </li> <li> <code>document</code> (<code>DoclingDocument</code>) \u2013 </li> <li> <code>errors</code> (<code>List[ErrorItem]</code>) \u2013 </li> <li> <code>input</code> 
(<code>InputDocument</code>) \u2013 </li> <li> <code>legacy_document</code> \u2013 </li> <li> <code>pages</code> (<code>List[Page]</code>) \u2013 </li> <li> <code>status</code> (<code>ConversionStatus</code>) \u2013 </li> <li> <code>timings</code> (<code>Dict[str, ProfilingItem]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.assembled","title":"assembled <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>assembled: AssembledUnit = AssembledUnit()\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.confidence","title":"confidence <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>confidence: ConfidenceReport = Field(default_factory=ConfidenceReport)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.document","title":"document <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>document: DoclingDocument = _EMPTY_DOCLING_DOC\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.errors","title":"errors <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>errors: List[ErrorItem] = []\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.input","title":"input <code>instance-attribute</code>","text":"<pre><code>input: InputDocument\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.legacy_document","title":"legacy_document <code>property</code>","text":"<pre><code>legacy_document\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.pages","title":"pages <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pages: List[Page] = []\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.status","title":"status <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>status: ConversionStatus = PENDING\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionResult.timings","title":"timings <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>timings: Dict[str, ProfilingItem] = {}\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus","title":"ConversionStatus","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Attributes:</p> <ul> <li> <code>FAILURE</code> \u2013 </li> <li> <code>PARTIAL_SUCCESS</code> \u2013 </li> <li> <code>PENDING</code> \u2013 </li> <li> <code>SKIPPED</code> \u2013 </li> <li> <code>STARTED</code> \u2013 </li> <li> <code>SUCCESS</code> \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.FAILURE","title":"FAILURE <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>FAILURE = 'failure'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.PARTIAL_SUCCESS","title":"PARTIAL_SUCCESS <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PARTIAL_SUCCESS = 'partial_success'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.PENDING","title":"PENDING 
<code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PENDING = 'pending'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.SKIPPED","title":"SKIPPED <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>SKIPPED = 'skipped'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.STARTED","title":"STARTED <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>STARTED = 'started'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ConversionStatus.SUCCESS","title":"SUCCESS <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>SUCCESS = 'success'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.FormatOption","title":"FormatOption","text":"<p> Bases: <code>BaseModel</code></p> <p>Methods:</p> <ul> <li> <code>set_optional_field_default</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>backend</code> (<code>Type[AbstractDocumentBackend]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>pipeline_cls</code> (<code>Type[BasePipeline]</code>) \u2013 </li> <li> <code>pipeline_options</code> (<code>Optional[PipelineOptions]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.backend","title":"backend <code>instance-attribute</code>","text":"<pre><code>backend: Type[AbstractDocumentBackend]\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(arbitrary_types_allowed=True)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.pipeline_cls","title":"pipeline_cls <code>instance-attribute</code>","text":"<pre><code>pipeline_cls: Type[BasePipeline]\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.pipeline_options","title":"pipeline_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_options: Optional[PipelineOptions] = None\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.FormatOption.set_optional_field_default","title":"set_optional_field_default","text":"<pre><code>set_optional_field_default() -> FormatOption\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat","title":"InputFormat","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>A document format supported by document backend parsers.</p> <p>Attributes:</p> <ul> <li> <code>ASCIIDOC</code> \u2013 </li> <li> <code>AUDIO</code> \u2013 </li> <li> <code>CSV</code> \u2013 </li> <li> <code>DOCX</code> \u2013 </li> <li> <code>HTML</code> \u2013 </li> <li> <code>IMAGE</code> \u2013 </li> <li> <code>JSON_DOCLING</code> \u2013 </li> <li> <code>MD</code> \u2013 </li> <li> <code>PDF</code> \u2013 </li> <li> <code>PPTX</code> \u2013 </li> <li> <code>XLSX</code> \u2013 </li> <li> <code>XML_JATS</code> \u2013 </li> <li> <code>XML_USPTO</code> \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.ASCIIDOC","title":"ASCIIDOC <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>ASCIIDOC = 
'asciidoc'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.AUDIO","title":"AUDIO <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>AUDIO = 'audio'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.CSV","title":"CSV <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>CSV = 'csv'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.DOCX","title":"DOCX <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>DOCX = 'docx'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.HTML","title":"HTML <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>HTML = 'html'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.IMAGE","title":"IMAGE <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>IMAGE = 'image'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.JSON_DOCLING","title":"JSON_DOCLING <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>JSON_DOCLING = 'json_docling'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.MD","title":"MD <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>MD = 'md'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.PDF","title":"PDF <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PDF = 'pdf'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.PPTX","title":"PPTX <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PPTX = 'pptx'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XLSX","title":"XLSX <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XLSX = 'xlsx'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XML_JATS","title":"XML_JATS <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML_JATS = 'xml_jats'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.InputFormat.XML_USPTO","title":"XML_USPTO <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>XML_USPTO = 'xml_uspto'\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption","title":"PdfFormatOption","text":"<p> Bases: <code>FormatOption</code></p> <p>Methods:</p> <ul> <li> <code>set_optional_field_default</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>backend</code> (<code>Type[AbstractDocumentBackend]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>pipeline_cls</code> (<code>Type</code>) \u2013 </li> <li> <code>pipeline_options</code> (<code>Optional[PipelineOptions]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.backend","title":"backend <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>backend: Type[AbstractDocumentBackend] = 
DoclingParseV4DocumentBackend\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(arbitrary_types_allowed=True)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.pipeline_cls","title":"pipeline_cls <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_cls: Type = StandardPdfPipeline\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.pipeline_options","title":"pipeline_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_options: Optional[PipelineOptions] = None\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PdfFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"<pre><code>set_optional_field_default() -> FormatOption\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption","title":"ImageFormatOption","text":"<p> Bases: <code>FormatOption</code></p> <p>Methods:</p> <ul> <li> <code>set_optional_field_default</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>backend</code> (<code>Type[AbstractDocumentBackend]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>pipeline_cls</code> (<code>Type</code>) \u2013 </li> <li> <code>pipeline_options</code> (<code>Optional[PipelineOptions]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.backend","title":"backend <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(arbitrary_types_allowed=True)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.pipeline_cls","title":"pipeline_cls <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_cls: Type = StandardPdfPipeline\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.pipeline_options","title":"pipeline_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_options: Optional[PipelineOptions] = None\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.ImageFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"<pre><code>set_optional_field_default() -> FormatOption\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline","title":"StandardPdfPipeline","text":"<pre><code>StandardPdfPipeline(pipeline_options: PdfPipelineOptions)\n</code></pre> <p> Bases: <code>PaginatedPipeline</code></p> <p>Methods:</p> <ul> <li> <code>download_models_hf</code> \u2013 </li> <li> <code>execute</code> \u2013 </li> <li> <code>get_default_options</code> \u2013 </li> <li> <code>get_ocr_model</code> \u2013 </li> <li> <code>get_picture_description_model</code> \u2013 </li> <li> 
<code>initialize_page</code> \u2013 </li> <li> <code>is_backend_supported</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>build_pipe</code> \u2013 </li> <li> <code>enrichment_pipe</code> \u2013 </li> <li> <code>keep_backend</code> \u2013 </li> <li> <code>keep_images</code> \u2013 </li> <li> <code>pipeline_options</code> (<code>PdfPipelineOptions</code>) \u2013 </li> <li> <code>reading_order_model</code> \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.build_pipe","title":"build_pipe <code>instance-attribute</code>","text":"<pre><code>build_pipe = [PagePreprocessingModel(options=PagePreprocessingOptions(images_scale=images_scale)), ocr_model, LayoutModel(artifacts_path=artifacts_path, accelerator_options=accelerator_options, options=layout_options), TableStructureModel(enabled=do_table_structure, artifacts_path=artifacts_path, options=table_structure_options, accelerator_options=accelerator_options), PageAssembleModel(options=PageAssembleOptions())]\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.enrichment_pipe","title":"enrichment_pipe <code>instance-attribute</code>","text":"<pre><code>enrichment_pipe = [CodeFormulaModel(enabled=do_code_enrichment or do_formula_enrichment, artifacts_path=artifacts_path, options=CodeFormulaModelOptions(do_code_enrichment=do_code_enrichment, do_formula_enrichment=do_formula_enrichment), accelerator_options=accelerator_options), DocumentPictureClassifier(enabled=do_picture_classification, artifacts_path=artifacts_path, options=DocumentPictureClassifierOptions(), accelerator_options=accelerator_options), picture_description_model]\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.keep_backend","title":"keep_backend <code>instance-attribute</code>","text":"<pre><code>keep_backend = True\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.keep_images","title":"keep_images <code>instance-attribute</code>","text":"<pre><code>keep_images = generate_page_images or generate_picture_images or generate_table_images\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.pipeline_options","title":"pipeline_options <code>instance-attribute</code>","text":"<pre><code>pipeline_options: PdfPipelineOptions\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.reading_order_model","title":"reading_order_model <code>instance-attribute</code>","text":"<pre><code>reading_order_model = ReadingOrderModel(options=ReadingOrderOptions())\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.download_models_hf","title":"download_models_hf <code>staticmethod</code>","text":"<pre><code>download_models_hf(local_dir: Optional[Path] = None, force: bool = False) -> Path\n</code></pre>
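 <p>A hedged sketch of pre-fetching the pipeline model artifacts for offline or air-gapped use (the target directory is a placeholder; the returned path is then passed as <code>artifacts_path</code> in the pipeline options):</p> <pre><code>from pathlib import Path\n\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n\n# One-off download of the default model artifacts\nartifacts = StandardPdfPipeline.download_models_hf(local_dir=Path('./models'), force=False)\n\n# Later, run fully locally by pointing the pipeline at the artifacts\npipeline_options = PdfPipelineOptions(artifacts_path=artifacts)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.execute","title":"execute","text":"<pre><code>execute(in_doc: InputDocument, raises_on_error: bool) -> ConversionResult\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.get_default_options","title":"get_default_options <code>classmethod</code>","text":"<pre><code>get_default_options() -> 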
PdfPipelineOptions\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.get_ocr_model","title":"get_ocr_model","text":"<pre><code>get_ocr_model(artifacts_path: Optional[Path] = None) -> BaseOcrModel\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.get_picture_description_model","title":"get_picture_description_model","text":"<pre><code>get_picture_description_model(artifacts_path: Optional[Path] = None) -> Optional[PictureDescriptionBaseModel]\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.initialize_page","title":"initialize_page","text":"<pre><code>initialize_page(conv_res: ConversionResult, page: Page) -> Page\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.StandardPdfPipeline.is_backend_supported","title":"is_backend_supported <code>classmethod</code>","text":"<pre><code>is_backend_supported(backend: AbstractDocumentBackend)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption","title":"WordFormatOption","text":"<p> Bases: <code>FormatOption</code></p> <p>Methods:</p> <ul> <li> <code>set_optional_field_default</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>backend</code> (<code>Type[AbstractDocumentBackend]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>pipeline_cls</code> (<code>Type</code>) \u2013 </li> <li> <code>pipeline_options</code> (<code>Optional[PipelineOptions]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.backend","title":"backend <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>backend: Type[AbstractDocumentBackend] = MsWordDocumentBackend\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(arbitrary_types_allowed=True)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.pipeline_cls","title":"pipeline_cls <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_cls: Type = SimplePipeline\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.pipeline_options","title":"pipeline_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_options: Optional[PipelineOptions] = None\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.WordFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"<pre><code>set_optional_field_default() -> FormatOption\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption","title":"PowerpointFormatOption","text":"<p> Bases: <code>FormatOption</code></p> <p>Methods:</p> <ul> <li> <code>set_optional_field_default</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>backend</code> (<code>Type[AbstractDocumentBackend]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>pipeline_cls</code> (<code>Type</code>) \u2013 </li> <li> <code>pipeline_options</code> (<code>Optional[PipelineOptions]</code>) \u2013 </li> 
</ul>"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.backend","title":"backend <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>backend: Type[AbstractDocumentBackend] = MsPowerpointDocumentBackend\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(arbitrary_types_allowed=True)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.pipeline_cls","title":"pipeline_cls <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_cls: Type = SimplePipeline\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.pipeline_options","title":"pipeline_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_options: Optional[PipelineOptions] = None\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.PowerpointFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"<pre><code>set_optional_field_default() -> FormatOption\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption","title":"MarkdownFormatOption","text":"<p> Bases: <code>FormatOption</code></p> <p>Methods:</p> <ul> <li> <code>set_optional_field_default</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>backend</code> (<code>Type[AbstractDocumentBackend]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>pipeline_cls</code> (<code>Type</code>) \u2013 </li> <li> <code>pipeline_options</code> (<code>Optional[PipelineOptions]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.backend","title":"backend <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>backend: Type[AbstractDocumentBackend] = MarkdownDocumentBackend\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(arbitrary_types_allowed=True)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.pipeline_cls","title":"pipeline_cls <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_cls: Type = SimplePipeline\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.pipeline_options","title":"pipeline_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_options: Optional[PipelineOptions] = None\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.MarkdownFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"<pre><code>set_optional_field_default() -> FormatOption\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption","title":"AsciiDocFormatOption","text":"<p> Bases: <code>FormatOption</code></p> <p>Methods:</p> <ul> <li> <code>set_optional_field_default</code> \u2013 </li> </ul> <p>Attributes:</p> 
<ul> <li> <code>backend</code> (<code>Type[AbstractDocumentBackend]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>pipeline_cls</code> (<code>Type</code>) \u2013 </li> <li> <code>pipeline_options</code> (<code>Optional[PipelineOptions]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.backend","title":"backend <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>backend: Type[AbstractDocumentBackend] = AsciiDocBackend\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(arbitrary_types_allowed=True)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.pipeline_cls","title":"pipeline_cls <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_cls: Type = SimplePipeline\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.pipeline_options","title":"pipeline_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_options: Optional[PipelineOptions] = None\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.AsciiDocFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"<pre><code>set_optional_field_default() -> FormatOption\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption","title":"HTMLFormatOption","text":"<p> Bases: <code>FormatOption</code></p> <p>Methods:</p> <ul> <li> <code>set_optional_field_default</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>backend</code> (<code>Type[AbstractDocumentBackend]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>pipeline_cls</code> (<code>Type</code>) \u2013 </li> <li> <code>pipeline_options</code> (<code>Optional[PipelineOptions]</code>) \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.backend","title":"backend <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>backend: Type[AbstractDocumentBackend] = HTMLDocumentBackend\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(arbitrary_types_allowed=True)\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.pipeline_cls","title":"pipeline_cls <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_cls: Type = SimplePipeline\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.pipeline_options","title":"pipeline_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>pipeline_options: Optional[PipelineOptions] = None\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.HTMLFormatOption.set_optional_field_default","title":"set_optional_field_default","text":"<pre><code>set_optional_field_default() -> 
FormatOption\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline","title":"SimplePipeline","text":"<pre><code>SimplePipeline(pipeline_options: PipelineOptions)\n</code></pre> <p> Bases: <code>BasePipeline</code></p> <p>SimpleModelPipeline.</p> <p>This class is used at the moment for formats / backends which directly produce DoclingDocument output.</p> <p>Methods:</p> <ul> <li> <code>execute</code> \u2013 </li> <li> <code>get_default_options</code> \u2013 </li> <li> <code>is_backend_supported</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>build_pipe</code> (<code>List[Callable]</code>) \u2013 </li> <li> <code>enrichment_pipe</code> (<code>List[GenericEnrichmentModel[Any]]</code>) \u2013 </li> <li> <code>keep_images</code> \u2013 </li> <li> <code>pipeline_options</code> \u2013 </li> </ul>"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.build_pipe","title":"build_pipe <code>instance-attribute</code>","text":"<pre><code>build_pipe: List[Callable] = []\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.enrichment_pipe","title":"enrichment_pipe <code>instance-attribute</code>","text":"<pre><code>enrichment_pipe: List[GenericEnrichmentModel[Any]] = []\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.keep_images","title":"keep_images <code>instance-attribute</code>","text":"<pre><code>keep_images = False\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.pipeline_options","title":"pipeline_options <code>instance-attribute</code>","text":"<pre><code>pipeline_options = pipeline_options\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.execute","title":"execute","text":"<pre><code>execute(in_doc: InputDocument, raises_on_error: bool) -> ConversionResult\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.get_default_options","title":"get_default_options <code>classmethod</code>","text":"<pre><code>get_default_options() -> PipelineOptions\n</code></pre>"},{"location":"reference/document_converter/#docling.document_converter.SimplePipeline.is_backend_supported","title":"is_backend_supported <code>classmethod</code>","text":"<pre><code>is_backend_supported(backend: AbstractDocumentBackend)\n</code></pre>"},{"location":"reference/pipeline_options/","title":"Pipeline options","text":"<p>Pipeline options allow you to customize the execution of the models during the conversion pipeline. 
This includes options for the OCR engines, the table model, as well as enrichment options which can be enabled with <code>do_xyz = True</code>.</p> <p>This is an automatically generated API reference of all the pipeline options available in Docling.</p>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options","title":"pipeline_options","text":"<p>Classes:</p> <ul> <li> <code>AsrPipelineOptions</code> \u2013 </li> <li> <code>BaseOptions</code> \u2013 <p>Base class for options.</p> </li> <li> <code>EasyOcrOptions</code> \u2013 <p>Options for the EasyOCR engine.</p> </li> <li> <code>LayoutOptions</code> \u2013 <p>Options for layout processing.</p> </li> <li> <code>OcrEngine</code> \u2013 <p>Enum of valid OCR engines.</p> </li> <li> <code>OcrMacOptions</code> \u2013 <p>Options for the Mac OCR engine.</p> </li> <li> <code>OcrOptions</code> \u2013 <p>OCR options.</p> </li> <li> <code>PaginatedPipelineOptions</code> \u2013 </li> <li> <code>PdfBackend</code> \u2013 <p>Enum of valid PDF backends.</p> </li> <li> <code>PdfPipelineOptions</code> \u2013 <p>Options for the PDF pipeline.</p> </li> <li> <code>PictureDescriptionApiOptions</code> \u2013 </li> <li> <code>PictureDescriptionBaseOptions</code> \u2013 </li> <li> <code>PictureDescriptionVlmOptions</code> \u2013 </li> <li> <code>PipelineOptions</code> \u2013 <p>Base pipeline options.</p> </li> <li> <code>ProcessingPipeline</code> \u2013 </li> <li> <code>RapidOcrOptions</code> \u2013 <p>Options for the RapidOCR engine.</p> </li> <li> <code>TableFormerMode</code> \u2013 <p>Modes for the TableFormer model.</p> </li> <li> <code>TableStructureOptions</code> \u2013 <p>Options for the table structure.</p> </li> <li> <code>TesseractCliOcrOptions</code> \u2013 <p>Options for the TesseractCli engine.</p> </li> <li> <code>TesseractOcrOptions</code> \u2013 <p>Options for the Tesseract engine.</p> </li> <li> <code>VlmPipelineOptions</code> \u2013 </li> </ul> <p>Attributes:</p> <ul> <li> <code>granite_picture_description</code> \u2013 </li> <li> <code>smolvlm_picture_description</code> \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.granite_picture_description","title":"granite_picture_description <code>module-attribute</code>","text":"<pre><code>granite_picture_description = PictureDescriptionVlmOptions(repo_id='ibm-granite/granite-vision-3.3-2b', prompt='What is shown in this image?')\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.smolvlm_picture_description","title":"smolvlm_picture_description <code>module-attribute</code>","text":"<pre><code>smolvlm_picture_description = PictureDescriptionVlmOptions(repo_id='HuggingFaceTB/SmolVLM-256M-Instruct')\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions","title":"AsrPipelineOptions","text":"<p> Bases: <code>PipelineOptions</code></p> <p>Attributes:</p> <ul> <li> <code>accelerator_options</code> (<code>AcceleratorOptions</code>) \u2013 </li> <li> <code>allow_external_plugins</code> (<code>bool</code>) \u2013 </li> <li> <code>artifacts_path</code> (<code>Optional[Union[Path, str]]</code>) \u2013 </li> <li> <code>asr_options</code> (<code>Union[InlineAsrOptions]</code>) \u2013 </li> <li> <code>create_legacy_output</code> (<code>bool</code>) \u2013 </li> <li> <code>document_timeout</code> (<code>Optional[float]</code>) \u2013 </li> <li> <code>enable_remote_services</code> (<code>bool</code>) \u2013 </li> 
</ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.accelerator_options","title":"accelerator_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>accelerator_options: AcceleratorOptions = AcceleratorOptions()\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.allow_external_plugins","title":"allow_external_plugins <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>allow_external_plugins: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.artifacts_path","title":"artifacts_path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>artifacts_path: Optional[Union[Path, str]] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.asr_options","title":"asr_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>asr_options: Union[InlineAsrOptions] = WHISPER_TINY\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.create_legacy_output","title":"create_legacy_output <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>create_legacy_output: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.document_timeout","title":"document_timeout <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>document_timeout: Optional[float] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.AsrPipelineOptions.enable_remote_services","title":"enable_remote_services <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>enable_remote_services: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseOptions","title":"BaseOptions","text":"<p> Bases: <code>BaseModel</code></p> <p>Base class for options.</p> <p>Attributes:</p> <ul> <li> <code>kind</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.BaseOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: str\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions","title":"EasyOcrOptions","text":"<p> Bases: <code>OcrOptions</code></p> <p>Options for the EasyOCR engine.</p> <p>Attributes:</p> <ul> <li> <code>bitmap_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>confidence_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>download_enabled</code> (<code>bool</code>) \u2013 </li> <li> <code>force_full_page_ocr</code> (<code>bool</code>) \u2013 </li> <li> <code>kind</code> (<code>Literal['easyocr']</code>) \u2013 </li> <li> <code>lang</code> (<code>List[str]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>model_storage_directory</code> (<code>Optional[str]</code>) \u2013 </li> <li> <code>recog_network</code> (<code>Optional[str]</code>) \u2013 </li> <li> <code>use_gpu</code> (<code>Optional[bool]</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.bitmap_area_threshold","title":"bitmap_area_threshold 
<code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>bitmap_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.confidence_threshold","title":"confidence_threshold <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>confidence_threshold: float = 0.5\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.download_enabled","title":"download_enabled <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>download_enabled: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.force_full_page_ocr","title":"force_full_page_ocr <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>force_full_page_ocr: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: Literal['easyocr'] = 'easyocr'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.lang","title":"lang <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>lang: List[str] = ['fr', 'de', 'es', 'en']\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(extra='forbid', protected_namespaces=())\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.model_storage_directory","title":"model_storage_directory <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_storage_directory: Optional[str] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.recog_network","title":"recog_network <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>recog_network: Optional[str] = 'standard'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions.use_gpu","title":"use_gpu <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>use_gpu: Optional[bool] = None\n</code></pre>
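 <p>A brief sketch of customizing the OCR engine through these options (the language codes are illustrative; <code>ocr_options</code> is the corresponding field on <code>PdfPipelineOptions</code>):</p> <pre><code>from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions()\npipeline_options.ocr_options = EasyOcrOptions(\n    lang=['en'],               # restrict recognition to English\n    force_full_page_ocr=True,  # OCR the whole page, not only bitmap areas\n)\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions","title":"LayoutOptions","text":"<p> Bases: <code>BaseModel</code></p> <p>Options for layout processing.</p> <p>Attributes:</p> <ul> <li> <code>create_orphan_clusters</code> (<code>bool</code>) \u2013 </li> <li> <code>keep_empty_clusters</code> (<code>bool</code>) \u2013 </li> <li> <code>model_spec</code> (<code>LayoutModelConfig</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.create_orphan_clusters","title":"create_orphan_clusters <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>create_orphan_clusters: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.keep_empty_clusters","title":"keep_empty_clusters <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>keep_empty_clusters: bool = 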
False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.LayoutOptions.model_spec","title":"model_spec <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_spec: LayoutModelConfig = DOCLING_LAYOUT_V2\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine","title":"OcrEngine","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Enum of valid OCR engines.</p> <p>Attributes:</p> <ul> <li> <code>EASYOCR</code> \u2013 </li> <li> <code>OCRMAC</code> \u2013 </li> <li> <code>RAPIDOCR</code> \u2013 </li> <li> <code>TESSERACT</code> \u2013 </li> <li> <code>TESSERACT_CLI</code> \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.EASYOCR","title":"EASYOCR <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>EASYOCR = 'easyocr'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.OCRMAC","title":"OCRMAC <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>OCRMAC = 'ocrmac'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.RAPIDOCR","title":"RAPIDOCR <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>RAPIDOCR = 'rapidocr'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.TESSERACT","title":"TESSERACT <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>TESSERACT = 'tesseract'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrEngine.TESSERACT_CLI","title":"TESSERACT_CLI <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>TESSERACT_CLI = 'tesseract_cli'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions","title":"OcrMacOptions","text":"<p> Bases: <code>OcrOptions</code></p> <p>Options for the Mac OCR engine.</p> <p>Attributes:</p> <ul> <li> <code>bitmap_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>force_full_page_ocr</code> (<code>bool</code>) \u2013 </li> <li> <code>framework</code> (<code>str</code>) \u2013 </li> <li> <code>kind</code> (<code>Literal['ocrmac']</code>) \u2013 </li> <li> <code>lang</code> (<code>List[str]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>recognition</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.bitmap_area_threshold","title":"bitmap_area_threshold <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>bitmap_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.force_full_page_ocr","title":"force_full_page_ocr <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>force_full_page_ocr: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.framework","title":"framework <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>framework: str = 'vision'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: 
Literal['ocrmac'] = 'ocrmac'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.lang","title":"lang <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>lang: List[str] = ['fr-FR', 'de-DE', 'es-ES', 'en-US']\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrMacOptions.recognition","title":"recognition <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>recognition: str = 'accurate'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions","title":"OcrOptions","text":"<p> Bases: <code>BaseOptions</code></p> <p>OCR options.</p> <p>Attributes:</p> <ul> <li> <code>bitmap_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>force_full_page_ocr</code> (<code>bool</code>) \u2013 </li> <li> <code>kind</code> (<code>str</code>) \u2013 </li> <li> <code>lang</code> (<code>List[str]</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.bitmap_area_threshold","title":"bitmap_area_threshold <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>bitmap_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.force_full_page_ocr","title":"force_full_page_ocr <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>force_full_page_ocr: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: str\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.OcrOptions.lang","title":"lang <code>instance-attribute</code>","text":"<pre><code>lang: List[str]\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions","title":"PaginatedPipelineOptions","text":"<p> Bases: <code>PipelineOptions</code></p> <p>Attributes:</p> <ul> <li> <code>accelerator_options</code> (<code>AcceleratorOptions</code>) \u2013 </li> <li> <code>allow_external_plugins</code> (<code>bool</code>) \u2013 </li> <li> <code>artifacts_path</code> (<code>Optional[Union[Path, str]]</code>) \u2013 </li> <li> <code>create_legacy_output</code> (<code>bool</code>) \u2013 </li> <li> <code>document_timeout</code> (<code>Optional[float]</code>) \u2013 </li> <li> <code>enable_remote_services</code> (<code>bool</code>) \u2013 </li> <li> <code>generate_page_images</code> (<code>bool</code>) \u2013 </li> <li> <code>generate_picture_images</code> (<code>bool</code>) \u2013 </li> <li> <code>images_scale</code> (<code>float</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.accelerator_options","title":"accelerator_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>accelerator_options: AcceleratorOptions = 
AcceleratorOptions()\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.allow_external_plugins","title":"allow_external_plugins <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>allow_external_plugins: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.artifacts_path","title":"artifacts_path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>artifacts_path: Optional[Union[Path, str]] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.create_legacy_output","title":"create_legacy_output <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>create_legacy_output: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.document_timeout","title":"document_timeout <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>document_timeout: Optional[float] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.enable_remote_services","title":"enable_remote_services <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>enable_remote_services: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.generate_page_images","title":"generate_page_images <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generate_page_images: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.generate_picture_images","title":"generate_picture_images <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generate_picture_images: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PaginatedPipelineOptions.images_scale","title":"images_scale <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>images_scale: float = 1.0\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend","title":"PdfBackend","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Enum of valid PDF backends.</p> <p>Attributes:</p> <ul> <li> <code>DLPARSE_V1</code> \u2013 </li> <li> <code>DLPARSE_V2</code> \u2013 </li> <li> <code>DLPARSE_V4</code> \u2013 </li> <li> <code>PYPDFIUM2</code> \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V1","title":"DLPARSE_V1 <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>DLPARSE_V1 = 'dlparse_v1'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V2","title":"DLPARSE_V2 <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>DLPARSE_V2 = 'dlparse_v2'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.DLPARSE_V4","title":"DLPARSE_V4 <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>DLPARSE_V4 = 
'dlparse_v4'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfBackend.PYPDFIUM2","title":"PYPDFIUM2 <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>PYPDFIUM2 = 'pypdfium2'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions","title":"PdfPipelineOptions","text":"<p> Bases: <code>PaginatedPipelineOptions</code></p> <p>Options for the PDF pipeline.</p> <p>Attributes:</p> <ul> <li> <code>accelerator_options</code> (<code>AcceleratorOptions</code>) \u2013 </li> <li> <code>allow_external_plugins</code> (<code>bool</code>) \u2013 </li> <li> <code>artifacts_path</code> (<code>Optional[Union[Path, str]]</code>) \u2013 </li> <li> <code>create_legacy_output</code> (<code>bool</code>) \u2013 </li> <li> <code>do_code_enrichment</code> (<code>bool</code>) \u2013 </li> <li> <code>do_formula_enrichment</code> (<code>bool</code>) \u2013 </li> <li> <code>do_ocr</code> (<code>bool</code>) \u2013 </li> <li> <code>do_picture_classification</code> (<code>bool</code>) \u2013 </li> <li> <code>do_picture_description</code> (<code>bool</code>) \u2013 </li> <li> <code>do_table_structure</code> (<code>bool</code>) \u2013 </li> <li> <code>document_timeout</code> (<code>Optional[float]</code>) \u2013 </li> <li> <code>enable_remote_services</code> (<code>bool</code>) \u2013 </li> <li> <code>force_backend_text</code> (<code>bool</code>) \u2013 </li> <li> <code>generate_page_images</code> (<code>bool</code>) \u2013 </li> <li> <code>generate_parsed_pages</code> (<code>Literal[True]</code>) \u2013 </li> <li> <code>generate_picture_images</code> (<code>bool</code>) \u2013 </li> <li> <code>generate_table_images</code> (<code>bool</code>) \u2013 </li> <li> <code>images_scale</code> (<code>float</code>) \u2013 </li> <li> <code>layout_options</code> (<code>LayoutOptions</code>) \u2013 </li> <li> <code>ocr_options</code> (<code>OcrOptions</code>) \u2013 </li> <li> <code>picture_description_options</code> (<code>PictureDescriptionBaseOptions</code>) \u2013 </li> <li> <code>table_structure_options</code> (<code>TableStructureOptions</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.accelerator_options","title":"accelerator_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>accelerator_options: AcceleratorOptions = AcceleratorOptions()\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.allow_external_plugins","title":"allow_external_plugins <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>allow_external_plugins: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.artifacts_path","title":"artifacts_path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>artifacts_path: Optional[Union[Path, str]] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.create_legacy_output","title":"create_legacy_output <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>create_legacy_output: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_code_enrichment","title":"do_code_enrichment <code>class-attribute</code> 
<code>instance-attribute</code>","text":"<pre><code>do_code_enrichment: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_formula_enrichment","title":"do_formula_enrichment <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>do_formula_enrichment: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_ocr","title":"do_ocr <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>do_ocr: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_picture_classification","title":"do_picture_classification <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>do_picture_classification: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_picture_description","title":"do_picture_description <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>do_picture_description: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.do_table_structure","title":"do_table_structure <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>do_table_structure: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.document_timeout","title":"document_timeout <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>document_timeout: Optional[float] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.enable_remote_services","title":"enable_remote_services <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>enable_remote_services: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.force_backend_text","title":"force_backend_text <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>force_backend_text: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_page_images","title":"generate_page_images <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generate_page_images: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_parsed_pages","title":"generate_parsed_pages <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generate_parsed_pages: Literal[True] = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_picture_images","title":"generate_picture_images <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generate_picture_images: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.generate_table_images","title":"generate_table_images <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generate_table_images: bool = Field(default=False, deprecated='Field `generate_table_images` is deprecated. 
To obtain table images, set `PdfPipelineOptions.generate_page_images = True` before conversion and then use the `TableItem.get_image` function.')\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.images_scale","title":"images_scale <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>images_scale: float = 1.0\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.layout_options","title":"layout_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>layout_options: LayoutOptions = LayoutOptions()\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.ocr_options","title":"ocr_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>ocr_options: OcrOptions = EasyOcrOptions()\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.picture_description_options","title":"picture_description_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>picture_description_options: PictureDescriptionBaseOptions = smolvlm_picture_description\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions.table_structure_options","title":"table_structure_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>table_structure_options: TableStructureOptions = TableStructureOptions()\n</code></pre>
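 <p>A hedged sketch of combining several of the options above on a <code>PdfPipelineOptions</code> instance (the TableFormer mode line assumes <code>TableFormerMode</code> from the same module, listed above):</p> <pre><code>from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_formula_enrichment = True   # enrich formulas\npipeline_options.do_picture_classification = True\npipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE\n\nconverter = DocumentConverter(\n    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}\n)\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions","title":"PictureDescriptionApiOptions","text":"<p> Bases: <code>PictureDescriptionBaseOptions</code></p> <p>Attributes:</p> <ul> <li> <code>batch_size</code> (<code>int</code>) \u2013 </li> <li> <code>concurrency</code> (<code>int</code>) \u2013 </li> <li> <code>headers</code> (<code>Dict[str, str]</code>) \u2013 </li> <li> <code>kind</code> (<code>Literal['api']</code>) \u2013 </li> <li> <code>params</code> (<code>Dict[str, Any]</code>) \u2013 </li> <li> <code>picture_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>prompt</code> (<code>str</code>) \u2013 </li> <li> <code>provenance</code> (<code>str</code>) \u2013 </li> <li> <code>scale</code> (<code>float</code>) \u2013 </li> <li> <code>timeout</code> (<code>float</code>) \u2013 </li> <li> <code>url</code> (<code>AnyUrl</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.batch_size","title":"batch_size <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>batch_size: int = 8\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.concurrency","title":"concurrency <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>concurrency: int = 1\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.headers","title":"headers <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>headers: Dict[str, str] = {}\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: Literal['api'] = 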
'api'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.params","title":"params <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>params: Dict[str, Any] = {}\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.picture_area_threshold","title":"picture_area_threshold <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>picture_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.prompt","title":"prompt <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>prompt: str = 'Describe this image in a few sentences.'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.provenance","title":"provenance <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>provenance: str = ''\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.scale","title":"scale <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>scale: float = 2\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.timeout","title":"timeout <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>timeout: float = 20\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionApiOptions.url","title":"url <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>url: AnyUrl = AnyUrl('http://localhost:8000/v1/chat/completions')\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions","title":"PictureDescriptionBaseOptions","text":"<p> Bases: <code>BaseOptions</code></p> <p>Attributes:</p> <ul> <li> <code>batch_size</code> (<code>int</code>) \u2013 </li> <li> <code>kind</code> (<code>str</code>) \u2013 </li> <li> <code>picture_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>scale</code> (<code>float</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.batch_size","title":"batch_size <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>batch_size: int = 8\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: str\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.picture_area_threshold","title":"picture_area_threshold <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>picture_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionBaseOptions.scale","title":"scale <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>scale: float = 2\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions","title":"PictureDescriptionVlmOptions","text":"<p> Bases: 
<code>PictureDescriptionBaseOptions</code></p> <p>Attributes:</p> <ul> <li> <code>batch_size</code> (<code>int</code>) \u2013 </li> <li> <code>generation_config</code> (<code>Dict[str, Any]</code>) \u2013 </li> <li> <code>kind</code> (<code>Literal['vlm']</code>) \u2013 </li> <li> <code>picture_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>prompt</code> (<code>str</code>) \u2013 </li> <li> <code>repo_cache_folder</code> (<code>str</code>) \u2013 </li> <li> <code>repo_id</code> (<code>str</code>) \u2013 </li> <li> <code>scale</code> (<code>float</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.batch_size","title":"batch_size <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>batch_size: int = 8\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.generation_config","title":"generation_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generation_config: Dict[str, Any] = dict(max_new_tokens=200, do_sample=False)\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: Literal['vlm'] = 'vlm'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.picture_area_threshold","title":"picture_area_threshold <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>picture_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.prompt","title":"prompt <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>prompt: str = 'Describe this image in a few sentences.'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.repo_cache_folder","title":"repo_cache_folder <code>property</code>","text":"<pre><code>repo_cache_folder: str\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.repo_id","title":"repo_id <code>instance-attribute</code>","text":"<pre><code>repo_id: str\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PictureDescriptionVlmOptions.scale","title":"scale <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>scale: float = 2\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions","title":"PipelineOptions","text":"<p> Bases: <code>BaseModel</code></p> <p>Base pipeline options.</p> <p>Attributes:</p> <ul> <li> <code>accelerator_options</code> (<code>AcceleratorOptions</code>) \u2013 </li> <li> <code>allow_external_plugins</code> (<code>bool</code>) \u2013 </li> <li> <code>create_legacy_output</code> (<code>bool</code>) \u2013 </li> <li> <code>document_timeout</code> (<code>Optional[float]</code>) \u2013 </li> <li> <code>enable_remote_services</code> (<code>bool</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.accelerator_options","title":"accelerator_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>accelerator_options: AcceleratorOptions = 
AcceleratorOptions()\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.allow_external_plugins","title":"allow_external_plugins <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>allow_external_plugins: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.create_legacy_output","title":"create_legacy_output <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>create_legacy_output: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.document_timeout","title":"document_timeout <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>document_timeout: Optional[float] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.PipelineOptions.enable_remote_services","title":"enable_remote_services <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>enable_remote_services: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline","title":"ProcessingPipeline","text":"<p> Bases: <code>str</code>, <code>Enum</code></p> <p>Attributes:</p> <ul> <li> <code>ASR</code> \u2013 </li> <li> <code>STANDARD</code> \u2013 </li> <li> <code>VLM</code> \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.ASR","title":"ASR <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>ASR = 'asr'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.STANDARD","title":"STANDARD <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>STANDARD = 'standard'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.ProcessingPipeline.VLM","title":"VLM <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>VLM = 'vlm'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions","title":"RapidOcrOptions","text":"<p> Bases: <code>OcrOptions</code></p> <p>Options for the RapidOCR engine.</p> <p>Attributes:</p> <ul> <li> <code>bitmap_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>cls_model_path</code> (<code>Optional[str]</code>) \u2013 </li> <li> <code>det_model_path</code> (<code>Optional[str]</code>) \u2013 </li> <li> <code>force_full_page_ocr</code> (<code>bool</code>) \u2013 </li> <li> <code>kind</code> (<code>Literal['rapidocr']</code>) \u2013 </li> <li> <code>lang</code> (<code>List[str]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>print_verbose</code> (<code>bool</code>) \u2013 </li> <li> <code>rec_keys_path</code> (<code>Optional[str]</code>) \u2013 </li> <li> <code>rec_model_path</code> (<code>Optional[str]</code>) \u2013 </li> <li> <code>text_score</code> (<code>float</code>) \u2013 </li> <li> <code>use_cls</code> (<code>Optional[bool]</code>) \u2013 </li> <li> <code>use_det</code> (<code>Optional[bool]</code>) \u2013 </li> <li> <code>use_rec</code> (<code>Optional[bool]</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.bitmap_area_threshold","title":"bitmap_area_threshold <code>class-attribute</code> 
<code>instance-attribute</code>","text":"<pre><code>bitmap_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.cls_model_path","title":"cls_model_path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>cls_model_path: Optional[str] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.det_model_path","title":"det_model_path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>det_model_path: Optional[str] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.force_full_page_ocr","title":"force_full_page_ocr <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>force_full_page_ocr: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: Literal['rapidocr'] = 'rapidocr'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.lang","title":"lang <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>lang: List[str] = ['english', 'chinese']\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.print_verbose","title":"print_verbose <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>print_verbose: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_keys_path","title":"rec_keys_path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>rec_keys_path: Optional[str] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.rec_model_path","title":"rec_model_path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>rec_model_path: Optional[str] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.text_score","title":"text_score <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>text_score: float = 0.5\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_cls","title":"use_cls <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>use_cls: Optional[bool] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_det","title":"use_det <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>use_det: Optional[bool] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.RapidOcrOptions.use_rec","title":"use_rec <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>use_rec: Optional[bool] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode","title":"TableFormerMode","text":"<p> 
Bases: <code>str</code>, <code>Enum</code></p> <p>Modes for the TableFormer model.</p> <p>Attributes:</p> <ul> <li> <code>ACCURATE</code> \u2013 </li> <li> <code>FAST</code> \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode.ACCURATE","title":"ACCURATE <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>ACCURATE = 'accurate'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableFormerMode.FAST","title":"FAST <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>FAST = 'fast'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions","title":"TableStructureOptions","text":"<p> Bases: <code>BaseModel</code></p> <p>Options for the table structure.</p> <p>Attributes:</p> <ul> <li> <code>do_cell_matching</code> (<code>bool</code>) \u2013 </li> <li> <code>mode</code> (<code>TableFormerMode</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions.do_cell_matching","title":"do_cell_matching <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>do_cell_matching: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TableStructureOptions.mode","title":"mode <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>mode: TableFormerMode = ACCURATE\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions","title":"TesseractCliOcrOptions","text":"<p> Bases: <code>OcrOptions</code></p> <p>Options for the TesseractCli engine.</p> <p>Attributes:</p> <ul> <li> <code>bitmap_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>force_full_page_ocr</code> (<code>bool</code>) \u2013 </li> <li> <code>kind</code> (<code>Literal['tesseract']</code>) \u2013 </li> <li> <code>lang</code> (<code>List[str]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>path</code> (<code>Optional[str]</code>) \u2013 </li> <li> <code>tesseract_cmd</code> (<code>str</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.bitmap_area_threshold","title":"bitmap_area_threshold <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>bitmap_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.force_full_page_ocr","title":"force_full_page_ocr <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>force_full_page_ocr: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: Literal['tesseract'] = 'tesseract'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.lang","title":"lang <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>lang: List[str] = ['fra', 'deu', 'spa', 'eng']\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.model_config","title":"model_config <code>class-attribute</code> 
<code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.path","title":"path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>path: Optional[str] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractCliOcrOptions.tesseract_cmd","title":"tesseract_cmd <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>tesseract_cmd: str = 'tesseract'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions","title":"TesseractOcrOptions","text":"<p> Bases: <code>OcrOptions</code></p> <p>Options for the Tesseract engine.</p> <p>Attributes:</p> <ul> <li> <code>bitmap_area_threshold</code> (<code>float</code>) \u2013 </li> <li> <code>force_full_page_ocr</code> (<code>bool</code>) \u2013 </li> <li> <code>kind</code> (<code>Literal['tesserocr']</code>) \u2013 </li> <li> <code>lang</code> (<code>List[str]</code>) \u2013 </li> <li> <code>model_config</code> \u2013 </li> <li> <code>path</code> (<code>Optional[str]</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.bitmap_area_threshold","title":"bitmap_area_threshold <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>bitmap_area_threshold: float = 0.05\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.force_full_page_ocr","title":"force_full_page_ocr <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>force_full_page_ocr: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.kind","title":"kind <code>class-attribute</code>","text":"<pre><code>kind: Literal['tesserocr'] = 'tesserocr'\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.lang","title":"lang <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>lang: List[str] = ['fra', 'deu', 'spa', 'eng']\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.model_config","title":"model_config <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>model_config = ConfigDict(extra='forbid')\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.TesseractOcrOptions.path","title":"path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>path: Optional[str] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions","title":"VlmPipelineOptions","text":"<p> Bases: <code>PaginatedPipelineOptions</code></p> <p>Attributes:</p> <ul> <li> <code>accelerator_options</code> (<code>AcceleratorOptions</code>) \u2013 </li> <li> <code>allow_external_plugins</code> (<code>bool</code>) \u2013 </li> <li> <code>artifacts_path</code> (<code>Optional[Union[Path, str]]</code>) \u2013 </li> <li> <code>create_legacy_output</code> (<code>bool</code>) \u2013 </li> <li> <code>document_timeout</code> (<code>Optional[float]</code>) \u2013 </li> <li> <code>enable_remote_services</code> (<code>bool</code>) \u2013 </li> <li> <code>force_backend_text</code> (<code>bool</code>) 
\u2013 </li> <li> <code>generate_page_images</code> (<code>bool</code>) \u2013 </li> <li> <code>generate_picture_images</code> (<code>bool</code>) \u2013 </li> <li> <code>images_scale</code> (<code>float</code>) \u2013 </li> <li> <code>vlm_options</code> (<code>Union[InlineVlmOptions, ApiVlmOptions]</code>) \u2013 </li> </ul>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.accelerator_options","title":"accelerator_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>accelerator_options: AcceleratorOptions = AcceleratorOptions()\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.allow_external_plugins","title":"allow_external_plugins <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>allow_external_plugins: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.artifacts_path","title":"artifacts_path <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>artifacts_path: Optional[Union[Path, str]] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.create_legacy_output","title":"create_legacy_output <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>create_legacy_output: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.document_timeout","title":"document_timeout <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>document_timeout: Optional[float] = None\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.enable_remote_services","title":"enable_remote_services <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>enable_remote_services: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.force_backend_text","title":"force_backend_text <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>force_backend_text: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.generate_page_images","title":"generate_page_images <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generate_page_images: bool = True\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.generate_picture_images","title":"generate_picture_images <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>generate_picture_images: bool = False\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.images_scale","title":"images_scale <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>images_scale: float = 1.0\n</code></pre>"},{"location":"reference/pipeline_options/#docling.datamodel.pipeline_options.VlmPipelineOptions.vlm_options","title":"vlm_options <code>class-attribute</code> <code>instance-attribute</code>","text":"<pre><code>vlm_options: Union[InlineVlmOptions, ApiVlmOptions] = 
SMOLDOCLING_TRANSFORMERS\n</code></pre>"},{"location":"usage/","title":"Usage","text":""},{"location":"usage/#conversion","title":"Conversion","text":""},{"location":"usage/#convert-a-single-document","title":"Convert a single document","text":"<p>To convert individual PDF documents, use <code>convert()</code>, for example:</p> <pre><code>from docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\" # PDF path or URL\nconverter = DocumentConverter()\nresult = converter.convert(source)\nprint(result.document.export_to_markdown()) # output: \"### Docling Technical Report[...]\"\n</code></pre>"},{"location":"usage/#cli","title":"CLI","text":"<p>You can also use Docling directly from your command line to convert individual files \u2014be they local paths or URLs\u2014 or whole directories.</p> <p><pre><code>docling https://arxiv.org/pdf/2206.01062\n</code></pre> You can also use \ud83e\udd5aSmolDocling and other VLMs via the Docling CLI: <pre><code>docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062\n</code></pre> This will use MLX acceleration on supported Apple Silicon hardware.</p> <p>To see all available options (export formats etc.), run <code>docling --help</code>. More details are in the CLI reference page.</p>"},{"location":"usage/#advanced-options","title":"Advanced options","text":""},{"location":"usage/#model-prefetching-and-offline-usage","title":"Model prefetching and offline usage","text":"<p>By default, models are downloaded automatically upon first usage. If you would prefer to explicitly prefetch them for offline use (e.g. in air-gapped environments), you can do that as follows:</p> <p>Step 1: Prefetch the models</p> <p>Use the <code>docling-tools models download</code> utility:</p> <pre><code>$ docling-tools models download\nDownloading layout model...\nDownloading tableformer model...\nDownloading picture classifier model...\nDownloading code formula model...\nDownloading easyocr models...\nModels downloaded into $HOME/.cache/docling/models.\n</code></pre> <p>Alternatively, models can be downloaded programmatically using <code>docling.utils.model_downloader.download_models()</code>.</p> 
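<p>For instance, a minimal programmatic prefetch could look as follows (a sketch; we assume here that the function returns the directory the models were placed in):</p> <pre><code>from docling.utils.model_downloader import download_models\n\n# Prefetch the default model set into the local cache.\n# Assumption: the returned value is the target directory.\nmodels_path = download_models()\nprint(f\"Models downloaded into {models_path}\")\n</code></pre> 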
<p>Step 2: Use the prefetched models</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\nartifacts_path = \"/local/path/to/models\"\n\npipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n</code></pre> <p>Or using the CLI:</p> <pre><code>docling --artifacts-path=\"/local/path/to/models\" FILE\n</code></pre> <p>Or using the <code>DOCLING_ARTIFACTS_PATH</code> environment variable:</p> <pre><code>export DOCLING_ARTIFACTS_PATH=\"/local/path/to/models\"\npython my_docling_script.py\n</code></pre>"},{"location":"usage/#using-remote-services","title":"Using remote services","text":"<p>The main purpose of Docling is to run local models without sharing any user data with remote services. Nevertheless, there are valid use cases for running parts of the pipeline via remote services, for example invoking OCR engines from cloud vendors or using hosted LLMs.</p> <p>In Docling we decided to allow such models, but we require the user to explicitly opt in to communicating with external services.</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\n\npipeline_options = PdfPipelineOptions(enable_remote_services=True)\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n</code></pre> <p>If <code>enable_remote_services=True</code> is not set, the system raises an <code>OperationNotAllowed()</code> exception.</p> 
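<p>The guard can be handled like any other exception. Below is a minimal sketch, assuming <code>OperationNotAllowed</code> is importable from <code>docling.exceptions</code> and using the API-based picture description as the remote feature:</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.pipeline_options import (\n PdfPipelineOptions,\n PictureDescriptionApiOptions,\n)\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.exceptions import OperationNotAllowed # assumed import path\n\n# A remote-calling option without the explicit opt-in:\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\npipeline_options.picture_description_options = PictureDescriptionApiOptions()\n# pipeline_options.enable_remote_services = True # <-- required to proceed\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\ntry:\n result = doc_converter.convert(\"https://arxiv.org/pdf/2408.09869\")\nexcept OperationNotAllowed as err:\n print(err)\n</code></pre> 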
<p>Note: This option only concerns the system sending user data to remote services. Control of pulling data (e.g. model weights) follows the logic described in Model prefetching and offline usage.</p>"},{"location":"usage/#list-of-remote-model-services","title":"List of remote model services","text":"<p>The options in this list require the explicit <code>enable_remote_services=True</code> when processing the documents.</p> <ul> <li><code>PictureDescriptionApiOptions</code>: Using vision models via API calls.</li> </ul>"},{"location":"usage/#adjust-pipeline-features","title":"Adjust pipeline features","text":"<p>The example file custom_convert.py contains multiple ways one can adjust the conversion pipeline and features.</p>"},{"location":"usage/#control-pdf-table-extraction-options","title":"Control PDF table extraction options","text":"<p>You can control whether table structure recognition should map the recognized structure back to PDF cells (default) or use text cells from the structure prediction itself. This can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\n\npipeline_options = PdfPipelineOptions(do_table_structure=True)\npipeline_options.table_structure_options.do_cell_matching = False # uses text cells predicted from table structure model\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n</code></pre> <p>Since docling 1.16.0: You can control which TableFormer mode you want to use. Choose between <code>TableFormerMode.FAST</code> (faster but less accurate) and <code>TableFormerMode.ACCURATE</code> (default) for better quality on difficult table structures.</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode\n\npipeline_options = PdfPipelineOptions(do_table_structure=True)\npipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE # use more accurate TableFormer model\n\ndoc_converter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n }\n)\n</code></pre>"},{"location":"usage/#impose-limits-on-the-document-size","title":"Impose limits on the document size","text":"<p>You can limit the file size and the number of pages to be processed per document:</p> <pre><code>from pathlib import Path\nfrom docling.document_converter import DocumentConverter\n\nsource = \"https://arxiv.org/pdf/2408.09869\"\nconverter = DocumentConverter()\nresult = converter.convert(source, max_num_pages=100, max_file_size=20971520)\n</code></pre>"},{"location":"usage/#convert-from-binary-pdf-streams","title":"Convert from binary PDF streams","text":"<p>You can convert PDFs from a binary stream instead of from the filesystem as follows:</p> <pre><code>from io import BytesIO\nfrom docling.datamodel.base_models import DocumentStream\nfrom docling.document_converter import DocumentConverter\n\nbuf = BytesIO(your_binary_stream)\nsource = DocumentStream(name=\"my_doc.pdf\", stream=buf)\nconverter = DocumentConverter()\nresult = converter.convert(source)\n</code></pre>"},{"location":"usage/#limit-resource-usage","title":"Limit resource usage","text":"<p>You can limit the CPU threads used by Docling by setting the environment variable <code>OMP_NUM_THREADS</code> accordingly. The default is 4 CPU threads.</p> 
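<p>The variable can be exported in the shell or set from Python, as long as this happens before Docling and the underlying inference libraries are imported, since thread pools are typically sized at import time. A minimal sketch:</p> <pre><code>import os\n\n# Must be set before importing docling (and the libraries it builds on).\nos.environ[\"OMP_NUM_THREADS\"] = \"2\"\n\nfrom docling.document_converter import DocumentConverter\n\nresult = DocumentConverter().convert(\"https://arxiv.org/pdf/2408.09869\")\n</code></pre> 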
"},{"location":"usage/#use-specific-backend-converters","title":"Use specific backend converters","text":"<p>Note</p> <p>This section discusses directly invoking a backend, i.e. using a low-level API. This should only be done when necessary. For most cases, using a <code>DocumentConverter</code> (high-level API) as discussed in the sections above should suffice\u00a0\u2014\u00a0and is the recommended way.</p> <p>By default, Docling will try to identify the document format to apply the appropriate conversion backend (see the list of supported formats). You can restrict the <code>DocumentConverter</code> to a set of allowed document formats, as shown in the Multi-format conversion example. Alternatively, you can also use the specific backend that matches your document content. For instance, you can use <code>HTMLDocumentBackend</code> for HTML pages:</p> <pre><code>import urllib.request\nfrom io import BytesIO\nfrom docling.backend.html_backend import HTMLDocumentBackend\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.datamodel.document import InputDocument\n\nurl = \"https://en.wikipedia.org/wiki/Duck\"\ntext = urllib.request.urlopen(url).read()\nin_doc = InputDocument(\n path_or_stream=BytesIO(text),\n format=InputFormat.HTML,\n backend=HTMLDocumentBackend,\n filename=\"duck.html\",\n)\nbackend = HTMLDocumentBackend(in_doc=in_doc, path_or_stream=BytesIO(text))\ndl_doc = backend.convert()\nprint(dl_doc.export_to_markdown())\n</code></pre>"},{"location":"usage/#chunking","title":"Chunking","text":"<p>You can chunk a Docling document using a chunker, such as a <code>HybridChunker</code>, as shown below (for more details check out this example):</p> <pre><code>from docling.document_converter import DocumentConverter\nfrom docling.chunking import HybridChunker\n\nconv_res = DocumentConverter().convert(\"https://arxiv.org/pdf/2206.01062\")\ndoc = conv_res.document\n\nchunker = HybridChunker(tokenizer=\"BAAI/bge-small-en-v1.5\") # set tokenizer as needed\nchunk_iter = chunker.chunk(doc)\n</code></pre> <p>An example chunk would look like this:</p> <pre><code>print(list(chunk_iter)[11])\n# {\n# \"text\": \"In this paper, we present the DocLayNet dataset. [...]\",\n# \"meta\": {\n# \"doc_items\": [{\n# \"self_ref\": \"#/texts/28\",\n# \"label\": \"text\",\n# \"prov\": [{\n# \"page_no\": 2,\n# \"bbox\": {\"l\": 53.29, \"t\": 287.14, \"r\": 295.56, \"b\": 212.37, ...},\n# }], ...,\n# }, ...],\n# \"headings\": [\"1 INTRODUCTION\"],\n# }\n# }\n</code></pre> 
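<p>Since the chunk fields mirror the example output above, the chunks can be inspected directly, e.g.:</p> <pre><code># Iterate over the chunks, printing each text snippet with its headings\nfor i, chunk in enumerate(chunker.chunk(doc)):\n print(f\"--- chunk {i} ---\")\n print(chunk.text[:80])\n print(chunk.meta.headings)\n</code></pre> 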
"},{"location":"usage/enrichments/","title":"Enrichment features","text":"<p>Docling allows enriching the conversion pipeline with additional steps which process specific document components, e.g. code blocks, pictures, etc. The extra steps usually require additional model executions, which may increase the processing time considerably. For this reason most enrichment models are disabled by default.</p> <p>The following table provides an overview of the default enrichment models available in Docling.</p> Feature Parameter Processed item Description Code understanding <code>do_code_enrichment</code> <code>CodeItem</code> See docs below. Formula understanding <code>do_formula_enrichment</code> <code>TextItem</code> with label <code>FORMULA</code> See docs below. Picture classification <code>do_picture_classification</code> <code>PictureItem</code> See docs below. Picture description <code>do_picture_description</code> <code>PictureItem</code> See docs below."},{"location":"usage/enrichments/#enrichments-details","title":"Enrichments details","text":""},{"location":"usage/enrichments/#code-understanding","title":"Code understanding","text":"<p>The code understanding step applies advanced parsing to code blocks found in the document. This enrichment model also sets the <code>code_language</code> property of the <code>CodeItem</code>.</p> <p>Model specs: see the <code>CodeFormula</code> model card.</p> <p>Example command line:</p> <pre><code>docling --enrich-code FILE\n</code></pre> <p>Example code:</p> <pre><code>from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_code_enrichment = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n</code></pre>"},{"location":"usage/enrichments/#formula-understanding","title":"Formula understanding","text":"<p>The formula understanding step will analyze the formulas in documents and extract their LaTeX representation. The HTML export functions of the DoclingDocument will leverage the formula and visualize the result using MathML syntax.</p> <p>Model specs: see the <code>CodeFormula</code> model card.</p> <p>Example command line:</p> <pre><code>docling --enrich-formula FILE\n</code></pre> <p>Example code:</p> <pre><code>from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_formula_enrichment = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n</code></pre>"},{"location":"usage/enrichments/#picture-classification","title":"Picture classification","text":"<p>The picture classification step classifies the <code>PictureItem</code> elements in the document with the <code>DocumentFigureClassifier</code> model. This model is specialized to understand the classes of pictures found in documents, e.g. different chart types, flow diagrams, logos, signatures, etc.</p> <p>Model specs: see the <code>DocumentFigureClassifier</code> model card.</p> <p>Example command line:</p> <pre><code>docling --enrich-picture-classes FILE\n</code></pre> <p>Example code:</p> <pre><code>from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.generate_picture_images = True\npipeline_options.images_scale = 2\npipeline_options.do_picture_classification = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n</code></pre> 
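<p>After conversion, the predicted classes can be read back from the document. A minimal sketch, assuming the predictions are stored in each <code>PictureItem</code>'s <code>annotations</code> list (as defined in docling-core):</p> <pre><code># Inspect the classification annotations attached to each picture\nfor picture in doc.pictures:\n for annotation in picture.annotations:\n print(picture.self_ref, annotation)\n</code></pre> 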
"},{"location":"usage/enrichments/#picture-description","title":"Picture description","text":"<p>The picture description step allows annotating a picture with a vision model. This is also known as a \"captioning\" task. The Docling pipeline can load and run models completely locally, as well as connect to remote APIs which support the chat template. Below are a few examples of how to use some common vision models and remote services.</p> <pre><code>from docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.datamodel.pipeline_options import PdfPipelineOptions\nfrom docling.datamodel.base_models import InputFormat\n\npipeline_options = PdfPipelineOptions()\npipeline_options.do_picture_description = True\n\nconverter = DocumentConverter(format_options={\n InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)\n})\n\nresult = converter.convert(\"https://arxiv.org/pdf/2501.17887\")\ndoc = result.document\n</code></pre>"},{"location":"usage/enrichments/#granite-vision-model","title":"Granite Vision model","text":"<p>Model specs: see the <code>ibm-granite/granite-vision-3.1-2b-preview</code> model card.</p> <p>Usage in Docling:</p> <pre><code>from docling.datamodel.pipeline_options import granite_picture_description\n\npipeline_options.picture_description_options = granite_picture_description\n</code></pre>"},{"location":"usage/enrichments/#smolvlm-model","title":"SmolVLM model","text":"<p>Model specs: see the <code>HuggingFaceTB/SmolVLM-256M-Instruct</code> model card.</p> <p>Usage in Docling:</p> <pre><code>from docling.datamodel.pipeline_options import smolvlm_picture_description\n\npipeline_options.picture_description_options = smolvlm_picture_description\n</code></pre>"},{"location":"usage/enrichments/#other-vision-models","title":"Other vision models","text":"<p>The option class <code>PictureDescriptionVlmOptions</code> allows using any other model from the Hugging Face Hub.</p> <pre><code>from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM\n prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n)\n</code></pre> 
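<p>For instance, using the SmolVLM model mentioned above (any compatible Hugging Face <code>repo_id</code> works the same way):</p> <pre><code>from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n\npipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n repo_id=\"HuggingFaceTB/SmolVLM-256M-Instruct\",\n prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n)\n</code></pre> 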
"},{"location":"usage/enrichments/#remote-vision-model","title":"Remote vision model","text":"<p>The option class <code>PictureDescriptionApiOptions</code> allows using models hosted on remote platforms, e.g. on local endpoints served by vLLM, Ollama and others, or cloud providers like IBM watsonx.ai, etc.</p> <p>Note: in most cases this option will send your data to the remote service provider.</p> <p>Usage in Docling:</p> <pre><code>from docling.datamodel.pipeline_options import PictureDescriptionApiOptions\n\n# Enable connections to remote services\npipeline_options.enable_remote_services=True # <-- this is required!\n\n# Example using a model running locally, e.g. via vLLM\n# $ vllm serve MODEL_NAME\npipeline_options.picture_description_options = PictureDescriptionApiOptions(\n url=\"http://localhost:8000/v1/chat/completions\",\n params=dict(\n model=\"MODEL NAME\",\n seed=42,\n max_completion_tokens=200,\n ),\n prompt=\"Describe the image in three sentences. Be concise and accurate.\",\n timeout=90,\n)\n</code></pre> <p>End-to-end code snippets for cloud providers are available in the examples section:</p> <ul> <li>IBM watsonx.ai</li> </ul>"},{"location":"usage/enrichments/#develop-new-enrichment-models","title":"Develop new enrichment models","text":"<p>Besides looking at the implementations of all the models listed above, the Docling documentation has a few examples dedicated to the implementation of enrichment models.</p> <ul> <li>Develop picture enrichment</li> <li>Develop formula enrichment</li> </ul>"},{"location":"usage/supported_formats/","title":"Supported formats","text":"<p>Docling can parse various document formats into a unified representation (Docling Document), which it can export to different formats too \u2014 check out Architecture for more details.</p> <p>Below you can find a listing of all supported input and output formats.</p>"},{"location":"usage/supported_formats/#supported-input-formats","title":"Supported input formats","text":"Format Description PDF DOCX, XLSX, PPTX Default formats in MS Office 2007+, based on Office Open XML Markdown AsciiDoc HTML, XHTML CSV PNG, JPEG, TIFF, BMP, WEBP Image formats <p>Schema-specific support:</p> Format Description USPTO XML XML format used by USPTO patents JATS XML XML format used by JATS articles Docling JSON JSON-serialized Docling Document"},{"location":"usage/supported_formats/#supported-output-formats","title":"Supported output formats","text":"Format Description HTML Both image embedding and referencing are supported Markdown JSON Lossless serialization of Docling Document Text Plain text, i.e. without Markdown markers Doctags"},{"location":"usage/vision_models/","title":"Vision models","text":"<p>The <code>VlmPipeline</code> in Docling allows you to convert documents end-to-end using a vision-language model.</p> <p>Docling supports vision-language models which output:</p> <ul> <li>DocTags (e.g. SmolDocling), the preferred choice</li> <li>Markdown</li> <li>HTML</li> </ul> <p>For running Docling using local models with the <code>VlmPipeline</code>:</p> CLIPython <pre><code>docling --pipeline vlm FILE\n</code></pre> <p>See also the example minimal_vlm_pipeline.py.</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n ),\n }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n</code></pre>"},{"location":"usage/vision_models/#available-local-models","title":"Available local models","text":"<p>By default, the vision-language models run locally. 
Docling lets you choose between the Hugging Face Transformers framework and MLX (for Apple devices with MPS acceleration).</p> <p>The following table reports the models currently available out-of-the-box.</p> Model instance Model Framework Device Num pages Inference time (sec) <code>vlm_model_specs.SMOLDOCLING_TRANSFORMERS</code> ds4sd/SmolDocling-256M-preview <code>Transformers/AutoModelForVision2Seq</code> MPS 1 102.212 <code>vlm_model_specs.SMOLDOCLING_MLX</code> ds4sd/SmolDocling-256M-preview-mlx-bf16 <code>MLX</code> MPS 1 6.15453 <code>vlm_model_specs.QWEN25_VL_3B_MLX</code> mlx-community/Qwen2.5-VL-3B-Instruct-bf16 <code>MLX</code> MPS 1 23.4951 <code>vlm_model_specs.PIXTRAL_12B_MLX</code> mlx-community/pixtral-12b-bf16 <code>MLX</code> MPS 1 308.856 <code>vlm_model_specs.GEMMA3_12B_MLX</code> mlx-community/gemma-3-12b-it-bf16 <code>MLX</code> MPS 1 378.486 <code>vlm_model_specs.GRANITE_VISION_TRANSFORMERS</code> ibm-granite/granite-vision-3.2-2b <code>Transformers/AutoModelForVision2Seq</code> MPS 1 104.75 <code>vlm_model_specs.PHI4_TRANSFORMERS</code> microsoft/Phi-4-multimodal-instruct <code>Transformers/AutoModelForCausalLM</code> CPU 1 1175.67 <code>vlm_model_specs.PIXTRAL_12B_TRANSFORMERS</code> mistral-community/pixtral-12b <code>Transformers/AutoModelForVision2Seq</code> CPU 1 1828.21 <p>Inference time is computed on a MacBook M3 Max using the example page <code>tests/data/pdf/2305.03393v1-pg9.pdf</code>. The comparison is done with the example compare_vlm_models.py.</p> <p>To choose the model, the code snippet above can be extended as follows:</p> <pre><code>from docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import DocumentConverter, PdfFormatOption\nfrom docling.pipeline.vlm_pipeline import VlmPipeline\nfrom docling.datamodel.pipeline_options import (\n VlmPipelineOptions,\n)\nfrom docling.datamodel import vlm_model_specs\n\npipeline_options = VlmPipelineOptions(\n vlm_options=vlm_model_specs.SMOLDOCLING_MLX, # <-- change the model here\n)\n\nconverter = DocumentConverter(\n format_options={\n InputFormat.PDF: PdfFormatOption(\n pipeline_cls=VlmPipeline,\n pipeline_options=pipeline_options,\n ),\n }\n)\n\ndoc = converter.convert(source=\"FILE\").document\n</code></pre>"},{"location":"usage/vision_models/#other-models","title":"Other models","text":"<p>Other models can be configured by directly providing the Hugging Face <code>repo_id</code>, the prompt, and a few more options.</p> <p>For example:</p> <pre><code>from docling.datamodel.accelerator_options import AcceleratorDevice\nfrom docling.datamodel.pipeline_options import VlmPipelineOptions\nfrom docling.datamodel.pipeline_options_vlm_model import (\n InferenceFramework,\n InlineVlmOptions,\n ResponseFormat,\n TransformersModelType,\n)\n\npipeline_options = VlmPipelineOptions(\n vlm_options=InlineVlmOptions(\n repo_id=\"ibm-granite/granite-vision-3.2-2b\",\n prompt=\"Convert this page to markdown. Do not miss any text and only output the bare markdown!\",\n response_format=ResponseFormat.MARKDOWN,\n inference_framework=InferenceFramework.TRANSFORMERS,\n transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,\n supported_devices=[\n AcceleratorDevice.CPU,\n AcceleratorDevice.CUDA,\n AcceleratorDevice.MPS,\n ],\n scale=2.0,\n temperature=0.0,\n )\n)\n</code></pre>"},{"location":"usage/vision_models/#remote-models","title":"Remote models","text":"<p>In addition to local models, the <code>VlmPipeline</code> can offload the inference to a remote service hosting the models. Many remote inference services are supported; the key requirement is that they offer an OpenAI-compatible API. This includes vLLM, Ollama, etc.</p> 
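<p>A minimal sketch of such a setup (the URL and model name are placeholders for your own OpenAI-compatible endpoint, here an Ollama-style local server; <code>ApiVlmOptions</code> fields as in the API example referenced below):</p> <pre><code>from docling.datamodel.pipeline_options import VlmPipelineOptions\nfrom docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat\n\npipeline_options = VlmPipelineOptions(\n enable_remote_services=True, # <-- required for remote inference\n vlm_options=ApiVlmOptions(\n url=\"http://localhost:11434/v1/chat/completions\", # e.g. a local Ollama endpoint\n params=dict(model=\"granite3.2-vision:2b\"),\n prompt=\"OCR the full page to markdown.\",\n timeout=90,\n response_format=ResponseFormat.MARKDOWN,\n ),\n)\n</code></pre> 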
<p>More examples of how to connect to remote inference services can be found in the following examples:</p> <ul> <li>vlm_pipeline_api_model.py</li> </ul>"}]}